mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

TEcount and 10X Genomics single-cell data support? #60

Closed SolKatzman closed 4 years ago

SolKatzman commented 4 years ago

This is a wish list item.

Have you given any thought to TEcount support for 10X Genomics single-cell data?

As single-cell work becomes more commonplace, it would be nice to incorporate TEcount analysis.

As you may know, the 10X 3' chemistry generates read sequences (in the Illumina R2 fastq) that are associated with cellId and UMI (unique molecular identifier) barcode data (in the Illumina R1 fastq).
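To illustrate the R1 layout described above, here is a minimal sketch that splits an R1 sequence into its cell barcode and UMI. The function name `split_r1` is hypothetical, and the default lengths (a 16 bp cell barcode followed by a 12 bp UMI) are an assumption matching the 10X 3' v3 chemistry; the v2 chemistry uses a 10 bp UMI, so `umi_len` would need to change there.

```python
def split_r1(seq, barcode_len=16, umi_len=12):
    """Split a 10X R1 sequence into (cell_barcode, umi).

    Assumed layout: cell barcode first, UMI immediately after.
    Any bases past the UMI are ignored.
    """
    if len(seq) < barcode_len + umi_len:
        raise ValueError("R1 sequence is shorter than barcode + UMI")
    cell_barcode = seq[:barcode_len]
    umi = seq[barcode_len:barcode_len + umi_len]
    return cell_barcode, umi


# Made-up R1 sequence: 16 bases of barcode, 12 bases of UMI, 4 trailing bases.
barcode, umi = split_r1("ACGTACGTACGTACGT" + "TTTTGGGGCCCC" + "AAAA")
```

The matching R2 read (same read name) carries the cDNA sequence that would actually be aligned.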

The standard 10X cellranger analysis generates counts for genes from a gene model, with duplicates removed and each distinct cell getting its own count. Along the way a STAR-generated, position-sorted BAM file is produced (I believe with tags indicating the cell ID).

At a minimum, the STAR parameters would probably have to be modified to allow for highly multi-mapped reads to be output. But beyond that, some clever coding would be necessary to assign UMI-aware, dup-removed counts of TE_rmsk elements to the cellIds. For each TE, a row vector would have to be added to the standard (gene-model based) matrix of counts.
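The counting step described above could be sketched roughly as follows. This is not the TEcount algorithm: it assumes each alignment has already been reduced to a `(cell_id, umi, te_name)` record, and it ignores the harder question of how to distribute multi-mapped reads among TE copies. The function names `count_te_umis` and `to_rows` are hypothetical.

```python
from collections import defaultdict


def count_te_umis(records):
    """Count distinct UMIs per TE per cell.

    records: iterable of (cell_id, umi, te_name) tuples.
    A repeated (cell_id, umi, te_name) is treated as a duplicate
    of the same molecule and counted once.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for cell_id, umi, te_name in records:
        key = (cell_id, umi, te_name)
        if key in seen:
            continue  # same molecule seen again: skip the duplicate
        seen.add(key)
        counts[te_name][cell_id] += 1
    return counts


def to_rows(counts, cell_ids):
    """Build one row vector per TE, with columns ordered as in cell_ids."""
    return {
        te_name: [per_cell.get(cell_id, 0) for cell_id in cell_ids]
        for te_name, per_cell in counts.items()
    }


# Made-up records for illustration only.
records = [
    ("cellA", "UMI1", "L1HS"),
    ("cellA", "UMI1", "L1HS"),  # duplicate of the record above
    ("cellA", "UMI2", "L1HS"),
    ("cellB", "UMI1", "AluY"),
]
rows = to_rows(count_te_umis(records), ["cellA", "cellB"])
# rows == {"L1HS": [2, 0], "AluY": [0, 1]}
```

Each resulting row could then be appended to the gene-based count matrix, provided the cell order matches that matrix's columns.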

I am curious to know if this is anything you have plans for.

Thanks, Sol Katzman, UCSC Genomics Institute.

olivertam commented 4 years ago

Hi Sol,

We are definitely working on this, and are currently optimizing the algorithm. It will probably be a separate piece of software under our TEToolkit suite, so keep watching this space. Thanks also for your input. I have passed it on to the people working on this.

All the best.