mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
229 stars 30 forks source link

Losing many reads after quantification with TEcount #193

Closed sanhe374 closed 4 months ago

sanhe374 commented 4 months ago

When running TEcount on my aligned files (aligned using) STAR and important the count data into R, I find that a lot of reads have not been counted from the data based on the 30-40% smaller library size as compared to the aligned files. I am wondering a bit of this is common, that some aligned reads appear to not be counted and why that is? I am working on total-RNAswq data and the alignment rate is >90% for my samples.

olivertam commented 4 months ago

Hi,

Thanks for your interest in the software. Could you elaborate on "a lot of reads have not been counted"? Are you basing that count on the number of alignments in your BAM file, or the number of reads (based on distinct read names)? If the latter, what gene and TE annotations are you using? It is possible there is a mismatch between the genome build and the annotation, which could drive down the number of reads. There is also the possibility that you have a lot of structural RNA (e.g. tRNA, rRNA) in your library. Those won't be annotated to either genes or TE in most cases. If you want to provide the log output of your run, it might provide some clue as well.

Thanks.

sanhe374 commented 4 months ago

Thank you for your quick reply. I aligned the files using STAR. I created an index using these two files GCA_000001405.15_GRCh38_no_alt_analysis_set.fna and hg38.refGene.gtf. After the alignment I got an alignment rate >90%. However, in TEcount I see that I have used GCA_000001405.15_GRCh38_no_alt_analysis_set.fna and hg38.refGene.gtf and hg38_rmsk_TE.gtf (https://github.com/mhammell-laboratory/TEtranscripts/issues/173).

Could this explain why I lose some counts?

I do not think I got a log file but only the count table as an output.

olivertam commented 4 months ago

Hi,

The log output is what the program outputs during its run. If you're running it on a cluster, it would go into a file for the STDERR. Otherwise, it would just print to your screen. Could you provide the command line you used in TEcount? One thing that you might want to also consider is that there are a lot of "transcription" of genomic regions where it's unclear if it's really a gene. Annotations such as Ensembl (and by extension GENCODE) tend to be highly inclusive, whereas refGene may or may not treat them as such (as they might not have sufficient level of evidence).

Thanks.

sanhe374 commented 4 months ago

Here is the command line I have used TEcount -b $i --GTF $genome_anno --TE $anno_te --stranded reverse --project HMATEcount$sample_name --outdir $outpath/TEcount --mode multi. I do not find any log output and it was nothing printed on the screen.

I would like to create a STAR index using the same annotations used in TEcounts, do I need to combine these two gtf files used in TEcounts to create the STAR index?

olivertam commented 4 months ago

Hi,

Thanks for your response. You typically do not need to add the TE GTF to the STAR index for bulk RNA seq, as there typically isn't much splicing information in the TE GTF (which is what STAR needs the GTF to build their known splice junction database). Without a deep dive into your data, I can't immediately explain your observation. Please confirm that your library is reverse-stranded, but otherwise, there are probably a lot of "transcripts" that aren't part of your gene or TE annotation, and thus mappable but not annotatable.

Thanks.

sanhe374 commented 4 months ago

Yes, the library is reverse stranded (read 1 is from the reverse complement to the transcript and read 2 is from the original transcript.

You do not think there is an issue that I built the STAR index on another gene annotation (still with chr) and then used another one for the TE quantification? The gene annotation I have used for TEcount is much larger it seems than the one I used for the indexing so it could be that it more "inclusive".

olivertam commented 4 months ago

Hi,

The fact that you got 90% mapping rate suggests that the STAR index served its purpose (as you would typically lose alignment otherwise). I just looked through some of our datasets, and we also see around 30-40% reduction in annotated vs total mapped reads. I think this is inherent in what we are capturing from total RNA-seq, where there are a good chunk of things (e.g. structural RNA, short promoter-associated transcripts, pseudogenes etc) that are not typically included in gene or TE annotations.

Thanks.

sanhe374 commented 4 months ago

Thank you very much for all your help. It is very valuable to be able to actually communicate with you guys that make these tools.