mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

No TE being quantified #160

Closed conery closed 6 months ago

conery commented 7 months ago

Hi --

I'm trying to run TEtranscripts via singularity on my local cluster. According to the log file it looks like the program is finding the two GTF files, but then it prints an error message when it opens the BAM file:

INFO  @ Tue, 07 Nov 2023 11:21:26: Processing GTF files ... 

INFO  @ Tue, 07 Nov 2023 11:21:26: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
INFO  @ Tue, 07 Nov 2023 11:22:46: Done building gene index ...... 

INFO  @ Tue, 07 Nov 2023 11:22:46: Building TE index ....... 

INFO  @ Tue, 07 Nov 2023 11:22:48: Done building TE index ...... 

INFO  @ Tue, 07 Nov 2023 11:22:48: 
Reading sample file ... 

[E::idx_find_and_load] Could not retrieve index file for '/bam/1__herm_nohs1_S1_L008_R1_001.fastq.gzAligned.out.bam'

After it prints the message the program keeps running:

1000000 alignments  processed. 
3000000 alignments  processed. 
4000000 alignments  processed. 
...

But in the end it says it the TE counts are 0:

uniq te counts = 0 
.......start iterative optimization .......... 
TE counts total 0.0 
Gene counts total 24803976.43102119 

FWIW I tried using TEcount on just one of the BAM files but I'm getting the same behavior.

olivertam commented 7 months ago

Hi,

The index error message has no impact on TEtranscripts (see #82). Based on the output, I'm suspecting that there might be a mismatch between the TE and genome build used for alignments. Which genome build are you using, and from where (e.g. are you using hg38 from UCSC, or GRCh38 from Ensembl)? You might need a different TE GTF.

Thanks.

conery commented 7 months ago

Thanks for that quick reply! The genome is from WormBase, annotations in caenorhabditis_elegans.PRJNA13758.WBPS18.canonical_geneset.gtf. The TEs we downloaded from your lab, ce11_rmsk_TE.gtf.

Our plan was to make sure we could run TEtranscripts using the WB data set, but we'd prefer to use our own genome assembly. Do you have instructions for how to build our own TE reference?

olivertam commented 7 months ago

Hi,

There are ways to build your own TE reference, which would involve running a TE annotator on your genome (e.g. RepeatMasker), and then converting the output into a TE GTF using our custom perl script.

However, you shouldn't need to do this, as we now have a TE GTF that should work with the WormBase C. elegans genome.

More detailed explanation (feel free to ignore): The reason why the ce11 TE GTF doesn't work with your genome alignment is because UCSC adds chr to their chromosome names, but WormBase does not (e.g. I instead of chr I). Although ce11 and PRJNA13758 (also known as WBcel235) are actually the same genomic sequences, the difference in chromosomal nomenclature means that TEtranscripts can't match the alignment with the TE annotation, and thus why the genes were annotated, but not the TE.

Please let me know if you have other questions or issues.

Thanks.

conery commented 7 months ago

Thanks! I'll let you know how it goes.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days