mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

TEcount pauses for a long time at the TE index building stage #206

Closed by iriirica 3 weeks ago

iriirica commented 3 weeks ago

Hello, thank you for the great tool! I am trying to run TEcount, but I have encountered an issue.

Here is my log file. The "Building TE index" step has been running for 6 days.

Fri Oct 25 19:20:18 CST 2024
.../TEtranscripts/TEtranscripts-2.2.3/bin/TEcount:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('TEtranscripts==2.2.3', 'TEcount')
INFO  @ Fri, 25 Oct 2024 19:20:19: 
# ARGUMENTS LIST:
# name = OM24090210
# BAM file = .../08.repeat_expression/01.data/OM24090210Aligned.sortedByCoord.out.bam
# GTF file = .../08.repeat_expression/01.data/Omar.gtf 
# TE file = .../08.repeat_expression/01.data/Omar_TE_hs.gtf 
# multi-mapper mode = multi 
# stranded = no 
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Fri, 25 Oct 2024 19:20:19: Processing GTF files ... 

INFO  @ Fri, 25 Oct 2024 19:20:19: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
INFO  @ Fri, 25 Oct 2024 19:22:38: Done building gene index ...... 

INFO  @ Fri, 25 Oct 2024 19:23:21: Building TE index ....... 

Here is the command I used:

TEcount --sortByPos --format BAM --mode multi -b $bam --GTF $genegtf --TE $TEgtf --project $i

The $TEgtf file was generated using the makeTEgtf.pl script you provided, and its size is 3.5 GB. The genome size I am working with is 6.2 GB.

Omar_Chr6       user_provided   exon    1       102     .       -       .       gene_id "TE1"; transcript_id "TE1"; family_id "TcMar-Tc1"; class_id "DNA"; gene_name "TE1:TE";
Omar_Chr4       user_provided   exon    3       1527    .       -       .       gene_id "TE2"; transcript_id "TE2"; family_id "PiggyBac"; class_id "DNA"; gene_name "TE2:TE";
Omar_Chr7       user_provided   exon    9       40      .       +       .       gene_id "TE3"; transcript_id "TE3"; family_id "MITE"; class_id "DNA"; gene_name "TE3:TE";
Omar_Chr11      user_provided   exon    18      76      .       +       .       gene_id "TE4"; transcript_id "TE4"; family_id "unknown"; class_id "LTR"; gene_name "TE4:TE";
Omar_Chr12      user_provided   exon    26      510     .       -       .       gene_id "TE5"; transcript_id "TE5"; family_id "Gypsy"; class_id "LTR"; gene_name "TE5:TE";

Could you please help me understand what might be causing this extended runtime? Thank you for your assistance!

olivertam commented 3 weeks ago

Hi,

Thank you for your interest in the software. It is unclear from your TE GTF excerpt whether every entry has its own unique gene_id. If so, that significantly slows down the TE index building (up to days). Could you confirm whether each entry has a unique gene_id value? If you prefer to use this GTF as-is, we would recommend TElocal instead, with the index pre-built using this script. Please note that it would still take many days to build that index.
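One quick way to answer that question is to count how many entries share each gene_id. The following is only a minimal sketch, not part of TEtranscripts: the function names are made up for illustration, and the sample lines are the entries quoted above, retyped with tab separators on the assumption that the original file is tab-delimited.

```python
import re
from collections import Counter

# Matches the gene_id attribute in column 9 of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_id_counts(lines):
    """Count how many GTF entries carry each gene_id value.

    `lines` can be any iterable of strings, including an open file
    handle, so a multi-gigabyte GTF is streamed line by line.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        match = GENE_ID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def every_entry_unique(counts):
    """True when no gene_id is shared by two or more entries."""
    return bool(counts) and all(n == 1 for n in counts.values())


# Three entries from the excerpt in this issue (tab separators assumed).
SAMPLE = [
    'Omar_Chr6\tuser_provided\texon\t1\t102\t.\t-\t.\t'
    'gene_id "TE1"; transcript_id "TE1"; family_id "TcMar-Tc1"; class_id "DNA"; gene_name "TE1:TE";',
    'Omar_Chr11\tuser_provided\texon\t18\t76\t.\t+\t.\t'
    'gene_id "TE4"; transcript_id "TE4"; family_id "unknown"; class_id "LTR"; gene_name "TE4:TE";',
    'Omar_Chr12\tuser_provided\texon\t26\t510\t.\t-\t.\t'
    'gene_id "TE5"; transcript_id "TE5"; family_id "Gypsy"; class_id "LTR"; gene_name "TE5:TE";',
]

if __name__ == "__main__":
    counts = gene_id_counts(SAMPLE)
    print("entries:", sum(counts.values()))
    print("distinct gene_id values:", len(counts))
    print("every entry has its own gene_id:", every_entry_unique(counts))
```

For a real file, pass an open handle instead of the sample list, e.g. `gene_id_counts(open(path))`. If the number of distinct gene_id values is close to the number of entries, the GTF is effectively per-copy and the slow index build described above is the likely outcome.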

Thanks.

iriirica commented 3 weeks ago

To distinguish TEs at different genomic loci, I renamed all the TE entries before generating the TE GTF file, so each gene_id is unique. I'll try running TElocal_indexer separately, thanks for your suggestion! However, I am still a bit confused. Isn't the gene_id supposed to be unique for each TE entry? If several TE entries share the same gene_id, how can I tell from the results which one is expressed?

olivertam commented 3 weeks ago

Hi,

The concept behind TEtranscripts is that we measure TE expression at the sub-family level, i.e. across all copies that share the same consensus sequence in repeat libraries such as Repbase and Dfam. Thus, the gene_id corresponds to that sub-family/consensus. We find that this enables quantification of TEs at a level that is still biologically meaningful (most studies assess TE expression based on consensus mapping or qPCR with degenerate primers) and allows differential analysis with sufficient counts.
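To illustrate what sub-family-level quantification means, here is a rough sketch that sums per-copy counts into one value per gene_id. It does not reproduce TEcount's internals (multi-mapper handling in particular is not modeled). The sub-family names, copy names, coordinates, and counts are all hypothetical, and using transcript_id as the per-copy identifier is an assumption made for this example.

```python
import re
from collections import defaultdict

# Extracts key "value" pairs from the attribute column of a GTF line.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(gtf_line):
    """Return the attributes in column 9 of a tab-delimited GTF line."""
    fields = gtf_line.rstrip("\n").split("\t")
    return dict(ATTR_RE.findall(fields[8]))


def copy_to_subfamily(gtf_lines):
    """Map each copy (transcript_id) to its sub-family (gene_id)."""
    mapping = {}
    for line in gtf_lines:
        attrs = parse_attributes(line)
        mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def collapse_counts(per_copy_counts, mapping):
    """Sum per-copy read counts into one total per sub-family."""
    totals = defaultdict(int)
    for copy_id, n_reads in per_copy_counts.items():
        totals[mapping[copy_id]] += n_reads
    return dict(totals)


# Hypothetical GTF entries: two copies of "SubfamA", one of "SubfamB".
EXAMPLE_GTF = [
    'chr_example\tuser_provided\texon\t1\t102\t.\t-\t.\t'
    'gene_id "SubfamA"; transcript_id "SubfamA_dup1"; family_id "Gypsy"; class_id "LTR";',
    'chr_example\tuser_provided\texon\t500\t900\t.\t+\t.\t'
    'gene_id "SubfamA"; transcript_id "SubfamA_dup2"; family_id "Gypsy"; class_id "LTR";',
    'chr_example\tuser_provided\texon\t2000\t2300\t.\t+\t.\t'
    'gene_id "SubfamB"; transcript_id "SubfamB_dup1"; family_id "PiggyBac"; class_id "DNA";',
]

# Hypothetical per-copy read counts.
EXAMPLE_COUNTS = {"SubfamA_dup1": 7, "SubfamA_dup2": 5, "SubfamB_dup1": 3}

if __name__ == "__main__":
    mapping = copy_to_subfamily(EXAMPLE_GTF)
    print(collapse_counts(EXAMPLE_COUNTS, mapping))
```

With a shared gene_id, reads from every copy of a sub-family add up to one count, which is why the result cannot say which individual locus is expressed. Locus-level questions are the case where TElocal was suggested earlier in this thread.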

Thanks.

iriirica commented 3 weeks ago

I think I know how to analyze my TEs now. Thanks again for your help!