mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Error with TE GTF while running TEtranscripts #107

Closed (jpcartailler closed this 2 years ago)

jpcartailler commented 2 years ago

Greetings and thank you for not only releasing this method, but providing help here! While running TEtranscripts, I ran into an error building the TE index.

I am running TEtranscripts 2.2.1 in a Singularity container, imported from a Docker image I found on Docker Hub.

Here is the output tail:

... 
INFO  @ Wed, 02 Feb 2022 09:15:16: Building TE index ....... 

chr1    mmusculus_mm10_rmsk     exon    14314168        14314208        192     -       .       gene_id "MIRb"; transcript_id "MIRb_dup80"; familychr1       mmusculus_mm10_rmsk     exon    78283325        78283526        1233    -       .       gene_id "B3"; transcript_id "B3_dup3221"; family_id "B2"; class_id "SINE"; 
TE GTF format error! There is no annotation at line 14744. 
Error in building TE index 

The GTF files I used are as follows (I zipped them up and shared them publicly, and I have included the head of each further below):

For the error "TE GTF format error! There is no annotation at line 14744.", line 14744 in the TE GTF file looks like the rest of them as far as I can tell. Notes on how I built this file are below.

chr1    mmusculus_mm10_rmsk exon    13747982    13748092    819 -   .   gene_id "Lx7"; transcript_id "Lx7_dup225"; family_id "L1"; class_id "LINE";

I'm not sure why the error is preceded with an entry from somewhere else in the GTF (MIRb_dup80, on line 15318).

Any advice on how to approach this problem would be appreciated. Thanks!


TE GTF generated by:

perl makeTEgtf.pl -c 6 -s 7 -e 8 -o 10 -t 11 -n mmusculus_mm10_rmsk -f 13 -C 12 -S 2 mmusculus_mm10_rmsk > mmusculus_mm10_rmsk.gtf

TE GTF head:

chr1    mmusculus_mm10_rmsk exon    67108753    67108881    239 +   .   gene_id "RLTR17B_Mm"; transcript_id "RLTR17B_Mm"; family_id "ERVK"; class_id "LTR";
chr1    mmusculus_mm10_rmsk exon    8386826 8389555 8310    -   .   gene_id "Lx2"; transcript_id "Lx2"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    16776989    16779051    32159   +   .   gene_id "L1_Mus1"; transcript_id "L1_Mus1"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    33554409    33554640    216 -   .   gene_id "B4"; transcript_id "B4"; family_id "B4"; class_id "SINE";
chr1    mmusculus_mm10_rmsk exon    50329972    50335398    28308   +   .   gene_id "L1Md_T"; transcript_id "L1Md_T"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    83885791    83886358    4868    -   .   gene_id "L1Md_T"; transcript_id "L1Md_T_dup1"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    109051333   109052326   8314    +   .   gene_id "L1Md_T"; transcript_id "L1Md_T_dup2"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    125828928   125829476   3167    +   .   gene_id "Lx5"; transcript_id "Lx5"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    167772061   167772244   493 -   .   gene_id "L1M2"; transcript_id "L1M2"; family_id "L1"; class_id "LINE";
chr1    mmusculus_mm10_rmsk exon    184549327   184549452   584 +   .   gene_id "B3A"; transcript_id "B3A"; family_id "B2"; class_id "SINE";
chr1    mmusculus_mm10_rmsk exon    3145674 3145796 314 -   .   gene_id "RMER16A3"; transcript_id "RMER16A3"; family_id "ERVK"; class_id "LTR";
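
For readers unfamiliar with the flags in the makeTEgtf.pl command above, the column numbers match the standard UCSC rmsk table layout (2 = swScore, 6 = genoName, 7 = genoStart, 8 = genoEnd, 10 = strand, 11 = repName, 12 = repClass, 13 = repFamily). The following is only a rough Python sketch of that mapping, not the actual logic of makeTEgtf.pl. The function name `rmsk_to_gtf` is hypothetical, the `_dupN` suffix rule is inferred from the head output shown above, and coordinates are copied verbatim because this thread does not say whether the script adjusts them.

```python
from collections import defaultdict

# 1-based column numbers, taken from the flags in the command above
# (-S 2, -c 6, -s 7, -e 8, -o 10, -t 11, -C 12, -f 13).
COLS = {"score": 2, "chrom": 6, "start": 7, "end": 8,
        "strand": 10, "name": 11, "class": 12, "family": 13}


def rmsk_to_gtf(rows, source="mmusculus_mm10_rmsk"):
    """Yield one GTF line per tab-separated rmsk row (illustrative only)."""
    seen = defaultdict(int)  # how many times each repeat name has appeared
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        col = {key: fields[index - 1] for key, index in COLS.items()}

        # Assumption inferred from the sample output: the first copy keeps
        # the bare name, later copies get _dup1, _dup2, ...
        count = seen[col["name"]]
        seen[col["name"]] += 1
        transcript = col["name"] if count == 0 else f'{col["name"]}_dup{count}'

        attributes = (
            f'gene_id "{col["name"]}"; transcript_id "{transcript}"; '
            f'family_id "{col["family"]}"; class_id "{col["class"]}";'
        )
        yield "\t".join([
            col["chrom"], source, "exon", col["start"], col["end"],
            col["score"], col["strand"], ".", attributes,
        ])
```

Each emitted line has exactly nine tab-separated fields, with all four TE attributes in the last one.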

Gene GTF head:

##description: evidence-based annotation of the mouse genome (GRCm38), version M17 (Ensembl 92)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2018-03-22
chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_name "RP23-271O17.1"; level 2; havana_gene "OTTMUSG00000049935.1";
chr1    HAVANA  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_name "RP23-271O17.1-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1    HAVANA  exon    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "RP23-271O17.1"; transcript_type "TEC"; transcript_name "RP23-271O17.1-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1    ENSEMBL gene    3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842.1"; gene_type "snRNA"; gene_name "Gm26206"; level 3;
chr1    ENSEMBL transcript      3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842.1"; transcript_id "ENSMUST00000082908.1"; gene_type "snRNA"; gene_name "Gm26206"; transcript_type "snRNA"; transcript_name "Gm26206-201"; level 3; transcript_support_level "NA"; tag "basic";

olivertam commented 2 years ago

Hi,

Thank you for your interest in the software. It does appear that line 15318 in the GTF got concatenated with the next line, and was partially truncated. That's why the indexing failed. That does appear to be the only line where this occurred. I'm wondering if there was an issue in the original source file you used to create the GTF. Did you download this from UCSC, or was it custom generated?
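
A quick way to locate a fused or truncated record like this is to scan the GTF for lines that do not have exactly nine tab-separated fields or that lack any of the four TE attributes. The following is a minimal sketch, not part of TEtranscripts. The function name `find_bad_lines` is hypothetical, and the required-attribute list is an assumption based on the GTF lines shown in this thread.

```python
import re

# Attributes present on every well-formed TE GTF line shown in this thread.
REQUIRED = ("gene_id", "transcript_id", "family_id", "class_id")
ATTR_RE = re.compile(r'(\w+) "([^"]*)";')


def find_bad_lines(lines):
    """Return (line_number, reason) for each line that looks malformed."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # this sketch skips blank lines and comment headers
        fields = line.split("\t")
        if len(fields) != 9:
            bad.append((number, f"expected 9 tab-separated fields, found {len(fields)}"))
            continue
        attributes = dict(ATTR_RE.findall(fields[8]))
        missing = [key for key in REQUIRED if key not in attributes]
        if missing:
            bad.append((number, "missing attributes: " + ", ".join(missing)))
    return bad
```

To use it on a file, something like `with open("mmusculus_mm10_rmsk.gtf") as handle: print(find_bad_lines(handle))` reports line numbers in the original, unsorted file.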

Thanks.

jpcartailler commented 2 years ago

Thanks for the quick response!

I generated the GTF from the UCSC-generated "rmsk" table: http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1262077703_u7NggORS5ROmC9L2CmzDdKpl0WJG&clade=mammal&org=Mouse&db=0&hgta_group=varRep&hgta_track=rmsk&hgta_table=rmsk&hgta_regionType=genome&position=&hgta_outputType=primaryTable&hgta_outFileName=mmusculus_mm10_rmsk.gz

I'm not seeing line 14744 concatenated or truncated. Here is what I see on line 14744: image

Sorry if I misunderstood. Thanks!

olivertam commented 2 years ago

Hi,

Sorry, line 14744 is the line number after the GTF is sorted by chromosome, start and end (part of the indexing process). The line in the original GTF is 15318. I have edited my previous response. If you are interested in mm10, we do have a GTF for it already.
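
To translate a line number reported after sorting back to a line in the original file, the sort can be reproduced while remembering where each record came from. The following is a rough sketch under assumptions: the function name `original_line_numbers` is hypothetical, chromosomes are compared as plain strings, and the exact sort key and tie-breaking used inside TEtranscripts may differ, so the mapping may be off for some inputs.

```python
def original_line_numbers(lines):
    """Return original 1-based line numbers ordered by (chromosome, start, end)."""
    records = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip comment headers
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not enough columns to hold coordinates
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            continue  # coordinates are not integers
        records.append((fields[0], start, end, number))
    records.sort()  # the original line number acts as the final tie-breaker
    return [number for _, _, _, number in records]
```

With `order = original_line_numbers(open("mmusculus_mm10_rmsk.gtf"))`, the entry `order[14744 - 1]` gives the original line behind sorted line 14744 under these assumptions.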

Thanks.

jpcartailler commented 2 years ago

Ah, I didn't realize it was re-sorting the GTF, which makes sense now. Thank you for your quick feedback and solutions. I'll definitely check out the pre-built GTF, but I just rebuilt ours to make sure we have a functioning one in case we want to fine-tune what we get out of UCSC's data.