mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Error when using personal TE GTF file #6

Closed casimonet closed 7 years ago

casimonet commented 8 years ago

Hi,

I am trying to use TEtranscript for an organism for which you don't provide the TE annotation file. I build the GTF file myself, using rmsk tables from UCSC, assigning repName to gene_ID attribute, a unique TE identifier to transcript_ID and repFamily and repClass to family_ID and class_ID. My GTF file looks exactly like the ones you provide.

Yet, I get an error when running TEtranscript: "Error in building gene/TE index"

I know the problem does not come from my command since I already used TEtranscript for human and it worked perfectly.

Is there any specification about the processing of these TE GTF file that I should know about ?

Thank you, CAnna

molikd commented 7 years ago

Hello,

It's tough to say why the TE GTF file is not loading without looking at a the format of the TE GTF file. Possibly it could be formatting, or it is also possible that the file has the wrong permissions, or some other problem?

Do you receive any other output during the build step of TETranscripts? what is the full output of the run?

If you could also head the files as well:

dmolik@computer:~$ head -n 5 hg19_refgene.gtf
chr6_mcf_hap5   refGene exon    4604235 4605361 .   -   .   gene_id "COL11A2"; transcript_id "NM_080679"; exon_number "1"; exon_id "NM_080679.1"; gene_name "COL11A2";
chr6_mcf_hap5   refGene 3UTR    4604235 4605220 .   -   .   gene_id "COL11A2"; transcript_id "NM_080679"; exon_number "1"; exon_id "NM_080679.1"; gene_name "COL11A2";
chr6_mcf_hap5   refGene CDS 4605224 4605361 .   -   0   gene_id "COL11A2"; transcript_id "NM_080679"; exon_number "1"; exon_id "NM_080679.1"; gene_name "COL11A2";
chr6_mcf_hap5   refGene exon    4605810 4606016 .   -   .   gene_id "COL11A2"; transcript_id "NM_080679"; exon_number "2"; exon_id "NM_080679.2"; gene_name "COL11A2";
chr6_mcf_hap5   refGene CDS 4605810 4606016 .   -   0   gene_id "COL11A2"; transcript_id "NM_080679"; exon_number "2"; exon_id "NM_080679.2"; gene_name "COL11A2";
dmolik@computer:~$ head -n 5 hg19_rmsk_TE.gtf
chr1    hg19_rmsk   exon    16777161    16777470    .   +   .   gene_id "AluSp"; transcript_id "AluSp"; family_id "Alu"; class_id "SINE";
chr1    hg19_rmsk   exon    25165801    25166089    .   -   .   gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE";
chr1    hg19_rmsk   exon    33553607    33554646    .   +   .   gene_id "L2b"; transcript_id "L2b"; family_id "L2"; class_id "LINE";
chr1    hg19_rmsk   exon    50330064    50332153    .   +   .   gene_id "L1PA10"; transcript_id "L1PA10"; family_id "L1"; class_id "LINE";
chr1    hg19_rmsk   exon    58720068    58720973    .   -   .   gene_id "L1PA2"; transcript_id "L1PA2"; family_id "L1"; class_id "LINE";

As to other specifications of files we try to follow the UCSC: https://genome.ucsc.edu/FAQ/FAQformat.html#format4

Unfortunately, we do not have our own Documentation when it comes GTF files in TETranscripts.

Regards, David

casimonet commented 7 years ago

Hi,

Thank you for your answer. Below are the full output of the run and extracts of the gene and TE GTF files.

Gene GTF file

chr1 ensembl gene 23739 27683 . - . gene_id "ENSMMUG00000013590"; gene_version "2"; gene_name "HMX1"; gene_source "ensembl"; gene_biotype "protein_coding"; chr1 ensembl transcript 23739 27683 . - . gene_id "ENSMMUG00000013590"; gene_version "2"; transcript_id "ENSMMUT00000019076"; transcript_version "2"; gene_name "HMX1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "HMX1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; chr1 ensembl exon 27582 27683 . - . gene_id "ENSMMUG00000013590"; gene_version "2"; transcript_id "ENSMMUT00000019076"; transcript_version "2"; exon_number "1"; gene_name "HMX1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "HMX1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000131358"; exon_version "1"; chr1 ensembl CDS 27582 27683 . - 0 gene_id "ENSMMUG00000013590"; gene_version "2"; transcript_id "ENSMMUT00000019076"; transcript_version "2"; exon_number "1"; gene_name "HMX1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "HMX1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSMMUP00000017866"; protein_version "2"; chr1 ensembl start_codon 27681 27683 . - 0 gene_id "ENSMMUG00000013590"; gene_version "2"; transcript_id "ENSMMUT00000019076"; transcript_version "2"; exon_number "1"; gene_name "HMX1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "HMX1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";

TE GTF file

chr1 rm2_rmsk exon 54432526 54432836 . - . gene_id "AluJo"; transcript_id "AluJo_dup3890"; family_id "Alu"; class_id "SINE"; chr1 rn5_rmsk exon 54437502 54437813 . - . gene_id "AluJo"; transcript_id "AluJo_dup3891"; family_id "Alu"; class_id "SINE"; chr1 rm2_rmsk exon 54456471 54456768 . + . gene_id "AluJo"; transcript_id "AluJo_dup3892"; family_id "Alu"; class_id "SINE"; chr1 rm2_rmsk exon 54458501 54458832 . + . gene_id "AluJo"; transcript_id "AluJo_dup3893"; family_id "Alu"; class_id "SINE"; chr1 rm2_rmsk exon 54458832 54459116 . - . gene_id "AluJo"; transcript_id "AluJo_dup3894"; family_id "Alu"; class_id "SINE";

Output of the run

INFO @ Tue, 12 Jul 2016 19:25:45: ARGUMENTS LIST: name = iPSC.Percentile_RheMac1 treatment files = ['/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.10p15.6.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.1p16.1.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.2p16.2.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.3p15.3.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.7p15.4.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.8p15.5.dedup.percentile.RheMac1.bam'] control files = ['/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.H9p54.10.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.HiPS.1.7.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.HiPS.3.92.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.HiPSC.2.8.dedup.percentile.RheMac1.bam'] GTF file = /data/share/htp/camille/iPSC/mapping.RheMac1/new/rheMac1.84.gtf.2 TE file = RheMac2_TE_GTF2 multi-mapper mode = multi stranded = yes normalization = DESeq_default (rpm: Reads Per Million mapped; quant: Quantile normalization) FDR cutoff = 5.00e-02 fold-change cutoff = 1.00 read count cutoff = 1 number of iteration = 10 Alignments grouped by read ID = False

INFO @ Tue, 12 Jul 2016 19:25:45: Processing GTF files ...

INFO @ Tue, 12 Jul 2016 19:25:45: Building gene index .......

100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. INFO @ Tue, 12 Jul 2016 19:28:12: Done building gene index ......

INFO @ Tue, 12 Jul 2016 19:28:25: Building TE index .......

Error in building gene/TE index srun: error: gorilla1: task 0: Exited with exit code 1

molikd commented 7 years ago

You're submitting to SLURM?

Generally if TETranscripts can read the file, but if its not the right format an "TE GTF format error! There is no annotation at line" error will occur. Similarly if TETranscripts can't read the file at all an "cannot open such file" error will occur.

This line of the output in the arguments list is odd: "TE file = RheMac2_TE_GTF2" and it makes me wonder if this is a file reference issue when you submit, when you run TETranscripts do give just the "file" or "/the/whole/path/to/the/file" ? Also, what does submission look like?

casimonet commented 7 years ago

Yes I am submitting to SLURM, here is the command:

!/bin/bash

SBATCH --error=TEtranscript.percentile.rhemac1.OnPerContigDedup.%J.err

SBATCH --output=TEtranscript.percentile.rhemac1.OnPerContigDedup.%J.out

SBATCH --nodelist=gorilla1

SBATCH --cpus-per-task=5

cd /data/share/htp/camille/iPSC/mapping.RheMac1/new/TEtranscriptRheMac1

 treat=`ls /data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11*.dedup.percentile.RheMac1.bam`
 contr=`ls /data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.H*.dedup.percentile.RheMac1.bam`

srun TEtranscripts -t $treat -c $contr --GTF /data/share/htp/camille/iPSC/mapping.RheMac1/new/rheMac1.84.gtf.2 --TE /data/share/htp/camille/iPSC/mapping.RheMac1/new/TEtranscriptRheMac1/RheMac2_TE_GTF2 --format BAM --mode multi --iteration 10 --fragmentLength 46 --stranded yes --sortByPos --project iPSC.Percentile_RheMac1

I tried again specifying the whole path and I get the same error (see below the output). So it means that it neither come from a format error neither that the file cannot be read?

`INFO @ Wed, 13 Jul 2016 21:29:42: ARGUMENTS LIST: name = iPSC.Percentile_RheMac1 treatment files = ['/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.10p15.6.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.1p16.1.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.2p16.2.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.3p15.3.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.7p15.4.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.11.8p15.5.dedup.percentile.RheMac1.bam'] control files = ['/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.H9p54.10.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.HiPS.1.7.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.HiPS.3.92.dedup.percentile.RheMac1.bam', '/data/share/htp/camille/iPSC/mapping.RheMac1/new/dedup/perContigs/iPSC.HiPSC.2.8.dedup.percentile.RheMac1.bam'] GTF file = /data/share/htp/camille/iPSC/mapping.RheMac1/new/rheMac1.84.gtf.2 TE file = /data/share/htp/camille/iPSC/mapping.RheMac1/new/TEtranscriptRheMac1/RheMac2_TE_GTF2 multi-mapper mode = multi stranded = yes normalization = DESeq_default (rpm: Reads Per Million mapped; quant: Quantile normalization) FDR cutoff = 5.00e-02 fold-change cutoff = 1.00 read count cutoff = 1 number of iteration = 10 Alignments grouped by read ID = False

INFO @ Wed, 13 Jul 2016 21:29:42: Processing GTF files ...

INFO @ Wed, 13 Jul 2016 21:29:42: Building gene index .......

100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. INFO @ Wed, 13 Jul 2016 21:32:08: Done building gene index ......

INFO @ Wed, 13 Jul 2016 21:32:18: Building TE index .......

Error in building gene/TE index srun: error: gorilla1: task 0: Exited with exit code 1`

molikd commented 7 years ago

If you say that TETranscripts previously worked on SLURM, then it probably isn't SLURM itself, that path in the first post seemed odd, but on the second one seems to be resolved.

The TE GTF file may have duplicates or some white space issue. Is it possible that we (the MHammell lab) could have access to the TE GTF file to groom it? Or if not the whole file the first 20,000 lines?

casimonet commented 7 years ago

​ RheMac2_TE_GTF2_copy4.zip https://drive.google.com/file/d/0B-vhqBgAupV8aUFkblNzdXRDV0U/view?usp=drive_web ​Sure, here is the file. In this "copy4" I removed simple repeats and satellites. I would be interested in knowing what process you employ to groom these file though, as I will built other ones for other assemblies of macaque genome.

2016-07-13 23:27 GMT+02:00 David Molik notifications@github.com:

If you say that TETranscripts previously worked on SLURM, then it probably isn't SLURM itself, that path in the first post seemed odd, but on the second one seems to be resolved.

The TE GTF file may have duplicates or some white space issue. Is it possible that we (the MHammell lab) could have access to the TE GTF file to groom it? Or if not the whole file the first 20,000 lines?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mhammell-laboratory/tetoolkit/issues/6#issuecomment-232492024, or mute the thread https://github.com/notifications/unsubscribe/AQpUznFMx2arQ_6jdw9B-iPS744Phz_kks5qVVg6gaJpZM4JLFnK .

olivertam commented 7 years ago

Hi, I took a quick look at your file, and it looks like there are many lines where the end position (column 5) is in scientific notation rather than integers. I think that would cause some problems with the processing. We have a perl script that allows us to parse UCSC rmsk files. I'm sure that the lab would be happy to share, with the caveat that it may have unexpected bugs (as we haven't tested it much) Good luck.

casimonet commented 7 years ago

This resolved the problem, Thank you very much!

molikd commented 7 years ago

This issue will be incorporated in #7, the TEGroomer feature request.