Personal GTF file - Githubissues

MonkeySylvia commented 6 years ago

Hi, I'm using a personal gtf file from UCSC (zebrafish), and it gives the error "TE GTF format error! There is no annotation at line 1". I found from #6 that you have a file to fix the issue, but I can't find it online. Here is my format (I changed column 6 to integers):

chr1 danRer10_rmsk exon 16773932 16780499 0 - . gene_id_"Gypsy51-I_DR";_transcript_id_"Gypsy51-I_DR";
chr1 danRer10_rmsk exon 41943018 41943117 0 - . gene_id_"DNA-8-14_DR";_transcript_id_"DNA-8-14_DR";
chr1 danRer10_rmsk exon 50331622 50331813 0 + . gene_id_"ANGEL";_transcript_id_"ANGEL";

May I know where to download the file to fix the error? Thanks! Sylvia

olivertam commented 6 years ago

Hi,

I noticed that there are no family_id or class_id provided in the GTF file. You can use the gene_id for the family_id and class_id, but they have to be defined for TEtranscripts to parse the GTF file correctly.

When making our TE GTF from the UCSC rmsk track, we select the following fields from the rmsk table:

  genoName   -  chromosome
  genoStart  -  start position
  genoEnd    -  end position
  strand     -  orientation
  repName    -  TE name (used for gene_id, and forms part of the transcript_id)
  repClass   -  TE class (used for class_id)
  repFamily  -  TE family (used for family_id)

We then generate the TE GTF using these fields, with the transcript_id being the [repName]_dup[#] if the TE annotation is found in multiple locations.

Hope this is helpful. Please let me know if you have any questions. Thanks.
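
For readers who want to see the field mapping above spelled out, here is a minimal Python sketch. It is not the lab's script. The function name, the placeholder rows, the "." score, and the choice to leave the first copy of a TE without a suffix while later copies get _dup1, _dup2, ... are all assumptions made for illustration. The attribute layout (key, space, quoted value, semicolon) follows the usual GTF convention.

```python
from collections import defaultdict


def rmsk_rows_to_gtf(rows, source="user-provided"):
    """Turn (genoName, genoStart, genoEnd, strand, repName, repClass, repFamily)
    tuples into tab-separated TE GTF lines."""
    copies_seen = defaultdict(int)  # how many copies of each repName so far
    lines = []
    for chrom, start, end, strand, name, te_class, family in rows:
        copy_index = copies_seen[name]
        copies_seen[name] += 1
        # Assumption: the first copy keeps the bare name, later copies get _dupN.
        transcript_id = name if copy_index == 0 else f"{name}_dup{copy_index}"
        attributes = (
            f'gene_id "{name}"; transcript_id "{transcript_id}"; '
            f'family_id "{family}"; class_id "{te_class}";'
        )
        fields = [
            chrom,
            source,
            "exon",
            str(int(start) + 1),  # UCSC rmsk starts are 0-based, GTF is 1-based
            str(int(end)),
            ".",                  # assumption: no score column supplied
            strand,
            ".",
            attributes,
        ]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    # Placeholder rows, not real annotations.
    example_rows = [
        ("chr1", 99, 200, "+", "TE_A", "ClassX", "FamilyY"),
        ("chr1", 299, 400, "-", "TE_A", "ClassX", "FamilyY"),
    ]
    for line in rmsk_rows_to_gtf(example_rows, source="example_rmsk"):
        print(line)
```

The key point is that column 9 carries all four attributes; the exact duplicate-numbering scheme should be checked against a GTF file distributed by the lab.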

MonkeySylvia commented 6 years ago

Hi Oliver, I tried reformatting my gtf file, but it still runs into the same error. Could you please help me reformat it? I uploaded the original one here: https://www.dropbox.com/s/qoq7rynmx1n677s/danRer10_repeatmasker0312.gtf?dl=0 Thank you so much! Sylvia

olivertam commented 6 years ago

Hi Sylvia,

It looks like it is still missing the class_id and family_id fields in column 9 of the GTF file. I took the information from the danRer10 UCSC rmsk file, and appended the class_id and family_id to the file. Also, I noticed that only chromosome 10 annotations are in this GTF file. Is that intentional? The updated file is available here: https://www.dropbox.com/s/8xgpovk68q6p1nw/danRer10_repeatmasker0312_updated.gtf.gz?dl=0 Please let me know if the file is still not working for you. Thanks.
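
A quick way to spot this kind of problem before running TEtranscripts is to check column 9 of each line for the four attributes mentioned in this thread. The sketch below is a hypothetical helper, not part of TEtranscripts, and it assumes the conventional GTF attribute layout (key, whitespace, quoted value, semicolon) with tab-separated columns.

```python
import re

# Attributes discussed in this thread as needed in column 9 of a TE GTF.
REQUIRED_KEYS = ("gene_id", "transcript_id", "family_id", "class_id")

# Matches one `key "value";` pair in the attribute column.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def missing_te_attributes(gtf_line):
    """Return the required attribute keys that a GTF line lacks.

    A line without exactly nine tab-separated columns is reported as
    missing every key, since column 9 cannot be located reliably.
    """
    columns = gtf_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return list(REQUIRED_KEYS)
    found = {key for key, _value in ATTRIBUTE_PATTERN.findall(columns[8])}
    return [key for key in REQUIRED_KEYS if key not in found]


if __name__ == "__main__":
    # Placeholder line with only gene_id and transcript_id defined.
    line = "chr1\texample_rmsk\texon\t100\t200\t0\t+\t.\t" \
           'gene_id "TE_A"; transcript_id "TE_A";'
    print(missing_te_attributes(line))
```

Note that a line whose columns are separated by spaces instead of tabs, or whose attributes are joined by underscores instead of spaces, is reported as missing all four keys under these assumptions.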

olivertam commented 6 years ago

If you would like the danRer10 TE GTF for the whole genome, you can download it from here. Thanks

MonkeySylvia commented 6 years ago

Thanks, that's very helpful!

Paterson91 commented 4 years ago

Hi Oliver,

Just a quick question: how did you achieve this? I'm currently in the process of producing the same for the canine genome. Any pointers would be greatly appreciated, thank you!

Alex

olivertam commented 4 years ago

Hi Alex,

We have a perl script (http://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/makeTEgtf.pl.gz) that we can use to make the TE GTF:

 Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]

 Output is printed to STDOUT
 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files

As it is hopefully clear from the usage, you need, at a minimum, a tab-separated file containing the chromosome name, start position, end position, strand information and TE name. We also recommend having a column for the TE class and family names, though they would default to the TE name if not provided. Please let me know if you have any issues.

Thanks
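
To make the flag-to-column mapping in the usage text concrete, here is a small Python sketch that only assembles the command line; it does not run the script. The helper name, the input file name, and the column numbers in the example are placeholders, and whether the script counts columns from 0 or 1 is not stated in this thread, so check that before relying on any numbers.

```python
def build_make_te_gtf_command(infile, chrom, start, end, strand, name,
                              family=None, te_class=None, source=None,
                              one_based=False):
    """Assemble an argument list for makeTEgtf.pl from the documented flags."""
    command = [
        "perl", "makeTEgtf.pl",
        "-c", str(chrom),
        "-s", str(start),
        "-e", str(end),
        "-o", str(strand),
        "-t", str(name),
    ]
    if source is not None:
        command += ["-n", source]         # e.g. a label for the rmsk track
    if family is not None:
        command += ["-f", str(family)]    # otherwise defaults to the TE name
    if te_class is not None:
        command += ["-C", str(te_class)]  # otherwise defaults to the family
    if one_based:
        command.append("-1")              # input coordinates are 1-based
    command.append(infile)
    return command


if __name__ == "__main__":
    # Placeholder column numbers and file name, for illustration only.
    argv = build_make_te_gtf_command(
        "my_rmsk_table.txt", chrom=1, start=2, end=3, strand=4, name=5,
        family=6, te_class=7, source="example_rmsk",
    )
    print(" ".join(argv))
```

Because the script prints to STDOUT, the resulting command would normally be run with its output redirected to a .gtf file.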

Paterson91 commented 4 years ago

Hi Oliver,

Thank you for your speedy response!

I used this as a bit of a training exercise to gen up on my Python knowledge, but to no avail! I received the error below for my GTF file:

TE GTF format error! There is no annotation at line 1. Error in building TE index

I will give your script a go to see how I get on, but any thoughts as to where I might have gone wrong with my own efforts?

No worries if you’re pushed for time, this is purely an exercise to learn a bit more Python.

Thanks again for your response,

Alex

Dr. Alex Paterson (Research Associate Bioinformatician, Bristol Genomics Facility)

University of Bristol Life Sciences Building 24 Tyndall Avenue Bristol United Kingdom, BS8 1TQ

Tel: (+44) 0117 39 41429

Email: a.paterson@bristol.ac.uk Email: genomics-facility@bristol.ac.uk Skype: apaterson91

Web: http://bristol.ac.uk/biology/genomics-facility


Paterson91 commented 4 years ago

Hi Oliver,

You don’t happen to have a local copy of your perl script do you? For some reason it’s not allowing a connection with the host server.

Best wishes,

Alex


On Mar 16, 2020, at 4:07 PM, Oliver Tam wrote:

perl script: http://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/makeTEgtf.pl.gz

olivertam commented 4 years ago

Hi Alex,

It looks like our server might be down temporarily. I've attached a copy here.

I can help you troubleshoot your TE GTF if you are interested. I would need an excerpt of your output to see what might be missing. You can either post it here, or send it to me at tam at cshl dot edu

Thanks

mhammell-laboratory / TEtranscripts

Personal GTF file #21