mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

GTF issue #93

Closed singharchana23 closed 3 years ago

singharchana23 commented 3 years ago

Dear TEtranscripts team,

I am using TEtranscripts with the TE/repeat annotation provided by Sol Genomics for tomato. The files I am using are available from the FTP site at: https://solgenomics.net/organism/Solanum_lycopersicum/genome

(1) The first file is at the following path: /ITAG4.0_release/ITAG4.0_REPET_repeats_aggressive.gff

The head of the GFF file looks like this: [image: Screenshot 2021-07-02 at 16 29 37]

I am not sure how to convert this into an acceptable GTF format. I tried converting the GFF to GTF using gffread, but it gives an error:

SL4.0ch00 S-MART exon 46053 46169 . - . transcript_id "ms602093_SL4_0ch00_RLX-incomp_SL4_6m-B-R2137-Map7_reversed";
TE GTF format error! There is no annotation at line 1.

(2) The other repeat file, /ITAG4.0_release/ITAG4.0_RepeatModeler_repeats_light.gff, was created with RepeatMasker and only has a Target attribute; the family information is given in the /ITAG4.0_release/ITAG4.0_RepeatModeler_repeats_light.classified file. Can I use the Perl script "makeTEgtf" that you provided in the thread https://github.com/mhammell-laboratory/TEtranscripts/issues/21 to generate a GTF? I ask because the ITAG4.0_RepeatModeler_repeats_light.classified file gives the repeat class and family together in one column, whereas your script needs them separately.

Could you please advise?

Thanks in advance!

olivertam commented 3 years ago

Hi,

Point 1: It looks like your first GTF is missing the following attributes: family_id, class_id and gene_id. The gene_id value is what is used for quantification, while family_id and class_id are for additional informational purposes (and can be the same as gene_id if desired). I made a version of that annotation as a TE GTF (available here) if you want to test it out.
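To illustrate the missing attributes, here is a minimal Python sketch (not part of TEtranscripts) that adds gene_id, family_id and class_id to a line that only carries transcript_id. The function name `add_te_attributes` is hypothetical, and reusing the transcript_id value for all three attributes is an assumption made purely for illustration; in practice you would probably want gene_id to be the TE name that you want reads aggregated under.

```python
import re


def add_te_attributes(line: str) -> str:
    """Add gene_id, family_id and class_id to a 9-column, tab-delimited GTF line.

    Assumption: the transcript_id value is reused for all three attributes.
    This is only a placeholder choice; pick a gene_id that matches how you
    want TE copies to be grouped for quantification.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")

    match = re.search(r'transcript_id "([^"]+)"', fields[8])
    if match is None:
        raise ValueError("no transcript_id attribute found")

    name = match.group(1)
    # Rebuild the attribute column with the attributes named above.
    fields[8] = (
        f'gene_id "{name}"; transcript_id "{name}"; '
        f'family_id "{name}"; class_id "{name}";'
    )
    return "\t".join(fields)
```

Applied to each non-comment line of the gffread output, this should give every record the three attributes listed above, although whether the resulting grouping is meaningful depends on what you choose for gene_id.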

Point 2: You could use the makeTEgtf.pl script (available here) to process the ITAG4.0_RepeatModeler_repeats_light.classified file; however, a few formatting changes to the file are needed to make it work.

  1. The file is (multi-)space delimited, and has to be converted to tab-delimited.
  2. The first two lines appear to be headers, and need to be commented out (add # to the beginning of each line).
  3. There is empty space at the beginning of nearly all rows (which will become a tab once you make a tab-delimited file). Thus your column counts will have to be offset by one.
  4. Line 523223 appears to be an exception to the formatting quirk above, as it does not have any empty space at the beginning. Thus, you would need to either modify that line to add a tab, or remove all leading spaces from the other lines.
  5. You can split the class and family fields into two if you like. I believe that it is as simple as substituting / with a tab. Just be aware that simple repeats do not have the / in that column, and so their column numbers will be off. However, the Perl script will ignore Simple_repeat entries (by design), and so it should still work.
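The five steps above can be sketched in Python as follows. This is a rough illustration, not a tested converter for the actual file: the function names are hypothetical, the header count of two comes from step 2, and the 0-based column index of the class/family field (`class_family_col=10`) is an assumption based on the usual RepeatMasker output layout, so check it against your file. The sketch takes the second option from step 4 (strip leading spaces everywhere), so no column offset is needed afterwards.

```python
def normalize_line(line: str, line_no: int,
                   header_lines: int = 2,
                   class_family_col: int = 10) -> str:
    """Normalize one line of the space-delimited .classified file.

    line_no is 1-based. class_family_col is the 0-based index of the
    combined class/family column (an assumption; verify for your file).
    """
    text = line.rstrip("\n")
    if line_no <= header_lines:
        return "#" + text  # step 2: comment out the header lines

    # Steps 1, 3 and 4: splitting on runs of whitespace drops any leading
    # spaces, so rows with and without them end up with the same columns.
    fields = text.split()
    if not fields:
        return ""

    # Step 5: split "class/family" into two columns. Entries without "/"
    # (for example Simple_repeat) are left as a single column.
    if len(fields) > class_family_col and "/" in fields[class_family_col]:
        rep_class, family = fields[class_family_col].split("/", 1)
        fields[class_family_col:class_family_col + 1] = [rep_class, family]

    return "\t".join(fields)


def normalize_file(src_path: str, dst_path: str) -> None:
    """Write a tab-delimited copy of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line_no, line in enumerate(src, start=1):
            dst.write(normalize_line(line, line_no) + "\n")
```

The column numbers passed to makeTEgtf.pl would then need to refer to the normalized file, with the family column sitting one position after the class column.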

I also made a version of this annotation as a TE GTF (available here) if you want to test it out.

Let me know if you encounter additional issues.

Thanks.

singharchana23 commented 3 years ago

Thanks very much, Oliver, for the quick response and the GTF. I will check the GTF and get back to you. Thanks a lot!