mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

A question about the TEtoolkit input GFF3 file #47

Closed ChenDepp closed 4 years ago

ChenDepp commented 4 years ago

Some organisms, like Malus (apple), only have a TE annotation in GFF3 format. How can I get a curated annotation of the transposable elements?

olivertam commented 4 years ago

Hi, Thank you for your interest in the TEtoolkit. We are happy to take a look at the GFF3 file for Malus, and determine if it's easy to convert to the format that would be compatible with TEToolkit. Please feel free to provide a link to the GFF3 file, or send a copy to tam at cshl dot edu, and we will let you know if we can help. Thanks

ChenDepp commented 4 years ago

@olivertam Thank you! https://iris.angers.inra.fr/gddh13/the-apple-genome-downloads.html includes all of the Malus genome information, and the GFF3 file is at https://iris.angers.inra.fr/gddh13/downloads/GDDH13_1-1_TE.gff3.bz2. Could you please tell me how to do it? Thank you very much!

olivertam commented 4 years ago

Hi,

I took a quick look at the GFF3 file that you provided, and have a couple of ideas on how to convert the file (that you can try).

Lines where column 3 says "match" look like they contain the majority of the information about the predicted TE in column 9. These are probably the lines that I would use to generate the TE GTF file. Since the GFF3 format is very similar to GTF, you can keep columns 1, 2, and 4 to 8 as is, and just change column 3 to say "exon".
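As a rough illustration of the column 3 step only, here is a minimal Python sketch. The function name and the sample line (including its coordinates and source column) are made up for the example, and column 9 is left untouched here because it still needs the separate conversion described in this post.

```python
def match_line_to_exon(line):
    """Return the line with column 3 set to "exon", or None if the line
    is a comment or is not a "match" feature.

    Columns 1, 2 and 4 to 8 are kept as is. Column 9 is NOT converted
    here; it still holds the original GFF3 attributes.
    """
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "match":
        return None
    cols[2] = "exon"
    return "\t".join(cols)


# Hypothetical input line: the coordinates and source name are invented.
example = "Chr00\tsome_source\tmatch\t100\t200\t.\t+\t.\tID=example_te"
print(match_line_to_exon(example))
```

Lines with any other feature type in column 3 are simply skipped in this sketch.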

The difficult part is column 9, which contains most of the annotation information about the TE. You need four fields in column 9 for a TEtoolkit-compatible GTF file: gene_id (usually the TE name), transcript_id (a unique identifier for the particular TE copy in the genome), class_id (e.g. LTR/LINE/SINE, or Class I/II), and family_id (e.g. Gypsy or Copia). An example of what column 9 should look like (using a human TE as an example):

gene_id "L1HS"; transcript_id "L1HS_copy100"; class_id "LINE"; family_id "L1";
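If you script the conversion, column 9 can be assembled from the four values with a small helper. This is only a sketch, and the function name is made up; the output layout simply copies the example line above.

```python
def format_te_attributes(gene_id, transcript_id, class_id, family_id):
    """Build a column 9 string in the same layout as the L1HS example."""
    return (
        'gene_id "{}"; transcript_id "{}"; class_id "{}"; family_id "{}";'
        .format(gene_id, transcript_id, class_id, family_id)
    )


print(format_te_attributes("L1HS", "L1HS_copy100", "LINE", "L1"))
```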

Looking at the GFF3 file that you linked to, it looks like it might provide most of the information needed for the TE GTF file. For example, in column 9 of the first line of the GFF3 file, you have the ID field ID=ms131072_Chr00_RLX_denovoMDO_kr-B-R1867-Map20;, which could be a good transcript_id candidate. You also have annotations for a number of known TEs that match the predicted apple TE: e.g. Gypsy-6_PX-I:ClassI:LTR:Gypsy:?: 90.80%, where Gypsy-6_PX-I could be the gene_id, ClassI or LTR could be the class_id, and Gypsy could be the family_id.
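To make this concrete, here is a sketch of pulling those pieces out of column 9 with regular expressions. It assumes that every hit follows the same colon-separated shape as the single example quoted above (name:class:order:family:?: score%), which I have not verified for the whole file. The function names are made up, and the test string is just the two quoted fragments joined together, not a real line from the GFF3.

```python
import re

# ID=...; as seen in the quoted example.
ID_RE = re.compile(r"ID=([^;]+)")

# Assumed hit layout: name:class:order:family:<anything>: score%
HIT_RE = re.compile(
    r"([^\s:,;=]+):([^:\s]+):([^:\s]+):([^:\s]+):[^:\s]*:\s*([0-9]+(?:\.[0-9]+)?)%"
)


def parse_id(column9):
    """Return the value of the ID field, or None if there is none."""
    found = ID_RE.search(column9)
    return found.group(1) if found else None


def parse_hits(column9):
    """Return one dict per known-TE hit found in column 9."""
    hits = []
    for name, te_class, order, family, score in HIT_RE.findall(column9):
        hits.append({
            "name": name,
            "class": te_class,
            "order": order,
            "family": family,
            "score": float(score),
        })
    return hits


# The two fragments quoted above, joined for illustration only.
text = ("ID=ms131072_Chr00_RLX_denovoMDO_kr-B-R1867-Map20; "
        "Gypsy-6_PX-I:ClassI:LTR:Gypsy:?: 90.80%")
print(parse_id(text))
print(parse_hits(text))
```

Hits that do not follow the assumed layout are silently ignored by this sketch, so it is worth counting how many lines end up with no hits at all.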

The difficulty with the GFF3 file is that there could be multiple hits for each predicted apple TE (e.g. the first predicted TE matched multiple known TEs), and there are cases where a TE is predicted, but it's not clear if it matches a known TE. In these cases, it might require further curation from those with greater knowledge of the organism in order to select the appropriate annotation for each prediction.

Please let me know if you have any additional questions.

Thanks.

ChenDepp commented 4 years ago

Thank you very much! Your suggestion is very valuable. The TE annotation file is so complex that I can't use it as is, and I don't know how to correct it. I will try what you suggested. Can you also tell me how to re-annotate the TEs in the genome?

olivertam commented 4 years ago

Hi,

Unfortunately, I don't have much experience with the apple genome to better curate their TE annotation. I don't think re-annotation is required (since this typically means another round of computational prediction with a different algorithm), but rather a refinement of the annotation based on some prediction cutoff or orthogonal validations.

I attempted to generate a version of the TE GTF from the GFF3 file that you provided. I applied most of what I had described in the previous post, but with three additional changes:

  1. I parsed the predicted TE annotations, and looked for the one with the highest "score" (the percentage). I only selected annotations that resemble known TEs (i.e. ones for which I could determine the family_id and class_id).
  2. For those where there are annotations, I changed the score column (column 5) to show the % score for that annotation. Thus, the values should go from 0 to 100.
  3. For cases where I don't see an obvious annotation, I created an entry that looks like the example below. The gene_id is the same as the transcript_id, and I arbitrarily designated the class_id as Insilico_Predicted, and the family_id as TE. This is an artificial entry to ensure that the predicted TE is present, but these would certainly require additional curation down the line.

     gene_id "ms131093_Chr00_DXX-MITE_denovoMDO_kr-B-G11849-Map5"; transcript_id "ms131093_Chr00_DXX-MITE_denovoMDO_kr-B-G11849-Map5"; class_id "Insilico_Predicted"; family_id "TE";
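The selection logic in the three points above could be sketched roughly as follows. This is not the script that produced the file, only an illustration: the function name is made up, hits are assumed to be already parsed into (name, class_id, family_id, percent score) tuples, and the second hit in the usage example is invented to show the tie-breaking.

```python
def choose_annotation(transcript_id, hits):
    """Pick the GTF fields for one predicted TE.

    hits is a list of (name, class_id, family_id, percent_score) tuples
    for known TEs matching this prediction; it may be empty.
    """
    if hits:
        # Point 1: keep the hit with the highest percentage.
        name, class_id, family_id, score = max(hits, key=lambda h: h[3])
        return {
            "gene_id": name,
            "transcript_id": transcript_id,
            "class_id": class_id,
            "family_id": family_id,
            # Point 2: the percentage goes into column 5 (0 to 100).
            "score": "{:.2f}".format(score),
        }
    # Point 3: placeholder entry when there is no usable annotation.
    return {
        "gene_id": transcript_id,
        "transcript_id": transcript_id,
        "class_id": "Insilico_Predicted",
        "family_id": "TE",
        # Assumption: column 5 is left as it was in the source file.
        "score": None,
    }


hits = [
    ("Gypsy-6_PX-I", "LTR", "Gypsy", 90.80),
    ("Hypothetical-TE", "LTR", "Copia", 75.0),  # invented for the example
]
print(choose_annotation("ms131072_Chr00_RLX_denovoMDO_kr-B-R1867-Map20", hits))
print(choose_annotation("ms131093_Chr00_DXX-MITE_denovoMDO_kr-B-G11849-Map5", []))
```

Using LTR rather than ClassI as the class_id here is an arbitrary choice for the example; either level could be used as long as it is consistent across the file.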

The TE GTF file is located here.

Please feel free to modify this file to suit your needs, and let me know if you have other questions.

Thanks

ChenDepp commented 4 years ago

Hello, I used the GTF file you provided with TEtoolkit, but it reports an error. Can you tell me how to resolve it? Thank you!


olivertam commented 4 years ago

Hi,

I have also made minor fixes to the TE GTF file (available here) that might address the cause of the error, so feel free to try it again.

If you are still encountering this issue, could you then provide the command line that you used, and the count table file that was generated (file ending in .cntTable)?
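In the meantime, if you have modified the GTF yourself, a quick sanity check of each line against the four attributes discussed earlier in this thread may help narrow things down. The sketch below is not part of TEtoolkit and does not reproduce its parser; the function name is made up, and the sample lines use invented coordinates.

```python
import re

REQUIRED = ("gene_id", "transcript_id", "class_id", "family_id")

# Matches attributes written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def check_te_gtf_line(line):
    """Return a list of problems found in one GTF line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return ["expected 9 tab-separated columns, found {}".format(len(cols))]
    problems = []
    if not (cols[3].isdigit() and cols[4].isdigit()
            and int(cols[3]) <= int(cols[4])):
        problems.append("start/end are not valid coordinates")
    attrs = dict(ATTR_RE.findall(cols[8]))
    for key in REQUIRED:
        if not attrs.get(key):
            problems.append("missing or empty attribute: " + key)
    return problems


# Invented coordinates; the attributes follow the L1HS example in this thread.
good = ("chr1\tsome_source\texon\t100\t200\t.\t+\t.\t"
        'gene_id "L1HS"; transcript_id "L1HS_copy100"; '
        'class_id "LINE"; family_id "L1";')
bad = ("chr1\tsome_source\texon\t100\t200\t.\t+\t.\t"
       'gene_id "L1HS"; transcript_id "L1HS_copy100"; class_id "LINE";')
print(check_te_gtf_line(good))
print(check_te_gtf_line(bad))
```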

Thanks