oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

too many LTR/unknown and most of LTR/unknown are classified as LTR/Copia #51

Closed zhangrengang closed 4 years ago

zhangrengang commented 4 years ago

Thousands of LTRs in a plant genome are classified as unknown by LTR_retriever. However, most of them are classified as Copia on the basis of GyDB, as shown below:

# *.retriever.scn.extend.fa.aa
  Count LTR_retriever   GyDB
    927 LTR     Copia   LTR     Copia
      2 LTR     Copia   LTR     Gypsy
     41 LTR     Gypsy   -       -
      1 LTR     Gypsy   LTR     Caulimoviridae
      5 LTR     Gypsy   LTR     Copia
   2266 LTR     Gypsy   LTR     Gypsy
      9 LTR     Gypsy   LTR     unknown
      5 LTR     unknown -       -
   1248 LTR     unknown LTR     Copia
     21 LTR     unknown LTR     Gypsy
      5 mixture Copia   -       -
     27 mixture Copia   LTR     Copia
      1 mixture Copia   LTR     Gypsy
      1 mixture Copia   Unknown unknown
     85 mixture Gypsy   LTR     Gypsy
      1 mixture unknown -       -
     14 mixture unknown LTR     Copia
      2 mixture unknown LTR     Gypsy
    352 notLTR  unknown -       -
      1 notLTR  unknown LTR     Caulimoviridae
      8 notLTR  unknown LTR     Copia
     17 notLTR  unknown LTR     Gypsy
     43 -       -       LTR     Copia   
    150 -       -       LTR     Gypsy 
      1 -       -       LTR     unknown 
      2 -       -       Unknown unknown 
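
To put a number on "most of them", the three `LTR unknown` rows of the table can be tallied directly. This is only a small arithmetic sketch in Python; the row values are copied from the table above and nothing else is assumed.

```python
# Rows copied from the table above, restricted to elements that
# LTR_retriever calls "LTR unknown": (count, GyDB superfamily).
unknown_rows = [
    (5, "-"),        # no GyDB classification
    (1248, "Copia"),
    (21, "Gypsy"),
]

total = sum(count for count, _ in unknown_rows)
copia = sum(count for count, gydb in unknown_rows if gydb == "Copia")
share = copia / total

print(f"{copia} of {total} LTR/unknown elements are Copia by GyDB ({share:.1%})")
```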

I think there is an issue in annotate_TE.pl:

    $family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
    $family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

Copia is given the same weight (0.3) as Gypsy, but Copia has only 8 PFAMs, ~1/3 of the 28 PFAMs of Gypsy.
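
The imbalance can be illustrated with a minimal Python sketch. The first function mirrors the two quoted Perl lines. Treating `$gypsy` and `$copia` as per-element hit counts, and `"unknown"` as the default family, are assumptions made here for illustration. The second function is a hypothetical variant, not code from annotate_TE.pl, that scales each count by the number of profiles available for that superfamily (28 and 8, as stated above) before applying the same cutoff.

```python
def classify_quoted_rule(gypsy, copia, cutoff=0.3):
    """Python mirror of the two quoted Perl lines.

    The "unknown" default is an assumption; the quoted snippet does not
    show what $family is initialised to.
    """
    family = "unknown"
    if gypsy > copia and copia / gypsy < cutoff:
        family = "Gypsy"
    if copia > gypsy and gypsy / copia < cutoff:
        family = "Copia"
    return family


def classify_normalized(gypsy, copia, n_gypsy=28, n_copia=8, cutoff=0.3):
    """Hypothetical variant: scale counts by the number of profiles per
    superfamily so that both superfamilies are compared on equal footing."""
    return classify_quoted_rule(gypsy / n_gypsy, copia / n_copia, cutoff)


# An element with 3 Copia hits and 1 Gypsy hit: the raw ratio 1/3 is not
# below the 0.3 cutoff, but the scaled ratio is.
raw_call = classify_quoted_rule(gypsy=1, copia=3)
scaled_call = classify_normalized(gypsy=1, copia=3)
print(raw_call, scaled_call)
```

Under these assumptions, an element with a single stray Gypsy hit falls back to unknown with the raw counts but is called Copia once the counts are scaled. Whether this scaling is the right correction would still need benchmarking.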

oushujun commented 4 years ago

Hello @zhangrengang,

I think this is a very good point, and I agree that the classification of copia and gypsy in LTR_retriever is not the best scheme. I have been using the copia- and gypsy-specific HMMs in rice to assign new LTR elements to these superfamilies. A better way would be to use GyDB to assign superfamilies, as you suggested. Another way I have been thinking of, but have not yet had the time to implement, is to classify by the order of these conserved domains, which is the fundamental difference between gypsy and copia.
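
The domain-order idea could be sketched roughly as follows in Python. In Copia elements the integrase domain lies upstream of reverse transcriptase, while in Gypsy elements it lies downstream of reverse transcriptase and RNase H. The function name, the `(name, start)` input format, and the domain labels `"RT"` and `"INT"` are hypothetical. The sketch also assumes that coordinates are already on the element's sense strand, which a real implementation would have to establish first.

```python
def first_start(domains, name):
    """Return the smallest start coordinate among hits to `name`, or None."""
    hits = [start for domain, start in domains if domain == name]
    return min(hits) if hits else None


def classify_by_domain_order(domains):
    """Classify an element from the relative order of INT and RT.

    `domains` is a list of (domain_name, start) tuples on the sense strand.
    Elements missing either domain stay "unknown".
    """
    rt = first_start(domains, "RT")
    integrase = first_start(domains, "INT")
    if rt is None or integrase is None or rt == integrase:
        return "unknown"
    # INT before RT is the Copia arrangement; RT before INT is Gypsy.
    return "Copia" if integrase < rt else "Gypsy"


copia_like = [("GAG", 100), ("PR", 900), ("INT", 1500), ("RT", 2600), ("RH", 3400)]
gypsy_like = [("GAG", 100), ("PR", 900), ("RT", 1500), ("RH", 2300), ("INT", 3000)]
print(classify_by_domain_order(copia_like), classify_by_domain_order(gypsy_like))
```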

If you can implement a better scheme, you are welcome to contribute! For benchmarking accuracy, I use the curated rice TE library.

Best, Shujun

zhangrengang commented 4 years ago

Hello Dr. Ou, here is a simple implementation. You may test it and/or integrate it.

oushujun commented 4 years ago

Hello @zhangrengang ,

Thank you so much for developing this code in such a short time. I will test it soon and let you know.

Best, Shujun