oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

LTR_retriever reports redundant/duplicated intact LTR-RT from inputs of both LTR_finder and LTR_harvest #28

Closed b524198065 closed 5 years ago

b524198065 commented 5 years ago

Hi Shujun,

When results from both LTR_finder and LTR_harvest were given to LTR_retriever, I found a few likely duplicated intact LTR-RT entries in the pass.list and pass.list.gff3 files, which are interesting:

like this:

tig00000022_1:257836..262566 pass motif:TGCA TSD:ACTAC 257831..257835 262567..262571 IN:258538..261865 0.9672 - unknown NA 1289956
tig00000022_1:257836..262566 pass motif:TGCA TSD:GTAGT 257831..257835 262567..262571 IN:258538..261865 0.9673 ? unknown NA 1285934

and this:

tig00000209:486380..491904 pass motif:TGCA TSD:TTG 486377..486379 491905..491907 IN:486636..491647 0.9804 - unknown LTR 763871
tig00000209:486385..491904 pass motif:TGCA TSD:NA .. .. IN:486636..491652 0.9802 - unknown LTR 771771

tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT 1060510..1060514 1070651..1070655 IN:1060827..1070338 0.9904 - unknown LTR 371614
tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC 1061540..1061544 1066431..1066435 IN:1061687..1066288 0.993 - unknown LTR 270495

It seems that only some of the features (e.g. the TSD) differ between the two redundant entries, while their locations on the genome are almost the same.

Although the number of likely duplicated intact LTR-RTs is low (5 of 497 candidates), I think it is still good to ensure the results are reliable. How do I know which prediction is the better or proper one, and how do I remove the duplicate?

Many thanks,

Hongbo

oushujun commented 5 years ago

Hi Hongbo,

You can randomly select one from each set of apparent duplicates. They are equally reliable. The redundancy will be removed during the library construction procedure.

This is caused by the different predictions generated by LTRharvest and LTR_finder: LTR_retriever has to work out the true structure starting from different candidates, and it can end up with features that are not exactly the same but that both fit the current definition of an LTR-RT.
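For anyone who wants to list such near-duplicate entries before choosing one, here is a minimal Python sketch. It is not part of LTR_retriever. It assumes that the first whitespace-separated field of each pass.list line has the form `seqid:start..end` (as in the examples above), and the function names and the 0.9 overlap threshold are arbitrary choices made for illustration.

```python
import re
from itertools import combinations

# First field of a pass.list line, e.g. "tig00000209:486380..491904"
LOC = re.compile(r"^(\S+):(\d+)\.\.(\d+)$")


def parse_loc(line):
    """Return (seqid, start, end) from the first field of a pass.list line."""
    match = LOC.match(line.split()[0])
    if match is None:
        raise ValueError(f"unrecognised location field in: {line!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def near_duplicates(lines, min_frac=0.9):
    """Pairs of entries on the same sequence whose overlap covers at least
    min_frac of the LONGER entry (hypothetical threshold, not from the tool)."""
    entries = [parse_loc(l) for l in lines if l.strip() and not l.startswith("#")]
    pairs = []
    for a, b in combinations(entries, 2):
        if a[0] != b[0]:
            continue
        overlap = min(a[2], b[2]) - max(a[1], b[1]) + 1
        longer = max(a[2] - a[1], b[2] - b[1]) + 1
        if overlap > 0 and overlap / longer >= min_frac:
            pairs.append((a, b))
    return pairs


# The six entries quoted in this issue (trailing columns omitted for brevity).
example = [
    "tig00000022_1:257836..262566 pass motif:TGCA TSD:ACTAC",
    "tig00000022_1:257836..262566 pass motif:TGCA TSD:GTAGT",
    "tig00000209:486380..491904 pass motif:TGCA TSD:TTG",
    "tig00000209:486385..491904 pass motif:TGCA TSD:NA",
    "tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT",
    "tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC",
]

for a, b in near_duplicates(example):
    print(f"{a[0]}:{a[1]}..{a[2]}  ~  {b[0]}:{b[1]}..{b[2]}")
```

With this threshold the first two pairs are reported, while the third pair is not, because the shorter entry covers less than half of the longer one.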

Can you provide the sequences of these candidates, extended by 100 bp on both ends? I can look into them further and see if there is a way to improve the algorithm.
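A minimal Python sketch of how such extended sequences could be cut out, assuming the contig sequence has already been loaded as a string and that the pass.list coordinates are 1-based and inclusive (an assumption, not confirmed in this thread). The function name `extract_with_flank` is hypothetical.

```python
def extract_with_flank(contig_seq, start, end, flank=100):
    """Return (new_start, new_end, subsequence) for a 1-based inclusive
    region extended by `flank` bases on both sides, clipped to the contig."""
    new_start = max(1, start - flank)
    new_end = min(len(contig_seq), end + flank)
    # Convert 1-based inclusive coordinates to a Python slice.
    return new_start, new_end, contig_seq[new_start - 1:new_end]


# Toy 400 bp contig, invented purely for illustration.
contig = "ACGT" * 100
new_start, new_end, sub = extract_with_flank(contig, 150, 250)
print(new_start, new_end, len(sub))
```

Clipping at the contig ends means the extension can be shorter than 100 bp for candidates close to an edge.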

Thanks, Shujun

b524198065 commented 5 years ago

@oushujun Here are the FASTA-format sequence files for the three examples above.

example_1.txt example_2.txt example_3.txt

oushujun commented 5 years ago

Thanks! I will look into them when time allows.

Shujun

oushujun commented 5 years ago

Hi Hongbo,

Thank you for providing the sequence.

For the first one:

tig00000022_1:257836..262566 pass motif:TGCA TSD:ACTAC 257831..257835 262567..262571 IN:258538..261865 0.9672 - unknown NA 1289956
tig00000022_1:257836..262566 pass motif:TGCA TSD:GTAGT 257831..257835 262567..262571 IN:258538..261865 0.9673 ? unknown NA 1285934

This is a true LTR, but due to a minor bug, the direction information of the first entry was inherited from LTR_FINDER, which resulted in the TSD being converted into its reverse complement. To be consistent, direction information from LTR_FINDER will be removed and the direction will be inferred de novo by the LTR_retriever algorithm.
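The relationship between the two reported TSDs can be checked with a few lines of Python. This is only an illustrative sketch, not code from LTR_retriever.

```python
# Complement table for unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# The two TSDs reported for tig00000022_1:257836..262566
print(reverse_complement("ACTAC") == "GTAGT")
```

So the two entries describe the same element read from opposite strands.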

For the second one:

tig00000209:486380..491904 pass motif:TGCA TSD:TTG 486377..486379 491905..491907 IN:486636..491647 0.9804 - unknown LTR 763871
tig00000209:486385..491904 pass motif:TGCA TSD:NA .. .. IN:486636..491652 0.9802 - unknown LTR 771771

I used the sequence you provided and got only the first prediction. This is a true LTR; however, the prediction is biased toward the canonical TG..CA motif over non-canonical motifs. In this case, the TSD-motif structure should be 5'-CATTG[TG...TA]CATTG-3', but LTR_retriever picked up 5'-TTG[TG...CA]TTG-3' instead. I have corrected the bias and pushed the update to LTR_retriever v2.0.
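To see why both readings are possible on the same stretch of DNA, here is a small Python sketch. The toy sequence and the function `read_structure` are invented for illustration; this is not how LTR_retriever implements the search.

```python
def read_structure(seq, start, end, tsd_len):
    """Interpret seq[start:end] (0-based, half-open) as the element.

    Returns (TSD, motif) if the tsd_len bases on each side are identical,
    otherwise None. The motif is the first two plus the last two bases
    of the element.
    """
    if start < tsd_len or end + tsd_len > len(seq):
        return None
    left_tsd = seq[start - tsd_len:start]
    right_tsd = seq[end:end + tsd_len]
    if left_tsd != right_tsd:
        return None
    return left_tsd, seq[start:start + 2] + seq[end - 2:end]


# Toy sequence: flank + CATTG + [TG...TA] + CATTG + flank
toy = "AAA" + "CATTG" + "TG" + "ACGTACGT" + "TA" + "CATTG" + "CCC"

# Reading 1: 5 bp TSD with the non-canonical TG..TA motif.
print(read_structure(toy, 8, 20, 5))
# Reading 2: shift the 3' boundary by 2 bp; 3 bp TSD with canonical TG..CA.
print(read_structure(toy, 8, 22, 3))
```

Both calls return a consistent TSD and motif, so a search that prefers TG..CA ends up with the shorter TSD unless it also weighs the longer match.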

For the last one:

tig00000241:1060515..1070650 pass motif:TGCA TSD:TTTGT 1060510..1060514 1070651..1070655 IN:1060827..1070338 0.9904 - unknown LTR 371614
tig00000241:1061545..1066430 pass motif:TGCA TSD:AAAAC 1061540..1061544 1066431..1066435 IN:1061687..1066288 0.993 - unknown LTR 270495

This is a case of nested LTR elements, with the second LTR element inserted into the first element. LTR_retriever takes care of such cases. Nested sequences would be removed in the final library if intact versions of such sequences are found.
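The nesting can be read directly from the coordinates. A minimal Python sketch follows; the function name `is_nested` is hypothetical, and the sketch only compares intervals rather than reproducing LTR_retriever's own handling.

```python
def is_nested(outer_internal, inner_element):
    """True if inner_element lies entirely inside outer_internal.

    Both arguments are (start, end) tuples with inclusive coordinates.
    """
    return (outer_internal[0] <= inner_element[0]
            and inner_element[1] <= outer_internal[1])


# Internal region (IN:) of the first entry and full span of the second entry.
outer_in = (1060827, 1070338)
inner = (1061545, 1066430)
print(is_nested(outer_in, inner))
```

The second element, including its TSDs at 1061540..1066435, sits entirely within the internal region of the first, which is the pattern expected for a nested insertion.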

I have fixed the bugs behind the first two cases, along with other minor bugs, and pushed a new version of LTR_retriever to the repository. The new version has annotation performance similar to that of previous versions, but with better detail in TSD and motif identification. Hope this helps!

Best, Shujun