oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

EDTA has a problem identifying LTR-RTs when an intact LTR-RT is close to a solo LTR from the same family. #355

Open qjiangzhao opened 1 year ago

qjiangzhao commented 1 year ago

Hi Shujun,

I found a problem with how EDTA identifies LTR-RTs when an intact LTR-RT is close to a solo LTR from the same family.

For example, the longer LTR-RT (repeat_region_580) shares its left TSD and left LTR with the shorter LTR-RT (repeat_region_579). The shorter LTR-RT contains all required domains, which means it should be an intact LTR-RT.

On the other hand, I don't think two intact LTR-RTs should share the same TSD and LTR.

I found many annotations like this in my data.

scaffold_9 EDTA repeat_region 1825990 1837407 . + . ID=repeat_region_580;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000657;ltr_identity=0.9935;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA target_site_duplication 1825990 1825994 . + . ID=lTSD_580;Parent=repeat_region_580;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9935;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA long_terminal_repeat 1825995 1826147 . + . ID=lLTR_579;Parent=repeat_region_579;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=1.0000;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA Gypsy_LTR_retrotransposon 1825995 1832268 . + . ID=LTRRT_579;Parent=repeat_region_579;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;ltr_identity=1.0000;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA long_terminal_repeat 1825995 1826147 . + . ID=lLTR_580;Parent=repeat_region_580;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9935;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA Gypsy_LTR_retrotransposon 1825995 1837402 . + . ID=LTRRT_580;Parent=repeat_region_580;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;ltr_identity=0.9935;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA long_terminal_repeat 1832116 1832268 . + . ID=rLTR_579;Parent=repeat_region_579;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=1.0000;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA target_site_duplication 1832269 1832273 . + . ID=rTSD_579;Parent=repeat_region_579;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=1.0000;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA long_terminal_repeat 1837250 1837402 . + . ID=rLTR_580;Parent=repeat_region_580;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000286;ltr_identity=0.9935;Method=structural;motif=TGCA;tsd=TTCAC
scaffold_9 EDTA target_site_duplication 1837403 1837407 . + . ID=rTSD_580;Parent=repeat_region_580;Name=TE_00001108;Classification=LTR/Gypsy;Sequence_ontology=SO:0000434;ltr_identity=0.9935;Method=structural;motif=TGCA;tsd=TTCAC


Yours sincerely Jiangzhao
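
For anyone who wants to list every such case in their own annotation, here is a minimal sketch in Python. It is not part of EDTA or LTR_retriever. The function names (parse_attributes, shared_start_elements) are hypothetical, and it assumes the feature type of an intact element ends in "_LTR_retrotransposon", as in the records above. It groups those features by sequence, strand, and start coordinate and reports any start shared by more than one element. The example attribute columns are trimmed to ID and Parent for brevity.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if item)


def shared_start_elements(gff_lines, feature_suffix="_LTR_retrotransposon"):
    """Group LTR-RT features by (seqid, strand, start) and keep shared starts.

    Assumption: intact elements are the features whose type ends with
    `feature_suffix` (e.g. Gypsy_LTR_retrotransposon in the records above).
    """
    by_start = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        cols = line.split(None, 8)  # 9 GFF3 columns, whitespace-separated
        if len(cols) < 9 or not cols[2].endswith(feature_suffix):
            continue
        attrs = parse_attributes(cols[8])
        key = (cols[0], cols[6], int(cols[3]))
        by_start[key].append((attrs.get("ID"), int(cols[4])))
    # Only starts used by two or more elements are of interest here.
    return {key: hits for key, hits in by_start.items() if len(hits) > 1}


# Two records from this issue, with the attribute column shortened.
example = [
    "scaffold_9 EDTA Gypsy_LTR_retrotransposon 1825995 1832268 . + . "
    "ID=LTRRT_579;Parent=repeat_region_579",
    "scaffold_9 EDTA Gypsy_LTR_retrotransposon 1825995 1837402 . + . "
    "ID=LTRRT_580;Parent=repeat_region_580",
]
shared = shared_start_elements(example)
for (seqid, strand, start), hits in shared.items():
    print(seqid, strand, start, hits)
```

The grouping uses the start coordinate only, which matches the "same left LTR" situation described here. Elements that share a right end instead would need the same grouping on the end coordinate.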

oushujun commented 1 year ago

Hi Jiangzhao,

Thank you for your report. Those are great observations. It seems these two "intact" elements share the same 5' LTR and TSD. I cannot determine whether the 3'-most LTR is a solo LTR or a tandem duplication. You will need to check whether the sequence between the two 3' LTRs contains any LTR coding sequence. I agree that two intact LTR elements are not likely to share TSDs. I am not sure whether this is misannotation, misassembly, or real biology without further checks.

On a side note, LTR_retriever processes each candidate independently, so as long as a candidate possesses all the needed features, it will be reported as an intact LTR. In this case, both seem to be structurally intact, but the shorter one is younger (ltr_identity 1.0000 versus 0.9935).

Shujun
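
To make the "younger" point concrete, here is a rough sketch of how an insertion time can be estimated from the ltr_identity values in the records above. It is not code from LTR_retriever. The function name ltr_insertion_time is hypothetical, and it assumes a Jukes-Cantor correction of the LTR divergence followed by T = K / (2 * mu), with a substitution rate of 1.3e-8 per site per year as a placeholder. Substitute the rate appropriate for your species.

```python
import math


def ltr_insertion_time(ltr_identity, mu=1.3e-8):
    """Estimate insertion time in years from the identity of the two LTRs.

    Assumptions (not taken from this thread): Jukes-Cantor distance
    K = -3/4 * ln(1 - 4d/3) with d = 1 - identity, and T = K / (2 * mu),
    where mu is substitutions per site per year (placeholder value).
    """
    d = 1.0 - ltr_identity
    k = -0.75 * math.log(1.0 - 4.0 * d / 3.0)
    # abs() only normalizes -0.0 to 0.0 when the two LTRs are identical.
    return abs(k) / (2.0 * mu)


for name, identity in [("repeat_region_579", 1.0000), ("repeat_region_580", 0.9935)]:
    print(name, round(ltr_insertion_time(identity)))
```

Under these assumptions, identical LTRs give an estimate of zero, so the element with ltr_identity=1.0000 comes out younger than the one with 0.9935, whatever rate is used.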

qjiangzhao commented 1 year ago

Hi Shujun,

Thanks for your reply and explanation. I will use the shorter version in this case. I also don't know why there are annotations like this; maybe homologous recombination.

Yours sincerely Jiangzhao

oushujun commented 1 year ago

If you want to dig deeper into this issue, you can check whether the sequence between the two 3' LTRs is internal LTR sequence or just intergenic sequence. This will help to confirm whether the extra LTR is a real solo LTR. If it's a solo LTR, it should have two TSDs flanking it. Then see if you can find at least one read that contains this structure to rule out misassembly.

Shujun
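
The first two checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: coordinates are 1-based and inclusive as in GFF3, the helper names (interval_between, solo_ltr_flanks) are hypothetical, and the sequence in the example is a made-up toy string, not scaffold_9. The read-support check needs an alignment viewer or similar and is not covered.

```python
def interval_between(first_end, second_start):
    """Return the 1-based inclusive interval lying between two features."""
    return first_end + 1, second_start - 1


def solo_ltr_flanks(seq, ltr_start, ltr_end, tsd_len=5):
    """Return (left flank, right flank, whether they match as a TSD).

    `seq` is the scaffold sequence as a string; `ltr_start` and `ltr_end`
    are 1-based inclusive coordinates of the candidate solo LTR.
    """
    left = seq[max(0, ltr_start - 1 - tsd_len): ltr_start - 1]
    right = seq[ltr_end: ltr_end + tsd_len]
    return left, right, (len(left) == tsd_len and left == right)


# Region between rLTR_579 (ends at 1832268) and rLTR_580 (starts at 1837250).
start, end = interval_between(1832268, 1837250)
print(start, end, end - start + 1)

# Toy sequence: 3 bp + TSD + 8 bp stand-in for an LTR + TSD + 3 bp.
toy = "GGG" + "TTCAC" + "TGAAAACA" + "TTCAC" + "CCC"
print(solo_ltr_flanks(toy, ltr_start=9, ltr_end=16))
```

The interval returned by interval_between is the region to inspect for internal LTR sequence versus other sequence. An exact string match is a strict test for the flanks; older insertions may carry mismatches in the TSD, so a near match can still be worth a manual look.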

qjiangzhao commented 1 year ago

Hi Shujun,

I compared it with my manually curated TE annotation track. The sequence between the two 3' LTRs consists of TE sequence, including part of a LINE element and part of an LTR-RT. For this reason, I would say it is a solo LTR. May I ask why a solo LTR should be flanked by two TSDs?


Yours sincerely Jiangzhao

oushujun commented 1 year ago

Hi Jiangzhao,

Thanks for doing the curation. I think the second 3' LTR plus the extra piece of LTR sequence is a truncated LTR element that happens to share the same TSD with the first 3' LTR. The LINE element may be a later insertion into the second LTR element. A solo LTR should be flanked by TSDs because it is created by illegitimate recombination between the two LTRs of one element, so nothing should be left but one LTR and the two TSDs.

Shujun