oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Negative coordinates in TEanno.gff3 #263

Open nhartwic opened 2 years ago

nhartwic commented 2 years ago

Basically the title. Here are the weird lines from the gff3 file...

15593   EDTA    repeat_region   -2      3598    .       ?       .       ID=repeat_region_23903;Name=TE_00012818;Classification=LTR/unknown;Sequence_ontology=SO:0000657;ltr_identity=0.9733;Method=structural;motif=TGCA;tsd=TTAAT
15593   EDTA    target_site_duplication -2      2       .       ?       .       ID=lTSD_23903;Parent=repeat_region_23903;Name=TE_00012818;Classification=LTR/unknown;Sequence_ontology=SO:0000434;ltr_identity=0.9733;Method=structural;motif=TGCA;tsd=TTAAT

...I've never seen negative coordinates like this before in any of my other EDTA runs. I'm not really sure what this is supposed to mean, but my downstream tools really don't like it.

I'm currently running EDTA version 1.9.6.

Let me know if there are any files I can send to help figure out what happened here. In the meantime, I've noticed that EDTA 2.0 was released a few months ago, so I suppose I'll update. As for this specific output, I'm just going to manually edit the gff3 to fix this entry and move on with life.

oushujun commented 2 years ago

Hi @nhartwic,

It looks like a bug. Can you send the contig sequence 15593 to my email shujun.ou.1@gmail.com? Thanks!

Shujun

nhartwic commented 2 years ago

Apologies for the delay on this. Got sidetracked.

EtweTM011.v2.15593.fasta.gz EtweTM011.v2.fasta.mod.EDTA.TElib.fa.gz

Here are the contig and the repeat library that EDTA generated for the whole assembly.

oushujun commented 10 months ago

Hello @nhartwic,

Sorry for the long-overdue reply. This issue originated in LTR_retriever for LTR candidates found at the boundary of sequences (i.e., contig 15593 in your case). LTR_retriever needs to extract 50 bp of flanking sequence on each candidate for further analysis. The element in your case starts at position 6 of contig 15593, which leaves insufficient flanking sequence for the program and thus produces erroneous results. I have added filters to remove cases like these, because they cannot provide enough flanking sequence for LTR_retriever to determine the authenticity of the candidate. The update is reflected in this commit: https://github.com/oushujun/LTR_retriever/commit/4039eb7778fd9cbc60021e99a8693285e0fa2daf.
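
To illustrate the boundary condition described above, here is a minimal Python sketch. It is not the LTR_retriever code and does not reproduce the linked commit. The function name, the 1-based inclusive coordinates, the requirement of a flank on both sides, and the contig length used in the example are all assumptions for illustration. Only the 50 bp flank length comes from the explanation above.

```python
FLANK = 50  # flank length in bp, as stated in the explanation above


def has_sufficient_flank(start, end, seq_len, flank=FLANK):
    """Return True if `flank` bases exist on both sides of the candidate.

    Assumes 1-based, inclusive coordinates (hypothetical helper, not part
    of LTR_retriever).
    """
    left_ok = start - flank >= 1        # flank must not run off the contig start
    right_ok = end + flank <= seq_len   # flank must not run off the contig end
    return left_ok and right_ok


# A candidate starting at position 6 has only 5 bases to its left,
# so a 50 bp left flank cannot be extracted.
# The contig length of 10000 is a made-up value for this example.
print(has_sufficient_flank(6, 3598, 10000))
```

A candidate that fails this kind of check is one that a filter would discard, because no trustworthy flanking sequence can be extracted for it.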

You may manually remove such cases or rerun LTR_retriever on EDTA/raw using the latest version on GitHub. Note that the conda version lags behind the GitHub version.

Hope this helps! Sorry again for the delay. Please let me know if you have further questions.

Best, Shujun