Meaning of *.defalse file

oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/

GNU General Public License v3.0

176 stars 40 forks source link

Meaning of *.defalse file #86

Closed Weihankk closed 3 years ago

Weihankk commented 3 years ago

Hi Dr. Ou. I use LTR_retriever to analyze my genome. However, the number of intact LTR is much less than I expect and the LAI value is low. So I want to use the result of LTR_retriever to go back to my assembly genome and find why these false-positive LTRs were filtered out. I find the unpass LTRs were store in the defalse file. As I'm not very familiar with LTR, I tried to understand the information stored in defalse file and sequence and felt very difficult. So I want to try to find some help and want to know why this LTR is judged as a false-positive.

Take this LTR information stored in *defalse file as an example:

Seq3:1001744..1012949   false   motif:TGCA      TSD:GCCAT       1001739..1001743        1012950..1012954        IN:1002461..1012284     0.8906  -       Gypsy   LTR     4548803
        Adjust: NO      lLTR: NA        rLTR: NA
        Alignment regions: 3018, 3411, 3702, 4094
        LTR coordinates: 1001744, 1002460, 1012285, 1012949
        TSD-LTR overlap: 0
        Boundary missing: 0

Sequence of Seq3:1001744 - 1012949:

TGTTTGTGCAAATAATCTCACGGGAGAATATTCGCTTCTAGATCATTCAACCTTCG 
........................
AATCTTCTTTTCCACGTGGCGTGCCCTGATTGGAGACGAAAATATATGCTCCCACA

Sequence of Seq3:1001739 - 1001743 (5' TSD):

ATGGC

Sequence of Seq3:1012950 - 1012954 (3' TSD):

ATGGC

I found this LTR strand is '-', so I get the reverse complement of these sequences. Reverse complement sequence of Seq3:1001744 - 1012949:

TGTGGGAGCATATATTTTCGTCTCCAATCAGGGCACGCCACGTGGAAAAGAAGATT
...
CGAAGGTTGAATGATCTAGAAGCGAATATTCTCCCGTGAGATTATTTGCACAAACA

Sequence of Seq3:1001739 - 1001743 (5' TSD):

GCCAT

Sequence of Seq3:1012950 - 1012954 (3' TSD):

GCCAT

I am very curious about how LTR_retriever judges the LTR is/not intact based on this information. If we assumed that this position is a intact LTR, but it is filtered out due to assembly errors, can you see if there are any obvious assembly errors in these sequences?

Weihankk commented 3 years ago

Additionally, I try to use only this LTR to run LTR_retriever, I get the stdout:

Thu Jan  7 21:04:01 CST 2021    Dependency checking: All passed!
Thu Jan  7 21:04:15 CST 2021    LTR_retriever is starting from the Init step.
Thu Jan  7 21:04:15 CST 2021    Start to convert inputs...
                                Total candidates: 1
                                Total uniq candidates: 1

Thu Jan  7 21:04:17 CST 2021    Module 1: Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Thu Jan  7 21:04:17 CST 2021    1 clean candidates remained

Thu Jan  7 21:04:17 CST 2021    Modules 2-5: Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Jan  7 21:04:20 CST 2021    Intact LTR-RT found: 0

Thu Jan  7 21:04:20 CST 2021    No LTR-RT was found in your data.

Thu Jan  7 21:04:20 CST 2021    All analyses were finished!

It seems that this LTR does not meet the criteria of terminal motif, TSD, boundary, orientation, age and superfamily ? How to determine which step does not meet the standard?

oushujun commented 3 years ago

Hi Weihan,

Thank you for providing the outputs. Based on the information from the defalse file, it seems that the program can not find the terminal direct repeat from this candidate. What it found for the direct repeat is 3018-3411, 3702-4094 on the candidate with length > 10kb, so LTR_retriever could not identify the terminal repeat and determine this is not an intact LTR. To verify, you may blast the candidate against itself and see where things align.

There are many criteria used together to determine the authenticity of a candidate, you may read the paper for more details or even the code. Let me know if you have any questions.

Best, Shujun

On Thu, Jan 7, 2021 at 9:19 PM Weihan notifications@github.com wrote:

Additionally, I try to use only this LTR to run LTR_retriever, I get the stdout:

Thu Jan 7 21:04:01 CST 2021 Dependency checking: All passed! Thu Jan 7 21:04:15 CST 2021 LTR_retriever is starting from the Init step. Thu Jan 7 21:04:15 CST 2021 Start to convert inputs... Total candidates: 1 Total uniq candidates: 1

Thu Jan 7 21:04:17 CST 2021 Module 1: Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Thu Jan 7 21:04:17 CST 2021 1 clean candidates remained

Thu Jan 7 21:04:17 CST 2021 Modules 2-5: Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Jan 7 21:04:20 CST 2021 Intact LTR-RT found: 0

Thu Jan 7 21:04:20 CST 2021 No LTR-RT was found in your data.

Thu Jan 7 21:04:20 CST 2021 All analyses were finished!

It seems that this LTR does not meet the criteria of terminal motif, TSD, boundary, orientation, age and superfamily ? How to determine which step does not meet the standard?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/86#issuecomment-756111347, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NHU22SMCHADZ7OYEYDSYWYFZANCNFSM4VY5SDMA .

Weihankk commented 3 years ago

Hi Shujun, Thank you for your kind reply. I have read the article and understand some about LTR_retriever. I find some candidatre in defalse is marked as 5' rLTR or 3' rLTR. What's the meaning of 5' rLTR or 3'rLTR? Is there a way to manually correct it, such as some obvious minor errors?

Seq3:84968..87748       false   motif:AGAA      TSD:ATTCC       84963..84967    87749..87753    IN:85177..87548 0.9122  ?       unknown NA      3591701
        Adjust: 5' rLTR lLTR: 210       rLTR: 201
        Alignment regions: 6, 210, 2578, 2782
        LTR coordinates: 84968, 85176, 87549, 87748
        TSD-LTR overlap: 0
        Boundary missing: 0

Best, Weihan

oushujun commented 3 years ago

Hi Weihan,

Thank you for your interest in how LTR_retriever works. LTR_retriever tries to correct for the start and stop sites of the 3' and 5' LTR, because prediction engines are usually not accurate. Your new example says the 5' rLTR (the right side of the 5' LTR, aka., the start of the LTR) has been adjusted, and the adjusted LTR length is 210 and 201 bp. Here you can see a sign of error. We expect the LTR length should be the same given they are direct repeat. So if the length difference exceeds a certain threshold (I forget if it's 10 or 30bp), it will be classified as false.

Another obvious sign of error is the motif:AGAA. We never see an authentic LTR carrying such motif. 99% of them are TG..CA, and the rest of them still start with T. So this sign alone can determine this is a false candidate.

Other signs, such as the lack of coding information, the unknown strand, and the strange LTR coordinates, all these signs suggest this is not an authentic LTR element.

Hope these help!

Shujun

On Fri, Jan 8, 2021 at 8:15 PM Weihan notifications@github.com wrote:

Hi Shujun, Thank you for your kind reply. I have read the article and understand some about LTR_retriever. I find some candidatre in defalse is marked as 5' rLTR or 3' rLTR. What's the meaning of 5' rLTR or 3'rLTR? Is there a way to manually correct it, such as some obvious minor errors?

Seq3:84968..87748 false motif:AGAA TSD:ATTCC 84963..84967 87749..87753 IN:85177..87548 0.9122 ? unknown NA 3591701 Adjust: 5' rLTR lLTR: 210 rLTR: 201 Alignment regions: 6, 210, 2578, 2782 LTR coordinates: 84968, 85176, 87549, 87748 TSD-LTR overlap: 0 Boundary missing: 0

Best, Weihan

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/86#issuecomment-756725414, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NFAH46455OUVF5WKQTSY3ZNNANCNFSM4VY5SDMA .

Weihankk commented 3 years ago

Hi Shujun,

Thank you for your enthusiastic and timely reply. I assembled dozens of reference genomes for a plant. All of the whole genome LAI values > 20 expect one genome is ~16. This makes me with some obsessive-compulsive disorder sometimes feel uncomfortable (I really want to provide a dozens of high quality genomes for other researchers, haha). 😄

This genome is sequenced by PacBio about 120X. I perform another sequence by PacBio on the same individual, this time I obtained about 500X PacBio long reads! I assembled them with different software and different reads length cutoff. However, the LAI value is stable(~ 16-17).

Hence, from my experience, I think LAI is very robust 👍 . My assembly from different depth (120X vs. 500X), different assembly method (CANU/Falcon/NextDenovo) and different assembly parameters all showed the same LAI value. I guess this may due to my 120X data and 500X data both sequence from the same DNA library. Maybe during the DNA capture stage before sequencing, few complete DNA is captured. No way to get a complete result sequencing from a relatively incomplete DNA library. But LAI value ~16 is not bad at all as you say in another issue #71 and other pubulished research result. So I won’t spend too much time at this stage. All genomes have reached 20, and it is a pity that only one is below 20. Scientific research is like our life, nothing is ever perfect in this world. At first I wanted to try whether to construct a special software or script to polish the genome by LAI value. But my knowledge level is not enough, and this is not the focus of my research.

Here I just share my experience and thoughts, hoping to help other researchers and the development of LTR_retriever. Thank you for developing such powerful software.

Best regards, Weihan

P.S. If you need my data for testing or want to know other details, please feel free to contact me whzhang@webmail.hzau.edu.cn

oushujun commented 3 years ago

Dear Weihan,

Thank you for sharing your experience. I am sorry this thread slips away from my focus. For sequencing dozens of genomes for a plant, did you mean they are all in one species or different relevant species? I think your finding is interesting that the library prep stage may affect the contiguity of the assembly. While I am certainly not an expert in doing the actual sequencing, can you provide more information about your library prep? For example, did you use the same DNA extraction method and library kit for these genomes? How long are the sequence reads?

Thanks, Shujun

Weihankk commented 3 years ago

Dear shujun， First of all apologize for my bad English. I have sequenced dozens species, these species belong to the same genus (eg. wild, cultivars, landrances), and I will construct a pan-genome. Sorry I am not an expert in sequencing experiments either. But I know they all use the same kit and extraction method to build the library.

Here I give you a detailed description of the only sample with LAI < 20. This sample named SampleA. At the beginning of my project, dozens of samples were all sequenced about 120 ~ 160X by Pacbio. The subreads length also well and subreads N50 are 10 ~ 20k. This looks really good so I assembled them by CANU and obtained dozens of genomes. The LAI value of all these genomes exceed 20 except SampleA is 16. And the contig N50 of SampleA also very unusual (just 200kb). So we contacted technical experts to perform another round sequencing. This time I got about 500X new Pacbio data of SampleA, also with normal subreads length and subreads N50. I assembled SampleA and run LTR_retriever, the contig N50 improved to ~5Mb while the LAI is still ~16. Surprised me and incomprehensible. Since I have enough data, I also tried only use length > 8kb, 10kb, even 20kb subreads to run different assembly software, all the LAI values is between 16 and 17, stable as Mount Tai. According to our common sense and your articles published on Nature communication, high-depth sequencing will improve assembly quality. The contig N50 is significantly improved (200kb to 5Mb), but the LAI value no improvement.

While writing here, I thought about it again. If it is a problem with library building, the contig N50 will not be improved. I used to think there was a problem with the DNA extraction process, human factors like some experimental operations. However, I got reasonable geome size, high contig N50 and high BUSCO (99%). So my guess last time about sequencing library construction may be unreasonable. The problem that still bother me is high-depth sequencing assembly get a long contig N50 but low LAI value. Different assembly methods and parameters have an effect on contig N50, but it doesn’t seem to have effect on LAI. Even I only use subreads length > 10kb to run assembly the LAI is still ~16. I have high-depth sequencing and long reads, the LAI hasn’t improved.

Thank you for taking the time to discuss so much with me.

Best regards， Weihan