oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Question about filtering #183

Closed SolomiyaHn closed 3 years ago

SolomiyaHn commented 3 years ago

Hello Shujun,

I have been reading about the EDTA pipeline (the benchmarking paper and the supplementary methods). I have a few questions about the filtering steps. I am a bit confused by which filtering steps are included in the EDTA pipeline and which ones were only used for the benchmarking process.

I have taken some notes on the sequence of the EDTA filtering process. Please correct me if it is wrong, or if you could point me towards some more information on the filtration process, that would be very helpful.

For Helitrons

For both Helitrons and TIRs: Basic filtering

Advanced filtering

This output library is then used to mask the genome, and RepeatModeler is run on the unmasked parts of the genome to supplement the library further.

Thank you, Solomiya

oushujun commented 3 years ago

Hi Solomiya,

Thank you for reading the paper and supplementary file carefully. I have copied your list of filters and annotated it with my thoughts. My answers to your questions start with "A:".

For Helitrons

For both Helitrons and TIRs: Basic filtering

Advanced filtering

You may want to read the code to get a better understanding of how these steps are arranged. You may find more parameters and treatments there, since the program has been updated frequently based on feedback.

Let me know if you have more questions.

Best, Shujun

SolomiyaHn commented 3 years ago

Thank you for your quick response! It is very helpful. However, I still don't fully understand the logic behind the filtering step that removes false positives by comparing the terminal end copy number to the insertion site copy number. I would like to understand the different possibilities that the criteria are differentiating between.

If only one terminus has over 5 or 20 (Helitron: 5; TIR: 20) full-length copies, are we assuming that the more common terminus is located within the sequence of another TE, meaning it is a false positive (FP)?

If both termini have over 5 or 20 full-length copies, does this mean that either the TE is nested within another TE or it is a false positive within another TE?

If both termini add up to being considerably more common (>10000 times) than 2 times the abundance of the insertion site (but less than 50,000 copies each), then it is a false positive. What is the logic behind this criterion? What could be causing such a high ratio of the terminal sequences to the undisturbed target site if it is not a nested TE?

Thanks, Solomiya

oushujun commented 3 years ago

Hi Solomiya,

Good questions. The logic comes from the structures of TEs that are used for de novo identification. For LTR retrotransposons, the structure is -> --- ->. For TIRs, the structure is -> --- <-. For Helitrons, it's only TC---CTRR. These structures not only identify the target TEs, but also pick up other sequences with similar features. For example, if two LINE elements are located close to each other in the same orientation, they look like an LTR element!
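To make the structural signatures above concrete, here is a minimal Python sketch. It is only an illustration, not code from EDTA: the function names and the exact-match comparison of the first and last `k` bases are assumptions made for brevity (real structure-based finders tolerate mismatches and use alignment).

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def terminal_structure(seq: str, k: int) -> str:
    """Classify the termini of a candidate by exact comparison of k bases.

    Hypothetical helper: 'direct' mimics the LTR pattern (-> --- ->),
    'inverted' mimics the TIR pattern (-> --- <-).
    """
    left, right = seq[:k].upper(), seq[-k:].upper()
    if left == right:
        return "direct"
    if left == reverse_complement(right):
        return "inverted"
    return "none"


# Two copies of the same LINE fragment in the same orientation, separated
# by unrelated sequence, produce the same "direct" signature as an LTR.
line_part = "ACGTTGCA"
two_lines = line_part + "GGGGCCCC" + line_part
print(terminal_structure(two_lines, k=8))
```

The last lines show why structure alone is not enough: the composite of two LINE fragments is classified as "direct", just as a genuine LTR element would be.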

Annotation programs also use other features to reduce false reporting, such as TSDs and motifs for LTRs. This helps to some extent, but because these features are so simple (TSDs are usually 5 bp in LTRs, and the motif is TG...CA), they can also be found in other TEs by chance. So you will frequently end up with scenarios like this: LINE1_part--TSD_alike--motif_alike--direct_repeat-----direct_repeat--motif_alike--TSD_alike--LINE2_part. It looks like an LTR element, but it is actually a composite of parts of two LINE elements: a false report.

What's the difference between a real LTR and a false case like this? The repetitiveness of the flanking region! The false candidate is picked from two TEs, so its flanking sequences are also part of these TEs, which means they are also repetitive. For a real LTR element, the insertion site is less likely to be repetitive, so that's why we check the flanking copy number to tell them apart.
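The flanking check can be sketched roughly as follows. This is a toy illustration under stated assumptions, not EDTA's implementation: copies are counted by exact forward-strand string matching rather than by sequence alignment, and `flank_len` and `max_flank_copies` are hypothetical placeholder parameters.

```python
def count_copies(genome: str, query: str) -> int:
    """Count (possibly overlapping) exact occurrences of query in genome."""
    count, pos = 0, genome.find(query)
    while pos != -1:
        count += 1
        pos = genome.find(query, pos + 1)
    return count


def flank_looks_repetitive(genome: str, start: int, end: int,
                           flank_len: int = 6,
                           max_flank_copies: int = 1) -> bool:
    """True if either flank of the candidate genome[start:end] occurs
    more than max_flank_copies times in the genome."""
    left = genome[max(0, start - flank_len):start]
    right = genome[end:end + flank_len]
    return any(flank and count_copies(genome, flank) > max_flank_copies
               for flank in (left, right))


# Candidate sitting in unique sequence: flanks occur once each.
unique_site = "TTGACC" + "AAAAAAAA" + "GCTAGT"
# Candidate carved out of a repeat: the flank "CATCAT" recurs elsewhere.
inside_repeat = "CATCAT" + "GGGGGGGG" + "CATCAT" + "TT" + "CATCAT"

print(flank_looks_repetitive(unique_site, 6, 14))
print(flank_looks_repetitive(inside_repeat, 6, 14))
```

In this toy data the first candidate has single-copy flanks and the second has a flank that recurs three times, which is the signal described above for a candidate that is really part of other TEs.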

In rare cases, an intact TE can be found inside another TE due to nesting. That is why we use additional rules to make sure a candidate is a real nesting event. For the upper limits, our reasoning is this: if a TE is nested within another TE, the combined structure is unlikely to make copies of the entire thing, and highly unlikely to reach a high copy number. A much more likely scenario is that the candidate was picked from a highly repetitive sequence, and is thus a false report.
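The reasoning about nesting and upper limits can be summarized as a small decision function. This is a hedged sketch only: the labels, the function name, and the `max_nested_copies` value are hypothetical placeholders, not EDTA's actual parameters, and the real pipeline applies further checks before accepting a nested element.

```python
def classify_candidate(flank_is_repetitive: bool,
                       candidate_copies: int,
                       max_nested_copies: int = 50) -> str:
    """Toy decision rule for a structurally intact TE candidate.

    max_nested_copies is a made-up placeholder for the upper limit.
    """
    if not flank_is_repetitive:
        # Insertion site is not repetitive: consistent with a real element.
        return "keep"
    if candidate_copies > max_nested_copies:
        # Repetitive flanks plus high copy number: a nested structure is
        # unlikely to have amplified this much, so treat as a false report.
        return "false_positive"
    # Repetitive flanks but low copy number: may be a genuine nested
    # insertion, to be confirmed by additional rules.
    return "possible_nesting"


print(classify_candidate(False, 3))
print(classify_candidate(True, 3))
print(classify_candidate(True, 20000))
```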

You may want to read the LTR_retriever paper for a fuller justification. These criteria also apply to other TEs, with parameters slightly adjusted based on manual curation.

Best, Shujun

SolomiyaHn commented 3 years ago

Hi Shujun, This makes sense! Thank you for the explanation. I have read parts of the LTR_retriever paper and it has also helped me understand some of the other filtering steps in EDTA.

Best, Solomiya