Question about filtering

Hello Shujun,

I have been reading about the EDTA pipeline (the benchmarking paper and the supplementary methods). I have a few questions about the filtering steps. I am a bit confused by which filtering steps are included in the EDTA pipeline and which ones were only used for the benchmarking process.

I have taken some notes on the sequence of the EDTA filtering process. Please correct me where they are wrong. If you could point me towards more information on the filtering process, that would be very helpful.

For Helitrons

filtering out helitron candidates that don’t have a 5’-TC…CTRR-3’ terminal structure
- (is this just part of the helitronscanner program?)
format_helitronscanner_out.pl
- filtering out helitron candidates that don’t have an AT or TT target site
- minimum score of 12
cleanup_proteins.pl (using LTR_retriever included databases of DNA TE, LINE and/or plant coding sequences)
- remove candidates aligning (blastx) to min 30 aa covering >=70%

For both Helitrons and TIRs: Basic filtering

cleanup_tandem.pl (using Tandem Repeats Finder)
- remove Tandem Repeats
- remove missing characters (N’s)
- remove helitron candidates shorter than 100bp
- remove TIR candidates shorter than 80bp
change TIRs to MITEs if shorter than 600bp
Remove redundant sequences (which script is used for this step?)
cleanup_nested.pl
- default parameters
- 5 iterations
compare terminal sequence copy # to insertion site copy #
- remove candidates with only one terminal sequence with high copy# (FP)
- if both terminal ends are abundant but not more than 20,000 times the insertion site, it is a nested TE (why the 20,000 cut off?)
filter out candidates in which more than 15 of the 20 terminal base pairs are SSRs
remove all SSRs from the candidates

Advanced filtering

Reciprocal filtering between LTR, Helitron and TIR libraries (are LTRs included in this step or not since LTR stage0=LTR stage1)
combining all sublibraries into one
Removing nested insertions again. Is it 5 iterations again?
clustering (to what degree are they clustered? Is each sequence representative of a single family as per the Wicker system?) Which script is used for this step?

This output library is then used to mask the genome, and RepeatModeler is run on the unmasked parts of the genome to supplement the library further.

Thank you, Solomiya

Hi Solomiya,

Thank you for reading the paper and supplementary file carefully. I have copied your list of filters and annotated it with my thoughts. My answers to your questions start with "A:".

For Helitrons

[x] filtering out helitron candidates that don’t have a 5’-TC…CTRR-3’ terminal structure
- (is this just part of the helitronscanner program? A: yes.)
format_helitronscanner_out.pl
- [x] filtering out helitron candidates that don’t have an AT or TT target site
- [x] minimum score of 12
- [x] Also: keep the shorter terminal if multiple matching termini are present
~~cleanup_proteins.pl (using LTR_retriever included databases of DNA TE, LINE and/or plant coding sequences)~~
- ~~remove candidates aligning (blastx) to min 30 aa covering >=70%~~
- The details of this step are correct, but it happens during Advanced filtering
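The terminal-structure check in the Helitron list above can be illustrated with a small sketch. It is not the HelitronScanner implementation; the function name `has_helitron_termini` is hypothetical, and the only rule taken from the thread is the 5'-TC…CTRR-3' pattern, where R is the IUPAC code for a purine (A or G).

```python
import re

# 5'-TC ... CTRR-3': starts with TC, ends with CT followed by two purines.
# R is the IUPAC code for A or G.
HELITRON_ENDS = re.compile(r"^TC.*CT[AG][AG]$")

def has_helitron_termini(seq: str) -> bool:
    """Return True if seq starts with TC and ends with CTRR (hypothetical helper)."""
    return HELITRON_ENDS.match(seq.upper()) is not None
```

For example, `has_helitron_termini("TCAAAACTAG")` is true, while a candidate ending in `CTTT` is rejected.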

For both Helitrons and TIRs: Basic filtering

[ ] cleanup_tandem.pl (using Tandem Repeats Finder)
- [x] remove Tandem Repeats
- ~~remove missing characters (N’s)~~ remove candidates with 50kb or 90% missing (tandem repeat converted to Ns)
- [x] remove helitron candidates shorter than 100bp
- [x] remove TIR candidates shorter than 80bp
~~change TIRs to MITEs if shorter than 600bp~~ Not anymore, to avoid confusion.
[x] Remove redundant sequences (which script is used for this step? A: the cleanup_nested.pl script)
~~cleanup_nested.pl~~ this script is used in Advanced filtering
- [x] default parameters
- ~~5 iterations~~ it now can iterate automatically until saturated.
[x] compare terminal sequence copy # to insertion site copy #
- [x] remove candidates with only one terminal sequence with ~~high copy#~~ more than x number of copies (x = Helitron: 5; TIR: 20) (FP)
- [x] if both terminal ends are abundant but not more than 20,000 times the insertion site, it is a nested TE (why the 20,000 cut off? A: this number is determined based on manual curations. The criteria have been updated to determine FP: ($end5_count + $end3_count)/(2*$flank_count) < 10000 and $end5_count < 50000 and $end3_count < 50000)
[x] filter out candidates in which more than 15 of the 20 terminal base pairs are SSRs
[x] remove all SSRs from the candidates
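The length and missing-data filters from the list above can be sketched roughly as follows. This is an illustration, not the cleanup_tandem.pl code: the function name, the parameter names, and the choice of inclusive comparisons are assumptions. Reading "50kb or 90% missing" as "at least 50,000 N characters, or at least 90% N" is also an assumption.

```python
# Minimum candidate lengths named in the thread (bp).
MIN_LENGTH = {"helitron": 100, "tir": 80}

def passes_basic_filter(seq: str, te_type: str,
                        max_n_count: int = 50_000,
                        max_n_ratio: float = 0.9) -> bool:
    """Hypothetical helper: apply the length and missing-data thresholds."""
    if len(seq) < MIN_LENGTH[te_type]:
        return False  # too short for this TE type
    n_count = seq.upper().count("N")  # masked tandem repeats show up as Ns
    if n_count >= max_n_count or n_count / len(seq) >= max_n_ratio:
        return False  # mostly missing data
    return True
```

Whether the real script measures length before or after removing Ns is not stated in the thread, so the sketch uses the raw sequence length.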

Advanced filtering

[x] Reciprocal filtering between LTR, Helitron and TIR libraries (are LTRs included in this step or not since LTR stage0=LTR stage1 A: yes, they are clean enough.)
[x] combining all sublibraries into one
[x] Removing nested insertions ~~again. Is it 5 iterations again?~~
[x] clustering
- to what degree are they clustered? A: a slightly modified parameter: -minlen 80 -miniden 80 -cov 0.95
- Is each sequence representative of a single family as per the Wicker system? A: Mostly yes. For LTR sequences, the LTR region and internal region are separated but with the same family name.
- Which script is used for this step? A: It's cleanup_nested.pl. This is a nice invention and it does many things.
[x] This output library is then used to mask the genome, and RepeatModeler is run on the unmasked parts of the genome to supplement the library further.
- Identified candidates are further cleaned by the LTR-HEL-TIR library (existing sequences are removed) and TEsorter.
- Redundancy is removed using cleanup_nested.pl
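To make the clustering thresholds above concrete, here is a minimal sketch of how the three values could be applied to a single alignment hit. The `Hit` class and `is_redundant` function are hypothetical, and the reading of the parameters (minimum alignment length in bp, minimum percent identity, minimum fraction of the query covered) is an assumption; the script itself is the authority on how it interprets them.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One pairwise alignment of a query sequence against a library sequence."""
    align_len: int    # aligned length in bp
    identity: float   # percent identity, 0-100
    query_len: int    # full length of the query sequence in bp

def is_redundant(hit: Hit, minlen: int = 80,
                 miniden: float = 80.0, cov: float = 0.95) -> bool:
    """Hypothetical check: the hit is long, similar, and covers most of the query."""
    coverage = hit.align_len / hit.query_len
    return (hit.align_len >= minlen
            and hit.identity >= miniden
            and coverage >= cov)
```

Under this reading, a 1,000 bp query that aligns over 960 bp at 85% identity would be treated as redundant, while a 900 bp alignment would not, because it covers only 90% of the query.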

You may want to read the code to get a better understanding of how these steps are arranged. You may find more parameters and treatments there, since the program has been updated frequently based on feedback.

Let me know if you have more questions.

Best, Shujun

Thank you for your quick response! It is very helpful. However, I still don't fully understand the logic behind the filtering step that removes false positives by comparing the terminal end copy number to the insertion site copy number. I would like to understand the different possibilities that the criteria are differentiating between.

If only one terminus has over 5 or 20 (Helitron: 5; TIR: 20) full length copies, are we assuming that the more common terminus is located within the sequence of another TE meaning it is a FP?

If both termini have over 5 or 20 full length copies, does this mean that either the TE is nested within another TE or it is a false positive within another TE?

If both termini add up to being considerably more common (>10000 times) than 2 times the abundance of the insertion site (but less than 50,000 copies each) then it is a false positive. What is the logic behind these criteria? What could be causing such a high ratio of the terminal sequences to undisturbed target site if it is not a nested TE?

Thanks, Solomiya

Hi Solomiya,

Good questions. The underlying logic comes from the structures of TEs, which are what de novo identification relies on. For LTR retrotransposons, the structure is -> --- ->. For TIRs, it is -> --- <-. For Helitrons, it's only TC---CTRR. These structures identify the target TEs, but they also pick up other sequences with similar features. For example, if two LINE elements sit close to each other in the same orientation, the pair looks like an LTR element!
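The arrow notation above can be read as a comparison of the two ends of a candidate. The sketch below is illustrative only: `terminal_structure` and the fixed window size `k` are hypothetical, and real tools align the ends with mismatches allowed rather than requiring exact matches.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def terminal_structure(seq: str, k: int) -> str:
    """Compare the first and last k bases of seq (exact match only).

    "direct"   : -> --- ->  (LTR-like)
    "inverted" : -> --- <-  (TIR-like)
    "none"     : neither
    """
    seq = seq.upper()
    if len(seq) < 2 * k:
        return "none"
    head, tail = seq[:k], seq[-k:]
    if head == tail:
        return "direct"
    if tail == reverse_complement(head):
        return "inverted"
    return "none"
```

Two fragments of the same LINE family lying in the same orientation would also return "direct" here, which is exactly the kind of look-alike described above.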

Annotation programs also use other features, such as TSDs and motifs for LTRs, to reduce false reporting. This helps to some extent, but these features are simple enough to occur in other TEs by chance (TSDs are usually 5bp in LTRs, and the motif is TG...CA). So you will frequently end up in scenarios like this: LINE1_part--TSD_alike--motif_alike--direct_repeat-----direct_repeat--motif_alike--TSD_alike--LINE2_part. It looks like there is an LTR element here, but it is actually composed of parts of two LINE elements - a false positive.

What's the difference between a real LTR and a false case like this? The repetitiveness of the flanking region! The candidate is picked from two TEs, so its flanking sequences are also part of these TEs, which means they are also repetitive. For a real LTR element, the insertion site is less likely to be repetitive, which is why we check the flanking copy number to tell them apart.

In rare cases, an intact TE can be found inside another TE due to nesting. That is why we use additional rules to make sure a case is real nesting. For the upper limits, we work from this belief: if a TE is nested within another TE, the combined structure is less likely to make copies of the entire thing and highly unlikely to reach a high copy number. A much more likely scenario is that it is a false candidate picked from a highly repetitive sequence, hence a false report.
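The copy-number rules discussed in this thread can be written down as a short sketch. The thresholds (Helitron: 5, TIR: 20, ratio 10000, 50000 copies per end) and the expression itself come from the earlier reply. The function names, the reading of "only one terminal sequence" as exactly one end exceeding the threshold, and the handling of a zero flank count are assumptions. The thread does not make clear whether the pipeline keeps or discards a candidate when the expression is true, so the second function only evaluates it.

```python
# Copy-number threshold x per TE type, as given in the thread.
MIN_END_COPIES = {"helitron": 5, "tir": 20}

def one_sided_end(end5_count: int, end3_count: int, te_type: str) -> bool:
    """True if exactly one terminal sequence exceeds x copies (assumed reading)."""
    x = MIN_END_COPIES[te_type]
    return (end5_count > x) != (end3_count > x)

def within_upper_limits(end5_count: int, end3_count: int, flank_count: int,
                        max_ratio: int = 10_000, max_end: int = 50_000) -> bool:
    """Evaluate ($end5_count + $end3_count)/(2*$flank_count) < 10000
    and $end5_count < 50000 and $end3_count < 50000."""
    if flank_count <= 0:
        # The thread does not say how a zero flank count is handled.
        raise ValueError("flank_count must be positive in this sketch")
    ratio = (end5_count + end3_count) / (2 * flank_count)
    return ratio < max_ratio and end5_count < max_end and end3_count < max_end
```

For instance, ends seen 100 and 120 times with a flank seen once give a ratio of 110, which is inside the limits, whereas ends seen 40,000 times each with the same flank give a ratio of 40,000, which is not.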

You may want to read the LTR_retriever paper for more explanations of the justification. These criteria also apply to other TEs with parameters slightly adjusted based on curations.

Best, Shujun

Hi Shujun, This makes sense! Thank you for the explanation. I have read parts of the LTR_retriever paper and it has also helped me understand some of the other filtering steps in EDTA.

Best, Solomiya

oushujun / EDTA

Question about filtering #183