Open aerilli opened 3 months ago
Hi,
The directions of these entries are different and the physical distances between them are too large. The last two entries are close enough, but their TE coordinates overlap substantially (4910-7166 vs 6988-8240), so they cannot be considered a single element.
Thanks! Shujun
Hey Shujun,
Thanks for the clarification! So if a substantial overlap is detected, the entries cannot be considered a single element. However, it is still a bit unclear to me how this translates into the final annotation of this region, which looks like this:
Chr5 EDTA Mutator_TIR_transposon 19872566 19873827 10111 - . ID=TE_homo_95784;Name=VANDAL21;classification=DNA/MULE-MuDR;sequence_ontology=SO:0002280;identity=0.963;method=homology;ID=TE_homo_98670;sequence_ontology=SO:0002280
Chr5 EDTA Mutator_TIR_transposon 19873825 19874206 3057 - . ID=TE_homo_95785;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.968;method=homology;ID=TE_homo_98671;sequence_ontology=SO:0002280
Chr5 EDTA Mutator_TIR_transposon 19873941 19877095 12213 - . ID=TE_homo_95786;Name=VANDAL21;classification=DNA/MULE-MuDR;sequence_ontology=SO:0002280;identity=0.966;method=homology;ID=TE_homo_98672;sequence_ontology=SO:0002280
Chr5 EDTA Mutator_TIR_transposon 19877284 19883063 18267 + . ID=TE_homo_95787;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.976;method=homology;ID=TE_homo_98673;sequence_ontology=SO:0002280
Chr5 EDTA Mutator_TIR_transposon 19883061 19884298 9665 + . ID=TE_homo_95788;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.964;method=homology;ID=TE_homo_98674;sequence_ontology=SO:0002280
In at least two of these cases the overlap is not substantial and the direction is the same.
Many thanks for your support, Shujun! :)
The gff rows you pasted seem to contain extra information compared to the RM out rows. To combine rows, the physical coordinates, the direction, the TE coordinates, and the divergence all need to be considered. If the physical coordinates, direction, and divergence meet the criteria but the TE coordinates overlap substantially, the rows are still considered two elements. If the TE coordinates are far apart but in agreeable directions (the first piece has the smaller 5' coordinates), the rows are still considered a single element. In such a case, the annotated TE has a large deletion.
Shujun
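As a rough illustration of the criteria described in the reply above, the decision could be sketched in Python as follows. This is not the code of combine_RMrows.pl: the `Hit` record, the `can_join` function, the reading of the divergence criterion as a maximum difference between two rows, the handling of the reverse strand, and especially the `max_te_overlap` cutoff of 100 bp are assumptions made for this example, because the thread does not say how much TE-coordinate overlap counts as substantial. The `maxgap` and `maxdiv` defaults are the values from the command quoted later in the thread.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One RepeatMasker .out row, reduced to the fields the reply mentions."""
    chrom: str
    q_begin: int   # genomic start (1-based, inclusive)
    q_end: int     # genomic end
    strand: str    # "+" or anything else for the reverse strand
    family: str
    div: float     # percent divergence
    r_begin: int   # start in the TE consensus, assumed normalized so r_begin <= r_end
    r_end: int     # end in the TE consensus


def can_join(a, b, maxgap=35, maxdiv=3.5, max_te_overlap=100):
    """Return (ok, reason) for joining `a` with the next genomic hit `b`.

    max_te_overlap is a HYPOTHETICAL cutoff for "substantial" overlap.
    """
    if (a.chrom, a.family, a.strand) != (b.chrom, b.family, b.strand):
        return False, "different sequence, family, or strand"
    # Bases between the two hits; negative when they overlap on the genome.
    gap = b.q_begin - a.q_end - 1
    if gap > maxgap:
        return False, "physical gap too large"
    # Assumption: divergence is compared as a difference between the two rows.
    if abs(a.div - b.div) > maxdiv:
        return False, "divergence too different"
    # Assumption: on the reverse strand the later genomic piece carries the
    # smaller consensus coordinates, so the expected order is swapped.
    first, second = (a, b) if a.strand == "+" else (b, a)
    if second.r_begin < first.r_begin:
        return False, "TE coordinates out of order"
    te_overlap = first.r_end - second.r_begin + 1
    if te_overlap > max_te_overlap:
        return False, "TE coordinates overlap substantially"
    # A negative te_overlap is a gap in the consensus, i.e. an internal deletion.
    return True, "joinable"
```

With this hypothetical cutoff, the two entries mentioned at the top of the thread (TE coordinates 4910-7166 and 6988-8240, an overlap of 179 positions) would count as substantially overlapping and stay separate, while two pieces whose TE coordinates leave a gap between them would be joined and reported as one element with an internal deletion.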
Hi, Shujun
Sorry for jumping into this conversation. What we don't understand is why some rows are still not joined even though they meet all the criteria in the script.
Here is the code and small working example I used:
perl combine_RMrows.pl -rmout test -maxgap 35 -maxdiv 3.5
So rows from the same family and on the same strand, with a gap of less than 35 bp and a divergence difference of less than 3.5, will be joined, right?
But look at these three rows:
# before joining
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
30291 4.5 0.2 0.4 Chr3 17485555 17489789 (8669366) + VANDAL12 DNA/Mutator 1 4200 (9966) 64678 *
38777 2.6 0.5 0.2 Chr3 17489775 17494536 (8664619) + VANDAL12 DNA/Mutator 3442 7944 (4030) 64679
26487 1.4 0.2 0.0 Chr3 17494533 17497540 (8661615) + VANDAL12 DNA/Mutator 8849 11860 (114) 64680 *
# after joining
SW_score perc_div. perc_del. perc_ins. query_sequence query_begin query_end query_remain strand matching_repeat repeat_class/family repeat_begin repeat_end repeat_remain ID
30291 4.5 0.2 0.4 Chr3 17485555 17489789 8669366 + VANDAL12 DNA/Mutator 1 4200 (9966) 64678
34020 2.1 0.4 0.1 Chr3 17489775 17497540 8661615 + VANDAL12 DNA/Mutator 3442 11860 (114) 64679_64680
So rows 64679 and 64680 (the ID column) were joined into 64679_64680, but why wasn't 64678 joined with 64679_64680?
✅ Same family (VANDAL12)
✅ Same Strand (+)
✅ Overlapping (17485555-17489789 with 17489775-17497540; 14 bp overlap). How large an overlap does the script tolerate? We don't think this is a substantial overlap.
✅ Divergence (4.5-2.1=2.4)
For anyone interested in this merging behavior: the case I pasted here didn't merge because of the overlap in the repeat consensus coordinates (the last four columns). 1-4200 overlaps 3442-11860 by roughly 760 bp.
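The consensus overlap described above can be checked directly from the repeat begin and end columns of the three rows pasted earlier. The snippet below is only a small sketch for that check: the `parse` and `overlap` helpers are names made up here, and the column positions assume the plus-strand RepeatMasker .out layout shown in this thread.

```python
# The three "before joining" rows from the post above.
rows = """\
30291 4.5 0.2 0.4 Chr3 17485555 17489789 (8669366) + VANDAL12 DNA/Mutator 1 4200 (9966) 64678 *
38777 2.6 0.5 0.2 Chr3 17489775 17494536 (8664619) + VANDAL12 DNA/Mutator 3442 7944 (4030) 64679
26487 1.4 0.2 0.0 Chr3 17494533 17497540 (8661615) + VANDAL12 DNA/Mutator 8849 11860 (114) 64680 *
"""


def parse(line):
    """Pick the ID and the consensus begin/end out of one plus-strand row."""
    f = line.split()
    return {"id": f[14], "r_begin": int(f[11]), "r_end": int(f[12])}


def overlap(end_a, begin_b):
    """Inclusive overlap in bp; a negative value is a gap of that many bp."""
    return end_a - begin_b + 1


hits = [parse(line) for line in rows.splitlines()]
consensus_overlaps = [
    (a["id"], b["id"], overlap(a["r_end"], b["r_begin"]))
    for a, b in zip(hits, hits[1:])
]
for first_id, second_id, bp in consensus_overlaps:
    print(f"{first_id} -> {second_id}: consensus overlap {bp} bp")
# 64678 -> 64679: consensus overlap 759 bp
# 64679 -> 64680: consensus overlap -904 bp
```

The first pair shares several hundred positions of the consensus, which matches the explanation for why 64678 stayed separate. The second pair has a negative value, meaning a gap in the consensus, which is the single-element-with-a-deletion case described earlier in the thread.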
Hi Shujun,
Thanks again for developing this amazing package! I am running the newest v2.2. I manually increased the max divergence for fragments to be combined from 3.5 to 4.5 (at https://github.com/oushujun/EDTA/blob/v2.2.0/EDTA.pl#L694). The fragments below should be combined into two distinct elements. However, this does not seem to happen even though they overlap. This is what the annotation looks like:
The first three and the last two fragments should be merged. The gap in between is 200 bp. From my $genome.out.new:
Do you have any idea why this is happening? Thanks!!