The improvement of cleanup_nested.pl

DrHogart commented 4 years ago

Hi,

I have found that cleanup_nested.pl does not recognize putative copies as belonging to the same family when one copy has several non-overlapping blastn HSPs within the other copy, each of them shorter than the -cov threshold. This is possibly because cleanup_nested.pl considers only the first HSP and skips all the others. The result is a lot of redundancy in the predicted TEs. Modifying the script accordingly would likely improve EDTA significantly.
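To make the point concrete, here is a minimal Python sketch (not the Perl code in cleanup_nested.pl). The function names, the HSP coordinates, and the 0.9 threshold are assumptions chosen for illustration. It compares the coverage computed from the first HSP alone with the coverage from the union of all HSPs on the query.

```python
from typing import List, Tuple

# Hypothetical HSP representation: 1-based inclusive (start, end) on the query,
# with start <= end.
Interval = Tuple[int, int]


def merged_length(hsps: List[Interval]) -> int:
    """Number of query bases covered by at least one HSP (overlaps counted once)."""
    covered = 0
    cur_end = 0  # last position already counted; coordinates start at 1
    for start, end in sorted(hsps):
        if end <= cur_end:
            continue  # HSP lies entirely inside a region already counted
        covered += end - max(start, cur_end + 1) + 1
        cur_end = end
    return covered


def first_hsp_coverage(hsps: List[Interval], query_len: int) -> float:
    """Coverage when only the first reported HSP is considered."""
    start, end = hsps[0]
    return (end - start + 1) / query_len


def merged_coverage(hsps: List[Interval], query_len: int) -> float:
    """Coverage when all HSPs between the two sequences are combined."""
    return merged_length(hsps) / query_len


# Made-up example: three non-overlapping HSPs on a 1000-bp query.
example_hsps = [(1, 350), (351, 700), (720, 1000)]
query_len = 1000
cov_threshold = 0.9  # illustrative value, standing in for -cov

print(first_hsp_coverage(example_hsps, query_len) >= cov_threshold)  # first HSP only
print(merged_coverage(example_hsps, query_len) >= cov_threshold)     # all HSPs combined
```

In this made-up case the first HSP covers 35% of the query and fails the threshold, while the three HSPs together cover 98.1% and pass it.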

All the best, Sergei.

DrHogart commented 3 years ago

I have modified the cleanup_nested.pl script to handle multiple HSPs between the query and the subject sequences. I have not tested how the modified script changes the overall performance of EDTA prediction, but it reduces the number of output sequences by 10-20%. I'll be happy to get any feedback. Cheers.

cleanup_nested_mod.pl

oushujun commented 3 years ago

Hi Sergei,

Thank you for spotting the potential improvement and providing the script. I will take a closer look. To clarify, HSP here means "High-scoring Segment Pair" not "Heat Shock Protein", right?

To test the performance of the modified EDTA, you may use the benchmarking platform provided in this pipeline: https://github.com/oushujun/EDTA#benchmarking. You will need to run EDTA on the rice genome and test the whole-genome annotation with the curated standard provided in ./database.

Best, Shujun

DrHogart commented 3 years ago

To clarify, HSP here means "High-scoring Segment Pair" not "Heat Shock Protein", right?

Right.

One other note: to visualize the overlapping and non-overlapping HSPs between query and subject, I blast the FASTA file against itself.
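A rough Python sketch of how such a self-BLAST result could be inspected, assuming the hits were written in the default 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The thread does not say which output format was used, and the sequence IDs and numbers below are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_hsps(lines: Iterable[str]) -> Dict[Tuple[str, str], List[Tuple[int, int]]]:
    """Group query intervals (qstart, qend) by (query, subject) pair.

    Self-hits are dropped, since every sequence trivially matches itself
    in a library-versus-library search.
    """
    pairs: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        if qseqid == sseqid:
            continue
        qstart, qend = int(fields[6]), int(fields[7])
        pairs[(qseqid, sseqid)].append((qstart, qend))
    return dict(pairs)


# Made-up tabular lines; the IDs are hypothetical.
example_lines = [
    "seqA\tseqA\t100.0\t1000\t0\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "seqA\tseqB\t95.0\t350\t17\t0\t1\t350\t200\t549\t1e-150\t540",
    "seqA\tseqB\t94.0\t350\t21\t0\t351\t700\t900\t1249\t1e-145\t520",
    "seqA\tseqB\t96.0\t281\t11\t0\t720\t1000\t1500\t1780\t1e-120\t450",
]

for pair, hsps in group_hsps(example_lines).items():
    print(pair, len(hsps), "HSPs:", hsps)
```

Pairs that come back with several short, non-overlapping intervals are the cases discussed in this thread.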

oushujun commented 3 years ago

Thanks for the clarification. I tried to incorporate your script in EDTA and tested it on the rice genome. The final library is 1.9Mb larger (1.5Mb from Helitron) than the original 19-Mb library, which is reflected in the benchmark results: the FDR of Helitrons increased from 67% (pretty bad, v1.8.4) to 80.5% (even worse):

Category	Methods	sens	spec	accu	prec	FDR	F1
ltr	EDTA_denovo	0.942	0.990	0.978	0.966	0.034	0.954
helitron	EDTA_denovo	0.786	0.871	0.867	0.195	0.805	0.312
tir	EDTA_denovo	0.725	0.908	0.874	0.645	0.355	0.682
mite	EDTA_denovo	0.434	0.981	0.947	0.603	0.397	0.505
sine	EDTA_denovo	0.245	1.000	0.996	1.000	0.000	0.394
line	EDTA_denovo	0.268	1.000	0.984	0.941	0.059	0.417
nonltr	EDTA_denovo	0.263	1.000	0.980	0.953	0.047	0.412
classified	EDTA_denovo	0.960	0.837	0.895	0.841	0.159	0.896
total	EDTA_denovo	0.964	0.830	0.894	0.836	0.164	0.896
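As a reading aid for the table, a short Python check, assuming the standard definitions FDR = 1 - precision and F1 = harmonic mean of sensitivity and precision (the thread does not state how the benchmark script computes them). The values are copied from three rows of the table above. Because the table entries are rounded to three decimals, the recomputed F1 can differ in the last digit.

```python
def fdr_from_precision(prec: float) -> float:
    """False discovery rate under the standard definition FDR = 1 - precision."""
    return 1.0 - prec


def f1_score(sens: float, prec: float) -> float:
    """Harmonic mean of sensitivity (recall) and precision."""
    return 2.0 * sens * prec / (sens + prec)


# (sens, prec, FDR, F1) as reported in the table.
reported = {
    "ltr": (0.942, 0.966, 0.034, 0.954),
    "helitron": (0.786, 0.195, 0.805, 0.312),
    "tir": (0.725, 0.645, 0.355, 0.682),
}

for category, (sens, prec, fdr, f1) in reported.items():
    print(
        category,
        "FDR reported/recomputed:", fdr, round(fdr_from_precision(prec), 3),
        "F1 reported/recomputed:", f1, round(f1_score(sens, prec), 3),
    )
```

Under these definitions the Helitron row shows why the library is a concern: sensitivity is reasonable (0.786), but fewer than one in five annotated Helitron bases is correct (precision 0.195), which drags F1 down to about 0.31.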

Can you provide some examples of how this may improve the library?

Best, Shujun

oushujun commented 3 years ago

Do you think it's the $touched_seq{$sbj}=1 scheme in the code that makes it less aggressive?

DrHogart commented 3 years ago

Thank you for the benchmark! I didn't test the script in this way.

I decided to modify your script because I noticed that a lot of redundancy remains after running the regular cleanup_nested.pl on a library of TE sequences. To understand the nature of this redundancy, I blasted the FASTA against itself and found many cases where two sequences share several non-overlapping HSPs. Each of these HSPs falls below the 0.8 cutoff, so such sequences were not recognized as nested. My version of the script tries to take this observation into account.

For example, given a library of 1230 INT sequences, cleanup_nested.pl reduces it to 330 unique sequences, while my script reduces it to 298. I have tested the script mostly on TIRs and INTs (after removing LTRs and tandem repeats identified by TRF) in four species.

I introduced $touched_seq{$sbj}=1 because, once some substring (say, substring 1) is removed from the subject sequence, the length of the subject and the positions of the other nested substrings (2, 3, ...) that correspond to other TEs change. Those substrings therefore cannot be removed correctly in the current iteration, so further modification of the subject sequence is frozen until the next iteration. The $touched_seq{$sbj}=1 statement implements this freeze.
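The freeze can be sketched in a few lines of Python. This is an illustration of the idea only, not a translation of cleanup_nested_mod.pl: the function name, the 0-based half-open coordinates, and the toy sequence are assumptions. The `touched` set plays the role of `$touched_seq{$sbj}`.

```python
from typing import Dict, List, Tuple

# Hypothetical hit representation: (subject ID, start, end), 0-based half-open,
# marking a nested region to cut out of the subject sequence.
Hit = Tuple[str, int, int]


def one_pass(seqs: Dict[str, str], hits: List[Hit]) -> List[Hit]:
    """Apply at most one removal per subject; return the hits that were deferred.

    After the first removal the subject is shorter, so the coordinates of its
    remaining hits are stale. Those hits are skipped and must be recomputed
    (for example by a fresh BLAST run) in the next iteration.
    """
    touched = set()
    deferred: List[Hit] = []
    for sbj, start, end in hits:
        if sbj in touched:
            deferred.append((sbj, start, end))  # frozen until the next iteration
            continue
        seq = seqs[sbj]
        seqs[sbj] = seq[:start] + seq[end:]  # cut out the nested region
        touched.add(sbj)
    return deferred


# Toy subject with two nested regions: CCCC at [4, 8) and TTTT at [12, 16).
seqs = {"sbj1": "AAAACCCCGGGGTTTT"}
hits = [("sbj1", 4, 8), ("sbj1", 12, 16)]

deferred = one_pass(seqs, hits)
print(seqs["sbj1"])  # CCCC has been removed
print(deferred)      # the TTTT hit is deferred; its old coordinates no longer apply
```

After the first removal the subject is 12 bases long and TTTT sits at [8, 12), so applying the original [12, 16) coordinates would remove nothing. Deferring the hit avoids acting on stale positions.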

oushujun commented 3 years ago

Thanks for the explanation. Can you provide some sample sequences to test on?

DrHogart commented 3 years ago

Please find the sample file below.

for_testing.fasta

These are the INTs of EDTA-predicted TEs after removing the tandem repeats and LTRs. My results:

fgrep -c '>' for_testing.fasta
# 697
perl ../../EDTA/util/cleanup_nested.pl -in for_testing.fasta -threads 8 -minlen 80 -cov 0.90 -miniden 80
fgrep -c '>' for_testing.fasta.cln
# 355
perl ../../distrib/cleanup_nested_mod.pl -in for_testing.fasta -threads 8 -minlen 80 -cov 0.90 -miniden 80
fgrep -c '>' for_testing.fasta.cln
# 296

oushujun commented 3 years ago

I got the chance to check these sequences. Many of them contain short tandem repeats such as (TA)n, (AC)n, and (AGCAGCAGCAGCATC)n, which are found in LTRRT_466, LTRRT_539, LTRRT_830, LTRRT_832, LTRRT_354, and LTRRT_822. The modified script does help reduce the redundancy of these parts. I made a slightly modified version that outputs the number of overlapping HSPs in the .stat file.

cleanup_nested_mod2.pl.zip

The poor Helitron performance in the benchmark above is probably due to a previous fix (8c6975a) that recovers candidates from the negative strand, which apparently does not improve the results. This script appears to reduce the library size without obvious negative impacts on the annotation results. I will find a chance to incorporate it into the next release. Thanks for improving the code!

Best, Shujun

DrHogart commented 3 years ago

Hi, thanks for the feedback and the further improvement of the code. It was my pleasure to make a small contribution to your cool and useful pipeline.

oushujun / EDTA

The improvement of cleanup_nested.pl #136