rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Benchmarking of RepeatMasker #229

Closed isabelladistefano closed 7 months ago

isabelladistefano commented 11 months ago

Hello

For our studies, we are benchmarking several TE tools, including RepeatMasker. We compare the output of RepeatMasker to the published TAIR transposable elements of Arabidopsis thaliana chromosome 1.



RepeatMasker parameters (version 4.0.9): -a -s -no_is -xsmall -nolow, run on the newest TAIR Arabidopsis thaliana genome.

https://www.arabidopsis.org/ - TAIR publishes 7135 transposable elements in Arabidopsis thaliana chromosome 1.



When intersecting the RepeatMasker results with the TAIR results using bedtools intersect -u -a TAIR_TEs.gff -b Repeatmasker.fas.out.gff, there are only 3951 intersections, meaning the RepeatMasker result represents only 55.4% of the transposable elements in Arabidopsis thaliana chromosome 1. This is before checking whether the classes/families match.
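To make the comparison above concrete, here is a minimal Python sketch of the kind of counting that `bedtools intersect -u` performs: each feature in the first file is reported at most once if it overlaps any feature in the second file by at least one base. The function names and the toy coordinates are illustrative assumptions, not taken from the actual TAIR or RepeatMasker files; only the 3951 and 7135 counts come from the post.

```python
from collections import defaultdict


def parse_gff(lines):
    """Yield (seqid, start, end) from GFF lines (1-based, inclusive coordinates)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        yield fields[0], int(fields[3]), int(fields[4])


def count_overlapped(a_features, b_features):
    """Count A features that overlap at least one B feature by >= 1 bp.

    Each A feature is counted once, however many B features it overlaps.
    Brute force per sequence; fine for a sketch, slow for whole genomes.
    """
    b_by_seq = defaultdict(list)
    for seqid, start, end in b_features:
        b_by_seq[seqid].append((start, end))
    hits = 0
    for seqid, start, end in a_features:
        candidates = b_by_seq.get(seqid, ())
        if any(b_start <= end and start <= b_end for b_start, b_end in candidates):
            hits += 1
    return hits


# Hypothetical toy features standing in for TAIR_TEs.gff and the RepeatMasker GFF.
tair_toy = [("Chr1", 100, 200), ("Chr1", 300, 400), ("Chr1", 500, 600)]
rm_toy = [("Chr1", 150, 160), ("Chr1", 190, 310), ("Chr2", 500, 600)]
toy_hits = count_overlapped(tair_toy, rm_toy)

# The percentage quoted in the post, from the reported counts.
reported_pct = round(100 * 3951 / 7135, 1)
print(toy_hits, reported_pct)
```

Note that the sequence IDs must match exactly between the two files (for example "Chr1" versus "1"); a naming mismatch would silently reduce the number of intersections.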

Could you please help us find an explanation for this, so that we can use RepeatMasker to safely annotate TEs of other non-model Brassicaceae species?



Best wishes,



Isabella

JMStorer commented 10 months ago

Greetings, Isabella!

Based on the settings you used, there doesn't appear to be a species set. If no "-species" is set, the default is to use the human-specific library. In addition, I recommend not using the "-nolow" flag, as it will likely introduce many false positive results in your output.
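As a rough illustration of this advice, the following Python sketch assembles a revised command line from the flags in the original post, adding "-species" and dropping "-nolow". The helper name, the input file name chr1.fas, and the species value "arabidopsis" are assumptions made for this example; check the RepeatMasker help output for the species names your installed library accepts. The sketch only builds and prints the command; it does not run RepeatMasker.

```python
import shlex

# Flags from the original post.
ORIGINAL_FLAGS = ["-a", "-s", "-no_is", "-xsmall", "-nolow"]


def build_repeatmasker_cmd(fasta, species, drop_flags=("-nolow",)):
    """Return an argument list with -species set and the unwanted flags removed."""
    kept = [flag for flag in ORIGINAL_FLAGS if flag not in drop_flags]
    return ["RepeatMasker", "-species", species, *kept, fasta]


# "chr1.fas" is a hypothetical input file name.
cmd = build_repeatmasker_cmd("chr1.fas", "arabidopsis")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) once the species name and input path have been confirmed.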