rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to prepare a Helitron library? #170

Closed xiekunwhy closed 1 year ago

xiekunwhy commented 1 year ago

Hi,

I used EAHelitron (https://github.com/dontkme/EAHelitron) to get Helitron results from the genome I am annotating, but I don't know which of its results can be used to prepare a RepeatMasker library. Do you have any suggestions?

Here are the output descriptions: [image]

Best, Kun

rmhubley commented 1 year ago

I am not familiar with this tool, but reading the output descriptions you included, it appears that it generates Helitron annotations (copies of a general TE class, not separated into families). I am not sure how you intend to use RepeatMasker, but if you want to identify more instances in the same genome, or search other genomes for related sequences using RepeatMasker, you will first want to identify families within this set of results and build consensus sequences or profile HMMs for those families.

To do this, you could apply a clustering approach to identify groups of similar sequences, build a multiple alignment of each cluster, and develop a consensus for each cluster. This would give you a curated set of Helitron families on which to base a detailed annotation of the genome using RepeatMasker. This is something other de novo tools do for you automatically.

Alternatively, you could simply pass all the copies you identified to RepeatMasker as a "-lib" library if: 1. the size of that library is manageable, and 2. you do not care about the details of the annotations as much as you care about finding all regions of the genome that align to this grab bag of Helitrons.

Does that answer your question?
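The consensus step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how RepeatMasker or any de novo tool builds its libraries: it assumes each cluster has already been aligned with an external multiple-alignment tool (not shown), it uses a simple majority vote per column, and the names `majority_consensus`, `fasta_record`, `HelitronFam1` and the toy sequences are all hypothetical. The `name#class` header with `RC/Helitron` follows the naming convention commonly used in RepeatMasker custom libraries and is assumed here as well.

```python
from collections import Counter


def majority_consensus(aligned, gap="-", min_coverage=0.5):
    """Majority-rule consensus of equal-length, pre-aligned sequences.

    Columns where fewer than `min_coverage` of the copies carry a base
    are dropped, so insertions present in only a few copies are ignored.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        bases = [b.upper() for b in column if b != gap]
        if len(bases) < min_coverage * len(aligned):
            continue  # mostly gaps: skip this column
        # Most frequent base in the column (ties go to the first one seen).
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


def fasta_record(name, seq, te_class="RC/Helitron", width=60):
    """Format one family as FASTA with a 'name#class' header (assumed convention)."""
    lines = [f">{name}#{te_class}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


# Toy alignment standing in for one cluster of Helitron copies (made-up data).
cluster = [
    "TCTACTA-CGGCTAG",
    "TCTACTAACGGCTAG",
    "TCTATTA-CGGCTAG",
    "TCTACTA-CGGTTAG",
]

record = fasta_record("HelitronFam1", majority_consensus(cluster))
print(record, end="")
```

Writing one such record per cluster into a single FASTA file would give a family-level library that could then be supplied through the "-lib" option mentioned above. Real consensus building usually handles ambiguity codes, alignment edges and subfamily structure, which this sketch deliberately leaves out.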