lh3 / srf

SRF: Satellite Repeat Finder
MIT License

Annotating high-frequency TEs in repeat-rich plant genomes #8

Closed baozg closed 1 year ago

baozg commented 1 year ago

Hi, @lh3

I saw a commit about filtering LTR output in SRF. Could SRF also be used to annotate TEs in genomes with high repeat content, such as maize, especially for pangenomes? Does SRF make any assumptions about these sequences?

Another reason for asking about TEs is that many plant centromeres consist of both tandem repeats and TEs. I tried using SRF to annotate the Arabidopsis thaliana pangenome, and it output 22 LTRs (including LTR/Gypsy/Athila, which is centromere-specific) among the 253 sequences.

lh3 commented 1 year ago

SRF is intended to find long tandem repeat patterns. It may find centromeric TEs that are tandemly repeating. It may occasionally find some non-centromeric LTRs by chance, but it will miss many. RepeatModeler may be a more suitable tool for general TEs.

baozg commented 1 year ago

Thanks for the kind explanation. For TE annotation of plant genomes, we prefer EDTA because of its accuracy and speed. RepeatModeler can be very slow and generates a lot of unknown TEs in the consensus library. However, EDTA is sometimes slowed down significantly by satellite repeats in high-quality genomes. Even though SRF is designed only for satellite repeats, we could use it to get TE-free satellite DNA and mask that before TE annotation.
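To make the masking idea concrete, here is a minimal sketch in Python. It assumes the satellite locations are already available as BED-style intervals (0-based, half-open), for example obtained by aligning the satellite consensus sequences back to the assembly. The function names `parse_bed` and `mask_intervals` are hypothetical helpers for illustration only; they are not part of SRF or EDTA.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]


def parse_bed(lines: Iterable[str]) -> Dict[str, List[Interval]]:
    """Group BED intervals (contig, start, end, ...) by contig name.

    Assumes tab-separated BED records; blank lines, comments and
    track/browser header lines are skipped.
    """
    by_contig: Dict[str, List[Interval]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        by_contig[fields[0]].append((int(fields[1]), int(fields[2])))
    return dict(by_contig)


def mask_intervals(seq: str, intervals: Iterable[Interval], hard: bool = True) -> str:
    """Mask 0-based, half-open [start, end) intervals in a sequence.

    hard=True replaces bases with 'N'; hard=False lower-cases them (soft mask).
    Coordinates outside the sequence are clipped.
    """
    out = list(seq)
    for start, end in intervals:
        start = max(0, start)
        end = min(len(out), end)
        for i in range(start, end):
            out[i] = "N" if hard else out[i].lower()
    return "".join(out)


if __name__ == "__main__":
    # Toy data; a real run would read the assembly FASTA and a BED file.
    bed = parse_bed(["chr1\t2\t5"])
    print(mask_intervals("ACGTACGTAC", bed["chr1"]))
    print(mask_intervals("ACGTACGTAC", bed["chr1"], hard=False))
```

Whether hard masking (`N`) or soft masking (lower case) is appropriate depends on how the downstream TE annotation tool treats masked sequence, so that choice is left as a parameter here.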

Anyway, thanks for developing SRF for de novo reconstruction of the dark matter of the complete genome.