rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Is it possible to regenerate Kimura distances from gff? #189

Closed lukesarre closed 1 year ago

lukesarre commented 1 year ago

Hi all,

We are working with a repeat annotation in .gff format that was generated with RepeatMasker, but has since undergone some manual filtering.

The original alignment files have been lost, but as well as our annotation .gff (which includes the family ID for each repeat) we do also have the reference library fasta and of course the genome fasta.

In theory I would imagine it should be possible to extract the repeat sequences from our genome using the gff, align them to the relevant sequence from the reference library, and generate Kimura distances for each locus.
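As a rough illustration of the first step (pulling each annotated locus out of the genome), here is a minimal Python sketch. It assumes the genome has already been loaded into a dict mapping sequence names to strings, and that the GFF follows the standard nine tab-separated columns with 1-based inclusive coordinates. The names `extract_gff_regions` and `reverse_complement` are hypothetical helpers, not part of RepeatMasker.

```python
# Complement table for unambiguous bases plus N (other IUPAC codes pass through unchanged).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_gff_regions(gff_lines, genome):
    """Yield (seqid, start, end, strand, attributes, sequence) for each GFF record.

    gff_lines: iterable of text lines from a GFF file.
    genome:    dict of sequence name -> sequence string (assumed preloaded).
    """
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF record
        seqid, strand, attributes = cols[0], cols[6], cols[8]
        start, end = int(cols[3]), int(cols[4])
        # GFF coordinates are 1-based and inclusive; Python slices are 0-based, end-exclusive.
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = reverse_complement(seq)
        yield seqid, start, end, strand, attributes, seq
```

The family ID would then be parsed out of the attributes column to pick the matching library sequence; how it is encoded there depends on how the GFF was written, so that part is left out.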

Is there a way to use RepeatMasker in this way? It would be much simpler for us than restarting from the beginning to generate the alignment files and repeating our manual filtering.

Thank you in advance, Luke

P.S. In our gff, the sixth column is a 'score' column. What does this score refer to? It would save a lot of hassle if it is indeed the Kimura distance!

rmhubley commented 1 year ago

Sorry for the delay. The quick answer is that it would be very difficult to reconstruct the alignment data from the GFF in a way that reproduces what RepeatMasker originally gave you. That is because the GFF format doesn't record how the alignment was obtained: the matrix used, the gap parameters, the GC background of the genomic sequence batch in which the alignment was found, and so on. The score you refer to is a product of those parameters and will not directly translate into a Kimura distance. However, as you point out, you could re-align the GFF regions to the corresponding family using your own set of alignment parameters and derive a distance (using any metric) from that data. I suspect that doing so will not be as easy as simply re-running RepeatMasker.
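For anyone who does go the re-alignment route, the distance step itself is small. Below is a minimal Python sketch of the standard Kimura two-parameter distance, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions among the compared columns. It assumes you already have a gapped pairwise alignment (two equal-length strings) from an aligner of your choice. The function name `kimura_2p` is hypothetical, and columns containing gaps or ambiguous bases are simply skipped, which is one convention among several. It applies no CpG correction, so the values should not be expected to match divergences reported by RepeatMasker's own tools.

```python
import math

PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def kimura_2p(aln_a, aln_b):
    """Kimura two-parameter distance between two aligned sequences.

    Returns None when no columns can be compared or when the
    log argument is non-positive (the pair is saturated).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    transitions = transversions = compared = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        # Skip gap columns and ambiguous bases (an assumed convention).
        if a not in BASES or b not in BASES:
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        return None

    p = transitions / compared
    q = transversions / compared
    x = 1.0 - 2.0 * p - q
    y = 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0:
        return None  # distance undefined under this model
    return -0.5 * math.log(x * math.sqrt(y))
```

Calling `kimura_2p(locus_aligned, family_aligned)` once per locus gives one value per GFF record; the result depends on the alignment parameters chosen, as noted above.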

lukesarre commented 1 year ago

Thank you for the very helpful response, and also thank you for developing this useful tool! Luke