rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

The meaning of milliDiv #178

ZunpengLiu closed this issue 1 year ago

ZunpengLiu commented 1 year ago

Hello,

Thank you very much for developing and maintaining this powerful software! I have a question about the output of RepeatMasker: what is the meaning of the milliDiv field (= base mismatches in parts per thousand)? Does it measure the difference/diversity between different loci of elements from the same RE subfamily?

I would be very grateful if you could explain this in more detail.

Thanks!

Best,

Zunpeng

rmhubley commented 1 year ago

I believe you must be referring to the 2015 UCSC genome track we built for hg38/mm10. milliDiv was a field in their database schema that stored the percent divergence from a typical RepeatMasker .out annotation line. The percent divergence in a RepeatMasker .out file is calculated as the number of substitutions in the sequence alignment divided by the length of the aligned genomic sequence. It is typically reported to one decimal place. For example, here is a line from a *.out file:

  292   33.8  3.9  1.5  seq-13    3864604 3864732 (1150141) + MIRb          SINE/MIR             3    134  (134)   1  

Here the MIRb annotation comes from an alignment with 33.8% substitutions (a fraction of 0.338). The data structures supporting that track needed an integer to store this floating-point value, so the fraction was multiplied by 1000 and any remaining fractional component was dropped before the value was stored in their database. The 33.8% above is therefore stored as a milliDiv of 338, i.e. parts per thousand.
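
To make the conversion concrete, here is a minimal Python sketch. It is not the code used to build the UCSC track; the function names (`parse_out_line`, `percent_to_milli`) and the dictionary keys are hypothetical, and only the whitespace-separated columns visible in the example line above are assumed. Because .out files report percentages to one decimal place, multiplying the percentage by 10 is the same as multiplying the fraction by 1000; the sketch rounds before converting to an integer only to guard against floating-point representation error.

```python
# Example annotation line copied from the .out excerpt above.
EXAMPLE = ("  292   33.8  3.9  1.5  seq-13    3864604 3864732 (1150141) "
           "+ MIRb          SINE/MIR             3    134  (134)   1  ")


def parse_out_line(line: str) -> dict:
    """Split one .out annotation line into a few named fields.

    Hypothetical helper: the column positions are taken from the
    example line, not from a formal format specification.
    """
    fields = line.split()
    return {
        "score": int(fields[0]),
        "perc_div": float(fields[1]),   # percent substitutions
        "perc_del": float(fields[2]),   # percent deletions
        "perc_ins": float(fields[3]),   # percent insertions
        "query": fields[4],
        "strand": fields[8],
        "repeat": fields[9],
        "repeat_class": fields[10],
    }


def percent_to_milli(percent: float) -> int:
    """Convert a percentage (e.g. 33.8) to parts per thousand (338).

    percent / 100 * 1000 == percent * 10. round() is an assumption of
    this sketch to avoid results such as 337.9999 truncating to 337.
    """
    return int(round(percent * 10))


record = parse_out_line(EXAMPLE)
milli_div = percent_to_milli(record["perc_div"])
print(record["repeat"], milli_div)  # prints: MIRb 338
```

Going the other way, dividing a milliDiv value by 10 recovers the percent divergence shown in the .out file (338 / 10 = 33.8).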