rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Will N have an impact on blastn if RepeatMasker uses hard masking and generates N from repeated sequences? #260

Open helloworldABCD1234 opened 2 weeks ago

helloworldABCD1234 commented 2 weeks ago

Will the Ns have an impact on blastn if RepeatMasker uses hard masking and replaces repeated sequences with N? For example, is ATCGGGCTNNTTT treated as the same sequence as ATCGGGCTTTT? Or do ATCGGGCTNNNNTTT and ATCGGGCTTTTT have the same effect as blastn input?

rmhubley commented 2 weeks ago

This is really a question about scoring matrices and gap parameters more than about rmblastn. RepeatMasker uses scoring matrices in which a substitution from N to any other base is slightly penalized (-1). Short runs of Ns will therefore easily align to bases when they correctly span between two non-N strings, while long runs will terminate the alignment (perhaps generating another alignment for the non-N sequence that follows). The gap open/extension penalties also play into this: they are much higher than the N substitution penalty, so the alignment will rarely span the Ns with a gap.

For example, if I use your example and an absurdly low cutoff score, I get the following with a similar matrix and gap parameters:

72 0.00 0.00 0.00 t3 1 11 (0) t1 1 11 (2)

  t3                     1 ATCGGGCTTTT 11
                                   ?? 
  t1                     1 ATCGGGCTNNT 11
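
To make the trade-off concrete, here is a minimal sketch of the scoring logic, not the actual RepeatMasker/rmblastn implementation. Only the N penalty of -1 comes from the explanation above; `MATCH`, `MISMATCH`, `GAP_OPEN`, `GAP_EXTEND`, and the affine gap convention are hypothetical values chosen for illustration, so the numbers will not reproduce the score of 72 shown in the alignment.

```python
# Illustrative values only; NOT the real RepeatMasker matrix or gap settings.
MATCH = 9         # assumed score for two identical bases
MISMATCH = -15    # assumed score for two different non-N bases
N_PENALTY = -1    # N against any base (the value mentioned above)
GAP_OPEN = -30    # assumed gap open penalty
GAP_EXTEND = -5   # assumed gap extension penalty


def pair_score(a: str, b: str) -> int:
    """Score one aligned column of two bases."""
    if a == "N" or b == "N":
        return N_PENALTY
    return MATCH if a == b else MISMATCH


def ungapped_score(query: str, subject: str) -> int:
    """Sum the column scores of two equal-length, gap-free sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


def gap_cost(length: int) -> int:
    """Cost of one gap, assuming open is charged once and extend per base."""
    return GAP_OPEN + GAP_EXTEND * length


# The 11-base alignment from the example: 9 matches and 2 N columns.
print(ungapped_score("ATCGGGCTTTT", "ATCGGGCTNNT"))  # 9 * 9 + 2 * (-1) = 79

# Crossing two Ns by substitution versus skipping them with a gap.
print(2 * N_PENALTY, gap_cost(2))  # -2 versus -40
```

Under these assumed values, aligning through a short run of Ns costs far less than opening a gap, which is why the aligner prefers to pair the Ns with bases. A long run of Ns keeps subtracting from the score with no matches to offset it, so the alignment eventually ends instead.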

Does that answer your question?