scoring alignments with invalid sequence data

bdallen-uchicago commented 9 years ago

Are there any established conventions for dealing with invalid data when doing Smith-Waterman (S-W) alignment? I think a reasonable strategy is to assign a special score when one or both of the characters are invalid. Some options:

always use 0
use the average of every score in the matrix
use the minimum of every score in the matrix
if one character is valid, use the average of every pair in the matrix involving that code; otherwise use the overall average
use the gap penalty
take an 'invalid penalty' parameter that defaults to one of the above
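
For comparison, the options above could be sketched as one scoring function with a selectable fallback policy. This is only a rough illustration: `TOY_MATRIX`, `pair_score`, the policy names, and the default gap penalty of -4 are all made up for the example, and the matrix values are not taken from any real substitution matrix.

```python
from statistics import mean

# Toy symmetric substitution matrix over a 3-letter alphabet.
# Illustrative values only, not taken from BLOSUM or PAM.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "G"): -2,
    ("C", "A"): 0, ("C", "C"): 9, ("C", "G"): -3,
    ("G", "A"): -2, ("G", "C"): -3, ("G", "G"): 6,
}


def pair_score(matrix, a, b, policy="zero", gap_penalty=-4):
    """Score the pair (a, b); fall back to `policy` if either code is invalid."""
    if (a, b) in matrix:
        return matrix[(a, b)]

    valid = {x for x, _ in matrix}
    if policy == "zero":
        return 0
    if policy == "average":
        return mean(matrix.values())
    if policy == "minimum":
        return min(matrix.values())
    if policy == "row_average":
        # If exactly one code is valid, average over that code's row.
        # The matrix is symmetric, so the row covers every pair involving it.
        known = a if a in valid else b if b in valid else None
        if known is None:
            return mean(matrix.values())
        return mean(s for (x, _), s in matrix.items() if x == known)
    if policy == "gap":
        return gap_penalty
    raise ValueError(f"unknown policy: {policy}")
```

The last option in the list (a user-supplied 'invalid penalty') would amount to passing a fixed number instead of a policy name.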

It also seems like there is a point at which the sequence is obviously garbage, and not just noisy. Including a lot of garbage alignments in a database search could distort the statistics. Having a threshold, e.g. requiring more than 50% valid codes, seems reasonable here.
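
A minimal sketch of such a threshold, assuming the 20 standard amino acid codes count as "valid" and everything else does not. The names `VALID_CODES`, `valid_fraction`, and `passes_threshold` are hypothetical, and the 50% cutoff is just the example value from above.

```python
# The 20 standard amino acid one-letter codes (assumed definition of "valid").
VALID_CODES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def valid_fraction(seq, valid=VALID_CODES):
    """Return the fraction of characters in `seq` that are valid codes."""
    if not seq:
        return 0.0
    return sum(c in valid for c in seq.upper()) / len(seq)


def passes_threshold(seq, min_valid=0.5, valid=VALID_CODES):
    """True if strictly more than `min_valid` of the codes are valid."""
    return valid_fraction(seq, valid) > min_valid
```

Sequences failing the check would be dropped before the database search instead of being scored.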

bdallen-uchicago commented 9 years ago

Similarly, are there any established conventions for scoring the ambiguous amino acid codes X, B, and Z? Averaging the scores across all the possible values seems like the best approach to me.
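
To make the averaging idea concrete, here is a rough sketch. It assumes B stands for D or N, Z for E or Q, and X for any residue in the alphabet. `TOY`, `expand`, and `averaged_score` are names invented for the example, and the matrix is a small illustrative one restricted to four residues.

```python
from itertools import product
from statistics import mean

# Ambiguity codes: B = Asp (D) or Asn (N), Z = Glu (E) or Gln (Q).
AMBIGUOUS = {"B": "DN", "Z": "EQ"}

# Small illustrative matrix over D, N, E, Q only (one entry per unordered pair).
_HALF = {
    ("D", "D"): 6, ("N", "N"): 6, ("E", "E"): 5, ("Q", "Q"): 5,
    ("D", "N"): 1, ("D", "E"): 2, ("D", "Q"): 0,
    ("N", "E"): 0, ("N", "Q"): 0, ("E", "Q"): 2,
}
TOY = {}
for (x, y), s in _HALF.items():
    TOY[(x, y)] = s
    TOY[(y, x)] = s  # mirror to make the lookup symmetric


def expand(code, alphabet):
    """Return the concrete residues a (possibly ambiguous) code can stand for."""
    if code == "X":
        return alphabet
    return AMBIGUOUS.get(code, code)


def averaged_score(matrix, a, b, alphabet="DNEQ"):
    """Average the matrix score over every residue pair that a and b may mean."""
    pairs = product(expand(a, alphabet), expand(b, alphabet))
    return mean(matrix[(x, y)] for x, y in pairs)
```

With this toy matrix, `averaged_score(TOY, "B", "D")` averages the D-D and N-D entries. An unweighted mean treats each possibility as equally likely; weighting by background residue frequencies would be another option.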

tabinks commented 9 years ago

There is no established convention that I am aware of to address this. My approach would be to be very conservative with the score and use the lowest possible pairwise score in the substitution matrix. I would rather favor a false negative than a false positive.

Just document whatever you choose.

bdallen-uchicago commented 9 years ago

Interestingly, U never appears in scoring matrices, so BLAST replaces it with X, and it appears not to accept O at all: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
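
If we wanted to mimic that behavior as a preprocessing step, it might look something like the sketch below. `normalize_protein` is a hypothetical helper, not anything from BLAST itself, and raising an error on O is just one possible way to "not accept" it.

```python
def normalize_protein(seq):
    """Uppercase the sequence, map U (selenocysteine) to X, and reject O."""
    seq = seq.upper()
    if "O" in seq:
        # O (pyrrolysine) has no row in the usual scoring matrices.
        raise ValueError("sequence contains O, which is not supported here")
    return seq.replace("U", "X")
```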

uchicago-bio / 2015-Autumn-Forum

scoring alignments with invalid sequence data #51