uchicago-bio / 2015-Autumn-Forum


scoring alignments with invalid sequence data #51

Closed bdallen-uchicago closed 9 years ago

bdallen-uchicago commented 9 years ago

Are there any established conventions for dealing with invalid data when doing Smith-Waterman (S-W) alignment? I think a reasonable strategy is to assign a different score when one or both of the characters are invalid.

It also seems like there is a point at which the sequence is obviously garbage, and not just noisy. If you include a lot of garbage alignments in a database search, it seems like it could mess up the statistics. Having a threshold, e.g. must be greater than 50% valid codes, seems reasonable here.
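A minimal Python sketch of the two ideas above, a separate score for invalid characters and a validity threshold for filtering out garbage sequences. The names `valid_fraction`, `pair_score`, and `keep_sequence`, the fallback score of -4, and the choice to treat only the 20 standard amino acid codes as valid are all assumptions for illustration, not an established convention.

```python
# The 20 standard amino acid codes; everything else counts as invalid here
# (an assumption for this sketch).
VALID_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def valid_fraction(seq):
    """Return the fraction of characters in seq that are valid codes."""
    if not seq:
        return 0.0
    return sum(ch in VALID_CODES for ch in seq.upper()) / len(seq)


def keep_sequence(seq, min_valid=0.5):
    """Keep a sequence only if more than min_valid of its codes are valid."""
    return valid_fraction(seq) > min_valid


def pair_score(a, b, matrix, invalid_score=-4):
    """Score one aligned pair of characters.

    matrix maps (a, b) tuples to scores. If either character is invalid,
    the hypothetical invalid_score is used instead of a matrix lookup.
    """
    if a not in VALID_CODES or b not in VALID_CODES:
        return invalid_score
    return matrix[(a, b)]
```

With these definitions, `keep_sequence("AC??")` rejects a sequence that is exactly 50% valid, since the threshold is strict.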

bdallen-uchicago commented 9 years ago

Similarly, are there any established conventions for scoring the ambiguous amino acid codes X, B, and Z? Averaging the scores across all the possible values seems like the best approach to me.
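As a rough sketch of the averaging idea: B stands for D or N, Z for E or Q, and X for any residue, so the score for an ambiguous code can be taken as the mean over every residue it could represent. The function names and the shape of `matrix` (a dict keyed by residue pairs) are assumptions for illustration.

```python
from itertools import product

STANDARD = "ACDEFGHIKLMNPQRSTVWY"

# Residues each ambiguous code can stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "X": STANDARD}


def expand(code):
    """Return the residues a code may represent (itself if unambiguous)."""
    return AMBIGUOUS.get(code, code)


def averaged_score(a, b, matrix):
    """Average matrix scores over all residues that a and b could be.

    matrix maps (residue, residue) tuples of standard codes to scores.
    """
    pairs = list(product(expand(a), expand(b)))
    return sum(matrix[p] for p in pairs) / len(pairs)
```

Note that this weights every possible residue equally; weighting by background residue frequencies would be another defensible choice.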

tabinks commented 9 years ago

There is no established convention that I am aware of to address this. My approach would be to be very conservative with the score and use the lowest possible pairwise score in the substitution matrix. I would rather risk a false negative than a false positive.

Just document whatever you choose.
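The conservative option could look something like the sketch below, where any pair involving a non-standard code gets the minimum score found anywhere in the matrix. `conservative_score` and the dict-based `matrix` are hypothetical names and structures chosen for the example.

```python
STANDARD_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def conservative_score(a, b, matrix):
    """Score a pair, using the matrix minimum for any non-standard code.

    matrix maps (residue, residue) tuples of standard codes to scores.
    """
    if a in STANDARD_CODES and b in STANDARD_CODES:
        return matrix[(a, b)]
    # Worst score anywhere in the matrix: never rewards an uncertain match.
    return min(matrix.values())
```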

bdallen-uchicago commented 9 years ago

Interestingly, U never appears in scoring matrices, so BLAST replaces it with X, and it appears not to accept O at all: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
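A preprocessing step that mimics the behavior described above might look like the following. This is only a sketch of that description, not BLAST's actual code, and `normalize_protein` is a made-up name.

```python
def normalize_protein(seq):
    """Replace U with X and reject sequences containing O."""
    seq = seq.upper()
    if "O" in seq:
        raise ValueError("sequence contains O, which is not accepted")
    # U has no row in the scoring matrix, so treat it as unknown (X).
    return seq.replace("U", "X")
```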