Closed bdallen-uchicago closed 9 years ago
Similarly, are there any established conventions for scoring the ambiguous amino acid codes X, B, and Z? Averaging the scores across all the possible values seems like the best approach to me.
There is no established convention that I am aware of to address this. My approach would be to be very conservative with the score and use the lowest possible pairwise score in the substitution matrix. I would rather risk a false negative than a false positive.
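A minimal Python sketch of the two options discussed here, averaging over all possible residues versus taking the lowest pairwise score. The expansions B = D or N, Z = E or Q, and X = any residue are the standard meanings of those codes; `ambiguous_score`, `expand`, and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a real substitution matrix lookup such as BLOSUM62.

```python
from itertools import product
from statistics import mean

# The 20 standard amino acid codes.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"

# B = Asp or Asn, Z = Glu or Gln, X = any residue.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "X": STANDARD}


def expand(code):
    """Return every concrete residue a (possibly ambiguous) code can stand for."""
    return AMBIGUOUS.get(code, code)


def ambiguous_score(a, b, pair_score, combine=mean):
    """Score a pair of codes by combining the scores of all concrete pairings.

    combine=mean gives the averaging approach; combine=min gives the
    conservative "lowest possible pairwise score" approach.
    """
    scores = [pair_score(x, y) for x, y in product(expand(a), expand(b))]
    return combine(scores)


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix: +4 match, -1 mismatch."""
    return 4 if x == y else -1


if __name__ == "__main__":
    print(ambiguous_score("B", "D", toy_score))               # averaging
    print(ambiguous_score("B", "D", toy_score, combine=min))  # conservative
```

Passing the combining function as a parameter keeps the choice explicit, which makes it easy to document whichever convention is picked.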
Just document whatever you choose.
Interestingly, U never appears in scoring matrices, so BLAST replaces it with X, and it appears not to accept O at all: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
Are there any established conventions for dealing with invalid data when doing S-W alignment? I think a reasonable strategy is to assign a different score when one or both of the characters are invalid:
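A minimal sketch of that strategy in Python, not an established convention. The names `score_with_invalid` and `VALID_CODES` and the penalty value `-4` are assumptions chosen for illustration; `pair_score` stands for whatever substitution matrix lookup is already in use, and it is assumed to handle the ambiguous codes B, Z, and X itself.

```python
# Codes treated as valid here: the 20 standard residues plus the
# ambiguous codes B, Z, and X (an assumption; adjust as needed).
VALID_CODES = frozenset("ACDEFGHIKLMNPQRSTVWYBZX")

# Hypothetical fixed score used when either character is invalid.
INVALID_SCORE = -4


def score_with_invalid(a, b, pair_score, invalid_score=INVALID_SCORE):
    """Return invalid_score if either character is not a valid code,
    otherwise defer to the normal substitution score."""
    if a not in VALID_CODES or b not in VALID_CODES:
        return invalid_score
    return pair_score(a, b)


if __name__ == "__main__":
    def toy(x, y):
        # Hypothetical stand-in for a substitution matrix.
        return 4 if x == y else -1

    print(score_with_invalid("A", "A", toy))  # normal lookup
    print(score_with_invalid("A", "1", toy))  # one invalid character
```

A separate score for the case where both characters are invalid could be added the same way, by checking the two membership tests individually.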
It also seems like there is a point at which the sequence is obviously garbage, and not just noisy. If you include a lot of garbage alignments in a database search, it seems like it could mess up the statistics. Having a threshold, e.g. must be greater than 50% valid codes, seems reasonable here.
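A minimal sketch of such a filter, assuming sequences are plain strings and that the ambiguous codes B, Z, and X count as valid. The function names and the choice to treat an empty sequence as unusable are assumptions, and the 50% threshold is just the example figure from above.

```python
# Codes counted as valid: 20 standard residues plus B, Z, and X (an assumption).
VALID_CODES = frozenset("ACDEFGHIKLMNPQRSTVWYBZX")


def valid_fraction(seq, valid=VALID_CODES):
    """Fraction of characters in seq that are valid codes (0.0 for empty input)."""
    if not seq:
        return 0.0
    return sum(ch in valid for ch in seq.upper()) / len(seq)


def is_usable(seq, threshold=0.5):
    """True if strictly more than `threshold` of the sequence is valid codes."""
    return valid_fraction(seq) > threshold


if __name__ == "__main__":
    for s in ["ACDEFG", "AC12", "1234"]:
        print(s, valid_fraction(s), is_usable(s))
```

Sequences that fail the check could be skipped before alignment so they never contribute to the database search statistics.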