psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Handle with multiple annotations #227

Closed: bernste1 closed this issue 6 years ago

bernste1 commented 7 years ago

In the initial HMM parameter estimation step you use Smith-Waterman-based methods. In this step there may be multiple annotations for V, D, or J per read, especially for D. How did you handle these annotations? Also, some D genes have identical sequences, such as IGHD4-11*01 and IGHD4-4*01. Did you choose one of the annotations at random? Thanks

psathyrella commented 7 years ago

The Smith-Waterman step returns a list of the best-matching genes for each of V, D, and J, and in practice the parameters are set so that it returns every remotely plausible match. The HMM is then told to choose from among the n best S-W V matches, m D matches, and k J matches (set with --n-max-per-region n:m:k; the default is 3:5:2). The HMM's best match is written to the output csv, along with a column for each region listing the per-gene support for the other matches as a decimal number between 0 and 1.
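To make the n:m:k selection concrete, here is a minimal Python sketch of trimming ranked Smith-Waterman matches down to a per-region candidate list. It is not partis code: the function names, the data layout, the placeholder gene names, and the scores are all assumptions made up for illustration. Only the n:m:k format and the 3:5:2 default come from the description above.

```python
# Illustrative sketch only; none of these names come from partis itself.

def parse_n_max_per_region(spec):
    """Parse an 'n:m:k' string into per-region limits for v, d, and j."""
    n_v, n_d, n_j = (int(x) for x in spec.split(':'))
    return {'v': n_v, 'd': n_d, 'j': n_j}


def top_candidates(sw_matches, spec='3:5:2'):
    """Keep the best-scoring genes in each region, up to that region's limit.

    sw_matches maps a region ('v', 'd', 'j') to a list of
    (gene_name, score) pairs in any order; higher scores are better.
    """
    limits = parse_n_max_per_region(spec)
    kept = {}
    for region, matches in sw_matches.items():
        ranked = sorted(matches, key=lambda pair: pair[1], reverse=True)
        kept[region] = [gene for gene, _ in ranked[:limits[region]]]
    return kept


# Placeholder gene names and made-up scores, purely for illustration.
sw_matches = {
    'v': [('v_a', 250), ('v_b', 245), ('v_c', 240), ('v_d', 100)],
    'd': [('d_a', 30), ('d_b', 29)],
    'j': [('j_a', 60), ('j_b', 58), ('j_c', 12)],
}
candidates = top_candidates(sw_matches)
```

With the default limits, the fourth V match and the third J match are dropped, while both D matches survive because fewer than five were found. The HMM would then choose among the genes left in `candidates`.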

I decided that having multiple gene names for a single sequence was, at least in the context of sequence analysis, an assault against all that is Right and Good in this world, so, yes, I removed one at random from the default germline set in data/germlines/. If you want to make a different choice, you can either modify data/germlines/ or specify a different germline set with --initial-germline-dir.
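If you do build your own germline set, something like the following Python sketch would collapse genes with identical sequences down to a single name each. It is an assumption about how one might do this, not the script used to build data/germlines/. The function name, the seed parameter, the placeholder gene names, and the toy sequences are all hypothetical.

```python
import random
from collections import defaultdict


def drop_duplicate_sequences(germlines, seed=0):
    """Return a copy of germlines with one gene name kept per distinct sequence.

    germlines maps gene name -> nucleotide sequence. When several names share
    a sequence (compared case-insensitively), one is chosen at random. The
    seed makes the choice reproducible.
    """
    rng = random.Random(seed)
    names_by_seq = defaultdict(list)
    for name, seq in germlines.items():
        names_by_seq[seq.upper()].append(name)

    kept = {}
    for names in names_by_seq.values():
        # Sort first so the random choice does not depend on input order.
        chosen = rng.choice(sorted(names))
        kept[chosen] = germlines[chosen]
    return kept


# Placeholder names and toy sequences, not real germline data.
germlines = {
    'd_gene_1': 'ACGTACGT',
    'd_gene_2': 'acgtacgt',  # same sequence as d_gene_1, different case
    'd_gene_3': 'TTGGCCAA',
}
deduplicated = drop_duplicate_sequences(germlines)
```

To keep a specific name instead of a random one, replace the `rng.choice` call with your own rule, for example always taking the first name after sorting.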