psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
54 stars 34 forks source link

--n-alleles-per-gene parameter not working #273

Closed scharch closed 5 years ago

scharch commented 5 years ago

According to the help message:

  --n-alleles-per-gene N_ALLELES_PER_GENE
                        number of alleles to assume per gene when removing spurious alleles during parameter caching (default is set in processargs.py -- 1 if looking for new alleles, otherwise 2) NOTE difference to --n-sim-alleles-per-gene (default: None)

I'm not sure how to understand the apparent conflict between the default being set in processargs.py vs being None, but when I ran partis with defaults, I got 3 alleles each of IGHV3-23, 4-34, 4-39 and 4-59. So I reran partis with --n-alleles-per-gene 2 --and got five alleles of 4-34 and four each of 4-39 and 4-59. And, while 3-23 now only reports two alleles, third alleles have popped up for 1-69, 4-61, and 5-51.

Is there another parameter affecting this that I've missed?

psathyrella commented 5 years ago

Not sure I'm understanding, but maybe: --n-alleles-per-gene only affects the allele removal step, and not the subsequent allele inference step[s]. A solid argument could be made that there should be an option to enforce certain number of alleles per gene, but I haven't actually done it (well, to be more precise, there used to be one and I haven't put it back in) because I want to actually apply a sensible prior on the number of alleles for each gene, and I was waiting to get around to that.

But since that won't happen for a while, I should totally make --n-alleles-per-gene apply to the whole inference procedure.

scharch commented 5 years ago

Yeah, that explains it --and yes, it would be helpful to have a dumb fix in the interim. As to priors: there are some complex cases (esp 4-4/4-59/4-61), but there are a couple of recent papers that I think give pretty good starting points for most of the rest of the genes, at least for Europeans... https://www.sciencedirect.com/science/article/pii/S0161589017300810 https://www.biorxiv.org/content/early/2018/08/19/314476

psathyrella commented 5 years ago

ah, cool, more info for priors is great, thanks.

And adding a bit more context, the reason I removed it from the inference step (and why I'll probably leave it off by default) is that the imgt-named gene/allele distinction just isn't a good enough approximation for the general case. I.e. there's too many instances where alleles of different genes are closer than alleles of the same gene, and vice versa. And the consequence of having on a super strong prior like that by default is you miss any kind of atypical gene duplication deletion event that pops up in individual samples.

psathyrella commented 5 years ago

oh, yeah! the mosaic paper. That paper was super exciting (we were reviewers) for exactly this reason.

scharch commented 5 years ago

the imgt-named gene/allele distinction just isn't a good enough approximation for the general case

Oh, I totally agree and understand that part of the prior will be simulating/benchmarking how well you are able to distinguish alleles of different genes. But the dumb fix is still better than reporting 3 (or 5!) alleles of 4-34, especially for a novice user who might not think to curate further.

scharch commented 5 years ago

Actually, I'm still not completely understanding how this works. Based on your description, it seems like with --n-alleles-per-gene 1 (ie the default), the genotype will be called homozygous unless there's a novel allele detected. Indeed:

<defaults>
IGHV1-46*01,2644
...
IGHV4-34*01+G270A,136
IGHV4-34*01,16715
IGHV4-34*01+G116A,126

but

<with --n-alleles-per-gene 2>
IGHV1-46*01,2355
IGHV1-46*02,309
...
IGHV4-34*02,519
IGHV4-34*01+C6T,55
IGHV4-34*01+G270A,160
IGHV4-34*01,16243
IGHV4-34*01+G116A,143

(note that 4-34*02 has more assigned sequences than any of the inferred alleles) This is problematic, if the second database allele is real.

On the other hand:

<with defaults>
IGHV1-69*10,878
IGHV1-69*13,3958
...
IGHV3-23*01,3818
IGHV3-23*04,6415
IGHV3-23*04+C148T.T152C.C234G,177
...
IGHV4-61*02,8658
IGHV4-61*08,119

...so sometimes this works the way I would want it to? I presume that in these cases the evidence is so strong that partis is re-inferring the second database allele anyway? I guess that means there isn't strong evidence either way for 1-46*02 (eg) so it depends on your prior?

psathyrella commented 5 years ago

Yeah, sorry, I should have said this above (but I forgot), but this is because the allele removal step doesn't act on imgt name-based groupings of alleles into genes, but instead on groups of alleles/genes/whateveryouwantocallthedarnthings from the entire germline set, grouped by edit distance less than eight (I think). I.e. grouped such that shm can confuse genes within, but not between, groupings. This is definitely how allele removal should work, but as you've discovered the problem is that the option that controlled it is poorly named. Well, at the time it wasn't that dumb of a name, it's just the whole thing wasn't designed with applying an imgt name-based prior on alleles per gene. I'll have this sorted in a day or two so you can definitively do that, though.

scharch commented 5 years ago

I get that, but that wasn't really the (most recent) question. Clearly (I think!) if two alleles of 4-61 are being inferred with --n-alleles-per-gene 1, that means that means that (at least) one of them is actually inferred, rather than being derived from the database. On the other hand, 1-4602 and 4-3402 are only reported as being present with --n-alleles-per-gene 2, which I assume means there's not enough evidence to infer them starting from a prior of only the 01 allele being present. But there are fewer sequences assigned to 4-6108 than there are to 1-4602 or 4-3402, so I don't know how to interpret the strength of the evidence (or lack thereof) for those latter alleles. I guess the sequences assigned to 4-6108 are unmutated relative to that germline so fewer are needed to reach significance, while those from 1-4602 and 4-34*02 (if they are, indeed, present) are more mutated and so the evidence is weaker even with more sequences?

Maybe I'm asking for too much certainty where it isn't possible.

psathyrella commented 5 years ago

ok, I think I still don't follow quite all the twists and turns, but I think I can answer nonetheless, first by saying: --n-alleles-per-gene should never, ever be set greater than one if you're doing new allele inference. This is because if you set it to, say, two, the allele removal step will simply take the top two most common genes within each group. But as shown by the vast swaths of red in Fig 4 here, it's super common for the second most common gene to be spurious, such that in practice you have to treat it as though there's no evidence for any gene except the most common within each group. Further, I've never run any of the new allele inference with more than one initial allele per group, and I can't predict exactly what would happen, but it wouldn't necessarily be clever about adjudicating between the two.

Clearly the way the option is named now, this is super confusing, and the help doesn't make clear that the value set by --n-alleles-per-gene is really more of an internal variable than something the user should use to control behavior.

A couple other notes that are related, just to be clear

ok! but getting to what you actually want. No, it's unfortunately nowhere near the point where we can get actual propabiistic numbers for the likelihood of various genes, and probably can't be. But there's definitely ways to get a feeling for the likelihood of various inferences. First, turn on --debug-allele-finding. This prints out a bunch of output of basically all the counts that are used during allele finding. Second, set a --plotdir (and probably also --only-overall-plots for speed, since otherwise it writes the full per-gene parameter counts for other things). This will write out all the mutation accumulation plot fits, and organizes them into an html file under sw/allele-finding/try-0.html, it'll look like this: p and you probably know how these work? lmk if not, but the allele finding is just comparing goodness of fit for the one-piece green model to the two-piece red model. If the green is really terrible and red is ok, it calls a new allele. The numbers in each bin are printed exactly in the debug output.

psathyrella commented 5 years ago

Oh, also, since it sounds like it might be related -- do you remember the naive sequence uncertainty we were talking about when you were trying to get uncertainties for different gene calls within a clonal family? I rewrote that so in addition to getting the naive seqs for every sub cluster during the partioning, it also gets all the genes. So the output can now count up how many times each gene was used for how many sequences in how many clusters. E.g. for a simple example:

p

scharch commented 5 years ago

Thanks, Duncan, this has been very helpful and I really appreciate the time you have taken to go through it with me. I still think a dumb-fix allele limit would probably be useful, but you've convinced me that it's more marginal than I originally thought, so I will go ahead and close the issue.

psathyrella commented 5 years ago

ok added a dumb-fix thing ^ on dev. There's a fair number of caveats (e.g. there isn't really a good way of deciding which inferred allele is most likely, and alleles-per-gene is just determined by imgt name, so it'll fail for those cases where these are out of whack), but it does accomplish the goal of skipping inferred alleles after you've already got a certain number with the same IGHVx-y.