psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
54 stars 34 forks source link

Simulate functional repertoire only #248

Closed scharch closed 5 years ago

scharch commented 6 years ago

I am seeing a lot of stop codons generated, both in CDR3 by the simulated recombination and in V by simulated SHM. At the moment, I am only interested in the functional repertoire, though, so it would be nice to be able to tell partis to consider those sequences invalid.

psathyrella commented 6 years ago

Yes, this should be an option. It'll just involve throwing out a bunch of sequences (like half of them I think? stop codon combinatorics kind of suck) after the fact, since bppseqgen isn't going to know about stop codons.

quick fix: add + tuple(functional_columns) to the end of this line, and then the stops column will be added and you can grep on it or whatever.

scharch commented 6 years ago

It's not relevant to my current use case, but wouldn't just throwing them out post hoc mess up your tree statistics for simulating the clonal structure of a repertoire?

psathyrella commented 6 years ago

yep :(

I'd probably just have to keep rerunning the bppseqgen step until it gives me something with no stop codons... but for large families that will be never. So maybe just revert the stop codons? This is how I deal with mutated cyst/tryp/phen invariant codons, but they're much less common, and there's already an issue to stop doing it that way. Reverting all stop codons would probably screw up the shm characteristics too much.

So the answer might be that if you specify only functional sequences, you'll only get a fraction of the original tree -- which is biologically what's happening, so also the cleanest to code up.

scharch commented 6 years ago

you'll only get a fraction of the original tree -- which is biologically what's happening

Depends how you think about this. If the tree parameters are learned from the functional repertoire (ie those that have already survived selection) then I don't think that logic applies. OTOH, I don't have a better solution, at the moment....

psathyrella commented 6 years ago

Well, once I start learning clonal family properties from data and propagating to simulation... but partis doesn't do any of that yet.

psathyrella commented 2 years ago

Ended up having to implement this since enclone throws away all non-productive sequences, and has neither a way turn this off nor any plans to change it. I decided to add new mutations to each stop codon in the final sequences until it's no longer a stop, completely outside of the actual mutation model/initial tree -- just uniform random on the three positions in the codon and the three potential nucleotides. So, it's wrong, but it's just adding a little bit of fuzz at the ends of branches, so seems less wrong than the other ways I could think of. (Wrong in that this simulates stop codons being introduced, and persisting through the tree until just before the very tips, where they're mutated away; whereas in reality they should get reverted/mutated almost immediately). At least it doesn't slow things down at all. Ideal, of course, would be if there was a way to tell bppseqgen to disallow stops, but I think that isn't possible.

https://github.com/psathyrella/partis/blob/dev/bin/partis#L1447