jbloomlab / SARS-CoV-2-RBD_DMS

Deep mutational scanning of the receptor-binding domain of SARS-CoV-2 Spike
BSD 3-Clause "New" or "Revised" License

add `analyze_counts` notebook #22

Closed jbloom closed 4 years ago

jbloom commented 4 years ago

Adds some qualitative analyses of counts / mutation frequencies for different samples.

Most things make sense to me. The one thing I found interesting (see what you think when you look at the `analyze_counts` notebook) is that for most TiteSeq concentrations the average number of mutations per variant goes down with bin number, as expected if most mutations are deleterious. But this stops being true around bin 12. Is this expected? Does it indicate that the last bins are mostly selecting for beneficial mutations? Or is it just noise due to low counts in those bins at high concentration?
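For reference, here is a minimal sketch of the kind of statistic being discussed: the count-weighted average number of mutations per variant for each (sample, bin) pair, returned alongside the total count so that low-count bins can be spotted. This is not the notebook's actual code; the function name `mean_mutations_per_bin`, the row layout, and the toy values are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_mutations_per_bin(rows):
    """Count-weighted mean mutations per variant for each (sample, bin).

    `rows` is an iterable of (sample, bin, n_mutations, count) tuples.
    This layout is hypothetical, not the pipeline's real table format.
    Returns {(sample, bin): (mean_mutations, total_count)}.
    """
    weighted = defaultdict(float)  # sum of n_mutations * count
    totals = defaultdict(int)      # sum of count
    for sample, bin_, n_mut, count in rows:
        key = (sample, bin_)
        weighted[key] += n_mut * count
        totals[key] += count
    # Skip groups with zero total count to avoid dividing by zero.
    return {k: (weighted[k] / totals[k], totals[k]) for k in totals if totals[k] > 0}


# Toy rows, made up purely to exercise the function.
toy_rows = [
    ("sample_01", 1, 2, 10),
    ("sample_01", 1, 1, 30),
    ("sample_01", 4, 0, 50),
    ("sample_01", 4, 1, 50),
]
means = mean_mutations_per_bin(toy_rows)
```

Reporting the total count next to each mean makes it easier to judge whether an unexpected average comes from a bin with very few counts.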

Also, some counts change slightly when re-running `count_variants.ipynb`, probably due to a difference in environment between me and @tylernstarr?

Anyway, @tylernstarr, can you review and merge? As far as resolving why we get slightly different variant counts, at some point (maybe before you go to bed) you might re-build the environment and re-run the whole pipeline. I suspect that there are slightly different versions of some software. Alternatively, this is not very important to resolve right now (so don't prioritize it), except that currently each of us will get slightly different results from every step we run, as we are starting with slightly different counts.
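One way to see how large the discrepancy is would be to diff the two sets of variant counts directly. The sketch below assumes each run's counts have been loaded into a plain dict keyed by (sample, barcode); the function name `diff_counts`, the key layout, and the toy values are hypothetical and not taken from the pipeline.

```python
def diff_counts(counts_a, counts_b):
    """Return {key: (count_a, count_b)} for every key whose counts differ.

    Keys missing from one run are treated as having a count of 0.
    """
    differing = {}
    for key in set(counts_a) | set(counts_b):
        a = counts_a.get(key, 0)
        b = counts_b.get(key, 0)
        if a != b:
            differing[key] = (a, b)
    return differing


# Toy inputs, made up for illustration only.
run_a = {("sample_01", "AAAC"): 12, ("sample_01", "GGTA"): 7}
run_b = {("sample_01", "AAAC"): 12, ("sample_01", "GGTA"): 8, ("sample_01", "TTCG"): 1}
changed = diff_counts(run_a, run_b)
```

If the differing entries are few and off by small amounts, that would be consistent with a software-version difference rather than a change in the input data.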

tylernstarr commented 4 years ago

I think I see what part you're talking about. Around sample 12 is definitely when, in general, hardly any cells are collected in bin3 (~5000 cells) and bin4 (~1000 cells), as the ACE2 labeling concentration is getting very low. In theory, if any cells are still enriched in bin3 or bin4 at those low concentrations, that would suggest beneficial mutations. I've always assumed the 5000 and 1000 cells getting sorted into those bins at those concentrations are mostly labeling/sorting noise, though (since e.g. approximately the same number of cells are sorted into bins 3 and 4 at the 10^-13 M concentration as in the 0 M baseline sample), so that probably dominates the plot you're showing here. There could still in theory be a small number of variants that are actually consistently in bin3 and bin4 down to lower concentrations than wildtype, but I guess this shows that if there are any, they are going to be the exception. Which I would say is sort of what I might expect. (There could still be a decent number of moderately affinity-enhancing muts that would shift the curve by e.g. a half-log or so; this is more so saying there's not a large number of mutants that enhance affinity by >1 order of magnitude.)