jbloomlab / SARS-CoV-2-RBD_DMS

Deep mutational scanning of the receptor-binding domain of SARS-CoV-2 Spike
BSD 3-Clause "New" or "Revised" License
43 stars 17 forks source link

Improve muteffects output #36

Closed jbloom closed 4 years ago

jbloom commented 4 years ago

@tylernstarr: I ran the whole pipeline, but also made some changes. They are:

  1. I broke the output for the expression and Kd by barcode estimation into one file that just has the mutants of SARS-CoV-2 (as before) and another file that just has the data for the homologs. This second file is different than what you had: before your second file had all mutants of SARS-CoV-2 as well as the homologs, and so was redundant with the first file. I think this is clearer, particularly since the only way that you ever used the second file was to read it into the single mut effects script and then just extract the homologs.

  2. I have rounded all of the estimated expressions and binding to two decimals. I don't think we believe in accuracy beyond that. Note that this then slightly changes the global epistasis fits presumably because some of the outlier points are being fit almost randomly and a tiny difference in precision matters. But looking at the single mut effects notebook, things look virtually the same. Note however that this change will require you to re-run everything downstream of count variants.

  3. For outputting the final values for the single mutants, before you were outputting both the direct measurements and the coefficients from the epistasis model, as well as whatever you chose as the "best" value. I have gotten rid of the first two, and just have the "best" value for each library and the average. I think this is better because this is going to be the final data file that most people look at, and so I don't think we want it to have a bunch of confusing stuff about coefficients that requires high-level knowledge to understand. In practice, anybody interested in global epistasis models is going to go back to the code or raw data anyway.

A benefit of these changes is also that the data files get much smaller. This is because the files with homologs expression and binding no longer duplicates all the mutant effects that are in other files. Furthermore, dropping the number of decimals from 15 to 2 on the numbers will make the files compress well.

So see if you are OK with these changes, or want to pick and choose.

There are two additional things related to this that I think you should do either as additional commits to this pull request or as new pull requests:

a. Clarify what the 1.2533 hardcoded into the error calculations for binding in the mut effects means. I couldn't figure this out---add some text explaining.

b. Is there no error estimate for the binding and expression for the single mutants of SARS-CoV-2? Right now these are output for the homologs, but not the mutations.

After you have resolved all of this and we are set for the final output data files, I will look into how to add large data files (there is a way) to the actual GitHub repo. But I will wait until this pull request and questions above are fully settled.

tylernstarr commented 4 years ago

Ok this all sounds great. I will try to commit the two points you raise above on this branch (dumb question -- how do I get "on" this branch? It doesn't show up when I check my local branches with git branch, so I'm guessing I have to git pull something?)

For the second point -- there is currently no error estimate for the binding and expression phenotypes of single mutants. We currently have just a point estimate of a mutational effect in each library, without a per-library error estimate. We could calculate a standard error of the mean effect from the two point measurements, which although statistically allowed, many people take issue with. I'm ok with it though.

For the homologs, for each per-library measurement, I felt comfortable taking a median and standard error of the median within each library, because these averages were just simply derived from multiple barcoded copies of these variants (the 1.2533 factor is apparently the relationship between SE of a mean and of a median, at least for a normally distributed value which is the assumption we make that we appear to break in some cases, but c'est la vie).

jbloom commented 4 years ago

To get the branch:

git fetch origin improve_muteffects_output
git checkout improve_muteffects_output

Don't do git pull, it might merge the branch into your current one.

jbloom commented 4 years ago

OK, the error points make sense. And I guess having two library values in there gives people some ad hoc way of estimating error by looking at both values---but I agree maybe better not to calculate standard errors on two points.

Do add a line or two to the notebook explaining the 1.2533, I did not know this.