Clarification on uncertainty estimates for dataset metrics

jmichel80 commented 5 years ago

Hi @ppxasjsm from discussions with @maxkuhn it is also good to clarify

why the R_mean value from X resampled calculated free energies is always lower than the R value of the original dataset
Why the uncertainty estimates seem to be always larger than those reported in the FEP+ study. According to https://github.com/michellab/freenrgworkflows/blob/devel/networkanalysis/stats.py we seem to be reporting the 68% CI which should be similar to the +/- 1 sigma uncertainty estimated according to https://pubs.acs.org/doi/suppl/10.1021/ja512751q/suppl_file/ja512751q_si_001.pdf p9/10.

Yet @maxkuhn says he has processed FEP+ dataset with freenrgworkflows and obtained different R uncertainties even though similar DDG uncertainties were used. @maxkuhn please post your input with the commands you ran and the output you got to allow @ppxasjsm and myself to reproduce your findings.

maxjump commented 5 years ago

Hi @jmichel80 and @ppxasjsm ,

please find three csv files containing the experimental and calculated affinities for the Thrombin dataset reported in the specified paper. R is reported to be 0.71 +/- 0.24 ("Errors for [...] R values by use of the bootstrapping method are also reported.")

issue22.zip

R is 0.706, the R confidence as calculated by freenrgworkflows is about 0.128 < 0.282 < 0.423 (i.e. 0.282 +/- 0.154 or 0.140) when taking into account the "cycle errors" or 0.102 < 0.264 < 0.411 (i.e. 0.264 +/- 0.163 or 0.146) when assuming an error of 1.1 kcal/mol per perturbation.

Any feedback on what may be different about the bootstrapping performed in freenrgworkflows and the FEP+ paper is greatly appreciated.

ppxasjsm commented 5 years ago

First thing I noticed is that you are using absolute and not relative free energies. They are fine to use for R but not other metrics. I'll write a method that will support reading in absolute free energies.

As for the other issues, I think the behaviour I am still debugging.

maxjump commented 5 years ago

Thanks for the quick feedback. I forgot to mention that I am generating the files containing the relative binding free energies on the fly. Hence, we do not necessarily need a method implemented in freenrgworkflows to handle that. I assume you do not need the files, but just in case: issue22_relativeddG.zip

ppxasjsm commented 5 years ago

So from my unerstanding of reading the SI the analysis I implemented and the one they suggest are very similar. You end up subsampling according to a gaussian distribution, both the experimental and computed values (I in fact do not experimental variation into account.)

I use:

with sigma set to 1.1 kcal/mol for computed free energies and 0.4 kcal/mol for experimental free energies.

So my algorithm looks like this:

for n samples:
    clear lists
    for n compounds:
            generate new sample from gaussian with DDG of compound n and sigma of 1.1 or 0.4
            append sample to new list
    compute correlation between new 'fake' experimental and computed values
    and add this to list

compute the mean and standard deviation of the list of R values.

When I do this for one of your example datasets I get the following: R = 0.33±0.25 And the generated distribution looks like this:

I would naively assume, oh...why is R so much worse after the subsampling, there must be something wrong (hence the question). Actually there is a paper that answers this quite well. Realistic_samples.pdf

What I suspect has happened is, that they computed a correlation coefficient of the original data and then the standard deviation of the subsampled data. The standard deviation is around 0.25, which is exactly what they report, it just so happens that the mean of the data set due to the introduced error gets drastically shifted, this is what freenrgworkflows can report. I have attached my calculations for you to take a look.

Essentially what I am advocating is that:

both the Wang et al paper and freenrgworkflows use gaussian subsamples in an almost identical way (free energy workflows uses estimated errors on free energies to generate subsamples)
freenrgworkflows reports the mean and std, and interval from the gaussian subsamples
Wang et al reports R of the original dataset and the std of the subsampling.

The only thing is that I may misunderstand how to sample from these Gaussian distributions, but this was pretty much code copied from a script @jmichel80 had given me a long time ago.

explanation.zip

maxjump commented 5 years ago

Thanks for the detailed feedback and clarification! I get the same results using my script.

However, it does not completely match the numbers provided in the paper:

If I understand the SI correctly, the approach above (assuming errors of 1.1 and 0.4 kcal/mol) is reported as "anticipated FEP R-value" (second last line of the table). When running the uncertainty estimation (fifth cell in your notebook) 100 times, the means of R and std are 0.34 +/- 0.26 which is slightly different from the reported 0.37 +/- 0.26. Testing the same script on Tyk2 gives 0.62 +/- 0.13 which is significantly different from the reported 0.74 +/- 0.10. (input files: issue22_tyk.zip)

Furthermore, please note that there is a small difference in the standard deviation of the "observed R-value FEP" (9th line) and the "anticipated FEP R-value" (Thrombin: 0.24 vs 0.26, Tyk2: 0.07 vs. 0.10).

Hence, I assume that

we are not calculating the "anticipated FEP R-value" correctly and
the bootstrapping for the "observed" values is performed slightly differently and not according to the method described in the SI. However, in both cases I have no idea where the difference might be coming from. Any thoughts on this? @ppxasjsm @jmichel80

ppxasjsm commented 5 years ago

What I implemented in the notebook is what the SI says as far as I understand. I am not really sure what their DG_i value is other than the one they report, i.e. representing the mean of the distribution. I'll give it another careful read.

The data you gave me is the FEP+ data?

jmichel80 commented 5 years ago

Hi Max,

The small differences you see may be due to the number of repeats you carried out. Try running 1000 times to see if that changes anything. I think there is some terminology confusion in the thread.
Toni's code, which is based on some old scripts I wrote, which were based on the above mentioned Brown et al. paper, resample the distributions (not subsample).
There is no bootstrapping going on, the same population is used in every experiment. If bootstrapping was used each resampling would have a different composition of ligands and R would fluctuate even more (this is the approach used in SAMPL/D3R analyses).

Some other comments. -Based on the SI of Wang it is possible that the experimental data is resampled with sigma = 0.4 kcal/mol for the purpose of determining the maximum plausible R. And the computed FEP data is resampled with 1.1 kcal/mol for determining the uncertainty on R, but using a fixed set of experimental values which would introduce less noice than perturbing the x values by sigma=1.1 kcal/mol and the y-values by 0.4 kcal/mol.

I agree with Toni that it makes sense that the mean R drops if we introduce random noise in an initial correlation
I don't like the idea of using arbitrarily 1.1 kcal/mol because it folds in systematic (model) errors into a metric that I would prefer to interpret as the likelihood that I will get a similar result if I repeat the calculations. If we plug in statistical uncertainties into the model instead we get an uncertainty estimate closer to the latter interpretation (subject to the DDG uncertainties being robust of course).

Best wishes,

Julien

Dr. Julien Michel, Senior Lecturer Room 263, School of Chemistry University of Edinburgh David Brewster road Edinburgh, EH9 3FJ United Kingdom phone: +44 (0)131 650 4797 http://www.julienmichel.net/

On Thu, Oct 24, 2019 at 12:39 PM Maximilian Kuhn notifications@github.com<mailto:notifications@github.com> wrote:

Thanks for the detailed feedback and clarification! I get the same results using my script.

However, it does not completely match the numbers provided in the paper:

If I understand the SI correctly, the approach above (assuming errors of 1.1 and 0.4 kcal/mol) is reported as "anticipated FEP R-value" (second last line of the table). When running the uncertainty estimation (fifth cell in your notebook) 100 times, the means of R and std are 0.34 +/- 0.26 which is slightly different from the reported 0.37 +/- 0.26. Testing the same script on Tyk2 gives 0.62 +/- 0.13 which is significantly different from the reported 0.74 +/- 0.10. (input files: issue22_tyk.ziphttps://github.com/michellab/freenrgworkflows/files/3767232/issue22_tyk.zip)

Furthermore, please note that there is a small difference in the standard deviation of the "observed R-value FEP" (9th line) and the "anticipated FEP R-value" (Thrombin: 0.24 vs 0.26, Tyk2: 0.07 vs. 0.10).

Hence, I assume that

we are not calculating the "anticipated FEP R-value" correctly and
the bootstrapping for the "observed" values is performed slightly differently and not according to the method described in the SI. However, in both cases I have no idea where the difference might be coming from. Any thoughts on this? @ppxasjsmhttps://github.com/ppxasjsm @jmichel80https://github.com/jmichel80

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/michellab/freenrgworkflows/issues/22?email_source=notifications&email_token=ACZN3ZAE5AL24EMRIH5ZF6DQQGCNJA5CNFSM4JD756GKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECEXBAI#issuecomment-545878145, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACZN3ZGE2MYNTGM6YMRYNL3QQGCNJANCNFSM4JD756GA.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

ppxasjsm commented 5 years ago

What I don't understand is equation 1.1 in the SI. Is this just the condition that the overall sum of free energies experimentally known and computed must be the same?

maxjump commented 5 years ago

@ppxasjsm I agree that the content in your notebook should reproduce the results outlined in the SI. As we both wrote some code independently after reading the SI and do get the same result, we might well both be overlooking something. The data is taken from the excel sheet supplied in the SI.

Regarding your other comment, yes, they reweigh the computed results so that the sums of the experimental and computed results are equal.

ppxasjsm commented 5 years ago

Mmmh, so we are using the DGs they provide and are generating new samples based on the Gaussian distributions they are using...

maxjump commented 5 years ago

@jmichel80 Thanks for your suggestions. I have been running it using 1000 repeats now, but the mean and stdev do not change. I find this really strange...

Assessing the uncertainty of R in the way you suggest certainly makes sense. However, reporting the actual R value with the standard deviation of the subsampled R value does not seem completely correct to me, and I have not really felt too confident doing that. What's your view on that?

I am also not a huge fan of the 1.1 kcal/mol estimate, especially since some perturbations are clearly easier than others. I was okay with this approach if I would expect all perturbations to be of similar difficulty (i.e. the same type of transformation). For example, this overestimates the difficulty for the Thrombin dataset, as there are only fairly easy perturbations.

@ppxasjsm It may sound dumb, but would you mind extracting the data for thrombin and tyk2 from the SI yourself and run your script on it, just to ensure I am not messing something up at that early stage?

ppxasjsm commented 5 years ago

I get the same paper prediction for Tyk2 0.89 is the direct correlation coefficient of exp v computed. The std I get from subsampling is slightly larger than the 0.07, but the correlation is the same.

I also don't think we should use 1.1 kcal/mol as a general error for computed results there are for sure some datasets that are more uncertain than that and the variability from repeats seems a better indicator.

I don't think you should report R on the raw data and std from the subsampling that is misleading. Though using the subsampling you are going to get a seemingly worse correlation. Reporting them separately may be useful, but the most meaningful value is really the subsampled mean±std or confidence intervals since the distribution is not perfectly Gaussian. This is also a good read, if you haven't had a look at this yet: http://practicalcheminformatics.blogspot.com/2019/07/how-good-could-should-my-models-be.html

I can extract the data and do a consistency check yes, will do that now and get back to you.

ppxasjsm commented 5 years ago

So I had another very close look at this and looked at all the FEP+ compounds.

I can regenerated the straight up R value for all of them, but the resampled ones from the distributions I cannot get the right results. Though I also noticed somewhat inconsistent seeming results in the spreadsheet.

See for example CDK2: R is: 0.48+-0.19, but supposedly Exp R FEP is 0.73+-0.11 which definitely seems off.

Attached is my complete analysis of this, but the simple print out looks like this:

/Users/toni_brain/Desktop/issue22/proteins/bace.csv
R from statistics is: 0.535 ± 0.098, FEP+: 0.64+-0.09
Direct R is: 0.781   FEP+ R is 0.78
====================================
/Users/toni_brain/Desktop/issue22/proteins/cdk2.csv
R from statistics is: 0.242 ± 0.217, FEP+: 0.73+-0.11
Direct R is: 0.476   FEP+ R is 0.48
====================================
/Users/toni_brain/Desktop/issue22/proteins/jnk1.csv
R from statistics is: 0.662 ± 0.097, FEP+: 0.64+-0.12
Direct R is: 0.846   FEP+ R is 0.85
====================================
/Users/toni_brain/Desktop/issue22/proteins/mcl1.csv
R from statistics is: 0.602 ± 0.074, FEP+: 0.71+-0.07
Direct R is: 0.772   FEP+ R is 0.77
====================================
/Users/toni_brain/Desktop/issue22/proteins/p38.csv
R from statistics is: 0.462 ± 0.107, FEP+: 0.67+-0.08
Direct R is: 0.653   FEP+ R is 0.65
====================================
/Users/toni_brain/Desktop/issue22/proteins/ptp1b.csv
R from statistics is: 0.588 ± 0.110, FEP+: 0.79+-0.07
Direct R is: 0.803   FEP+ R is 0.80
====================================
/Users/toni_brain/Desktop/issue22/proteins/thrombin.csv
R from statistics is: 0.339 ± 0.255, FEP+: 0.37+-0.26
Direct R is: 0.706   FEP+ R is 0.71
====================================
/Users/toni_brain/Desktop/issue22/proteins/tyk2.csv
R from statistics is: 0.624 ± 0.133, FEP+: 0.74+-0.10
Direct R is: 0.891   FEP+ R is 0.89
====================================

For the notebook to work, you'll have to change the file paths, but I guess that is kind of self explanatory.

proteins.zip

My suggestion in terms of proceeding:

You can report straight up R values and compare them to the straight up R values of the FEP+ data
Use the implementation we both have come up with for the regeneration of samples both for the FEP+ data and Flare data and then compare these methodologies (See http://practicalcheminformatics.blogspot.com/2019/02/some-thoughts-on-evaluating-predictive.html for good practice to do this. )

I am at a bit of a loss to how exactly we are meant to reproduce the stated data with the information given, but I don't really see anything wrong with what we are doing. @jmichel80 What do you think?

ppxasjsm commented 5 years ago

Ah in terms of freenrgyworkflows I can incorporate the sampling of the experimental uncertainties as well, which isn't currently done. Let me know if you want me to do a quick update of the code.

maxjump commented 5 years ago

Hi Toni, thanks again for your feedback, suggestions and testing the datasets. It's really strange we cannot reprocue this, especially as the results differ quite a lot (looking at the p38 dataset for example). As you suggested, I also tend to report R and the reported R and std separately. I will look into this again once I am back from the GCC (I am off this week). In the meanwhile, it would be great if you could get the code running in freergworklfows, which hopefully should not be too much of an issue. Cheers!

michellab / FreeEnergyNetworkAnalysis

Clarification on uncertainty estimates for dataset metrics #22

Dr. Julien Michel, Senior Lecturer Room 263, School of Chemistry University of Edinburgh David Brewster road Edinburgh, EH9 3FJ United Kingdom phone: +44 (0)131 650 4797 http://www.julienmichel.net/