Closed proteneer closed 6 years ago
PS: the main issue is that we're having some difficulty reproducing the GDB-10 results (we have reproduced the val/test results).
Okay, a couple of things. 1) The network in this repo is not the same as the one from the paper. This one was trained on the ANI-1 data set plus some amino acid and peptide data. Also, through hyperparameter searching we determined that the AEV parameters used here work just as well as the 768-element AEV on the ANI-1 + peptide data set. 2) In the paper we trim energies more than 300 kcal/mol above the minimum of each set of conformers for the GDB-10 test. This may not have been explicitly mentioned in the paper, but it is clear from the range in figure 4 that this is what we are comparing. The high-energy GDB-10 conformations are VERY hard to fit if you are using the trimmed (@275 kcal/mol) version of the ANI-1 data set (which is what we used in the paper and recently published as the "low" energy part of the ANI-1 data set).
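To make the trimming in point 2 concrete, here is a minimal sketch of filtering one conformer set by its energy above the set minimum. The function name, the variable names, and the example energies are hypothetical and are not taken from the ANI code; it also assumes the energies have already been converted to kcal/mol.

```python
# Sketch only: names and numbers are illustrative, not from the ANI code base.

def trim_high_energy(energies_kcal, window_kcal=300.0):
    """Return indices of conformers whose energy lies within
    window_kcal of the lowest energy in the same conformer set."""
    e_min = min(energies_kcal)
    return [i for i, e in enumerate(energies_kcal) if e - e_min <= window_kcal]


# One conformer set for a single molecule, energies in kcal/mol (made-up values).
conformer_energies = [-100.0, -90.0, 150.0, 250.0]

# Relative energies are 0, 10, 250 and 350 kcal/mol, so the last one is dropped.
kept = trim_high_energy(conformer_energies, window_kcal=300.0)
print(kept)
```

The same function with `window_kcal=100.0` would give the tighter cut discussed elsewhere in this thread.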
As it turns out, I recently built an ensemble of original ANI-1 networks (5 copies of our model, each trained on one split of a 5-fold cross-validation style partition of the ANI-1 "low" energy data set) to compare on a new benchmark I have been developing. The new networks were built with the same parameter file used in this repository. For the ensemble we get an RMSE of 1.7 kcal/mol. You can view these results here (this notebook will also show you how we do the comparison):
https://github.com/Jussmith01/ANI-Tools/blob/master/notebooks/eval_testset.ipynb
If you'd like me to make the ANI-1 ensemble available on this repo for comparison I can do that.
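The linked notebook is the authoritative description of the comparison. As a rough illustration only, the sketch below assumes the ensemble prediction is the plain mean of the member predictions (the thread does not confirm this) and scores it with RMSE. All names and numbers are hypothetical.

```python
import math

# Sketch only: assumes a simple mean over ensemble members; values are made up.

def ensemble_mean(member_predictions):
    """Average per-conformer predictions over all ensemble members."""
    n_members = len(member_predictions)
    return [sum(column) / n_members for column in zip(*member_predictions)]


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Two hypothetical members predicting energies (kcal/mol) for three conformers.
member_predictions = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
]
reference = [2.0, 2.0, 0.0]

mean_prediction = ensemble_mean(member_predictions)
print(mean_prediction, rmse(mean_prediction, reference))
```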
@Jussmith01 Thank you for the very detailed explanation and the notebook. We've confirmed internally that our test scores become significantly better after pruning the high-energy conformations. For many of the applications we care about, we typically only consider conformations in the <100 kcal/mol range (you report using 300 kcal/mol).
We did some analysis on the training set as well: of the 22 million conformations you provide, about 6 million have energies more than 100 kcal/mol above the minimum. It looks like this dataset has a fairly large number of outliers, some with rather interesting geometries (smaller C=O bonds, for example).
6M of the 22M being > 100 kcal/mol sounds about right. Regular normal mode sampling by default biases conformations towards energy minima. We have since refined our methods and have a soon-to-be-submitted paper that covers this topic a little. As for weird geometries, they can appear when a harmonic approximation is used to determine the structural perturbations. However, it is a very cheap way to generate non-equilibrium conformations, and from what we have seen it works well when you filter out high-energy conformations (which tend to be the weird structures).
I noticed that the parameters in
https://github.com/isayev/ASE_ANI/blob/master/ANI-c08f-ntwk/rHCNO-4.6A_16-3.1A_a4-8.params
differ from what's in the paper.
Namely, this produces a feature vector with 384 floats as opposed to the ~700 mentioned in the paper. In addition, the NN itself seems to be 256x128x64x1 as opposed to 128x128x64x1.
Do you mind clarifying which canonical parameters are needed to reproduce the paper's results with the given dataset?
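For reference, the two feature vector sizes can be reconciled with a quick count. This is a sketch under two assumptions that are not confirmed in this thread: that the AEV has one radial block per species and one angular block per unordered species pair, and that the file name `rHCNO-4.6A_16-3.1A_a4-8` encodes 16 radial shifts and 4 x 8 angular shifts. The paper-side counts (32 radial, 8 x 8 angular) are likewise assumed.

```python
# Sketch only: the parameter counts below are assumptions, see the lead-in.

def aev_length(n_species, n_radial, n_angular_radial, n_angular_sections):
    """Length of an ANI-style atomic environment vector.

    Radial part: one block of n_radial values per species.
    Angular part: one block of n_angular_radial * n_angular_sections
    values per unordered pair of species (including same-species pairs).
    """
    radial = n_species * n_radial
    n_pairs = n_species * (n_species + 1) // 2
    angular = n_pairs * n_angular_radial * n_angular_sections
    return radial + angular


# H, C, N, O -> 4 species.
repo_length = aev_length(4, n_radial=16, n_angular_radial=4, n_angular_sections=8)
paper_length = aev_length(4, n_radial=32, n_angular_radial=8, n_angular_sections=8)
print(repo_length, paper_length)
```

Under these assumptions the counts come out to 4*16 + 10*4*8 = 384 for the repo parameters and 4*32 + 10*8*8 = 768 for the paper.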