openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Weirdly small DFT total energies in SPICE dipeptide dataset #60

Closed CanisW closed 3 months ago

CanisW commented 1 year ago

Hello,

I noticed that there are about 40 dipeptides whose lowest DFT energies are outrageously lower than the energies of their other conformations, which looks too strange to be correct. For example, for phe-trp, the three lowest energies are around -1416, -1399, and -1371, but all other conformations have energies around -1337. I tried to reproduce the energies but failed. As seen below, almost all of the energies around -1337 could be reproduced, but not the three lowest. I was wondering if this was due to some glitch during the publication of the dataset?

(The suffix numbers in "name" are the indices of the conformations. "spice" is the energies from SPICE 1.1.2. "psi4" is the energies obtained from my psi4 runs. "delta" is the absolute difference.) [image: table of per-conformation energies]

Thank you!

pavankum commented 1 year ago

I tried to reproduce the energies but failed.

Hi @CanisW, thank you for your interest in the dataset. Can you please share your psi4 input files and the version you are using? I can take a look.

CanisW commented 1 year ago

The Psi4 version I used is

# Name                    Version                   Build  Channel
psi4                      1.6.1+5b9f6e3    py39h1bd450f_0    psi4

Here is the input file for phe-trp_24, which is the 24th conformation in the h5 file.

molecule {
0 1
O   0.717984676361084   -3.095458984375 2.654181718826294
O   2.446326732635498   1.9379850625991821  1.2529840469360352
O   1.547669768333435   0.484706312417984   -1.5783201456069946
C   1.324904203414917   -2.4513137340545654 1.8323588371276855
C   1.3513715267181396  2.501216411590576   1.1418089866638184
C   1.1674320697784424  -0.19642436504364014    -0.6571672558784485
C   -1.0349023342132568 4.919960975646973   -0.4869817793369293
C   -1.3647388219833374 3.665922164916992   -0.12326543033123016
C   4.379732608795166   -6.229212760925293  -0.7015548944473267
C   -4.53481388092041   3.3337202072143555  2.1125316619873047
C   -4.750723361968994  4.689075469970703   2.386181592941284
C   5.091174602508545   -5.252649784088135  0.024105509743094444
C   3.2238030433654785  -5.877880573272705  -1.3675243854522705
C   -3.4224579334259033 2.901738166809082   1.3694053888320923
C   -3.895850896835327  5.626366138458252   1.7543511390686035
C   4.628105163574219   -3.907526731491089  0.04259481281042099
C   2.7659683227539062  -4.537069320678711  -1.3440865278244019
C   -2.524872064590454  3.821657180786133   0.7466371655464172
C   -2.817704677581787  5.209049224853516   0.9263613820075989
C   3.4579014778137207  -3.5466105937957764 -0.6423216462135315
N   -1.9337800741195679 5.8115057945251465  0.09178182482719421
N   1.1463748216629028  3.7652995586395264  1.5889661312103271
N   0.3946019113063812  0.29556018114089966 0.33369699120521545
N   0.9795157313346863  -2.4547946453094482 0.5526197552680969
C   2.486135482788086   -1.6030634641647339 2.329084873199463
C   2.155008554458618   4.5512590408325195  2.24495530128479
C   -0.6733633875846863 2.3690452575683594  -0.5318013429641724
C   3.0382254123687744  -2.0663554668426514 -0.7736204266548157
C   0.07683441042900085 1.7166537046432495  0.6425957083702087
C   1.5191954374313354  -1.6879788637161255 -0.6116884350776672
H   -0.28098249435424805    5.216770172119141   -1.1826730966567993
H   4.715267181396484   -7.259854316711426  -0.7024464011192322
H   -5.235235691070557  2.649873733520508   2.5625901222229004
H   -5.620849132537842  5.038148880004883   2.955796003341675
H   5.920680046081543   -5.598872661590576  0.6667307615280151
H   2.6951706409454346  -6.622916221618652  -1.9589451551437378
H   -3.258512496948242  1.872567057609558   1.1403477191925049
H   -4.049194812774658  6.670241832733154   1.8391200304031372
H   5.159170150756836   -3.1391680240631104 0.6165569424629211
H   1.903632402420044   -4.267619609832764  -1.9118586778640747
H   -1.9501277208328247 6.8153181076049805  -0.0896596759557724
H   0.1914159506559372  4.09887170791626    1.547421932220459
H   0.13998019695281982 -0.41456520557403564    0.9866563081741333
H   0.2588236927986145  -3.140054225921631  0.41808825731277466
H   2.4749765396118164  -0.5881248712539673 1.9133433103561401
H   2.465522527694702   -1.524481177330017  3.4043023586273193
H   3.3927764892578125  -2.086008310317993  2.0371952056884766
H   3.0007033348083496  4.7001423835754395  1.569632649421692
H   1.753841757774353   5.515769958496094   2.5485167503356934
H   2.472287893295288   4.045438289642334   3.1510086059570312
H   -1.4471417665481567 1.6654247045516968  -0.8613872528076172
H   0.03330877050757408 2.5926074981689453  -1.3316395282745361
H   3.2639055252075195  -1.7258206605911255 -1.7604280710220337
H   3.6439661979675293  -1.4522535800933838 -0.11515222489833832
H   -0.5912652015686035 1.7191565036773682  1.5179795026779175
H   1.0174472332000732  -2.0953493118286133 -1.4486362934112549

}
memory 20GB
set_num_threads(8)
psi4_io.set_default_path("/scratch")
energy('wB97M-D3(BJ)/def2-TZVPPD')
clean()
pavankum commented 1 year ago

@CanisW Thanks for the input. For the dipeptide phe-trp, these are the energies I see on 'SPICE Dipeptides Single Points Dataset v1.2' (the record names on the dataset are different from your labeling; I marked with an asterisk the record you posted the input for):

phe-trp-0 | -1337.68759142796
phe-trp-1 | -1337.67501538889
phe-trp-2 | -1337.67559381996
phe-trp-3 | -1337.65177393408
phe-trp-4 | -1337.68477736649
phe-trp-5 | -1337.65828109228
phe-trp-6 | -1337.63429009811
phe-trp-7 | -1337.65663178633
phe-trp-8 | -1337.67033907789
phe-trp-9 | -1337.63449133292
phe-trp-10 | -1337.64129423476
phe-trp-11 | -1337.64196623861
phe-trp-12 | -1337.64720898978
phe-trp-13 | -1337.68361958136
phe-trp-14 | -1337.66340234915
phe-trp-15 | -1337.66169252283
phe-trp-16 | -1337.6691782733
phe-trp-17 | -1337.65572149042
phe-trp-18 | -1337.66551279715
phe-trp-19 | -1337.62824858112
phe-trp-20 | -1337.64660759114
phe-trp-21 | -1337.6433082246
phe-trp-22 | -1337.65775690413
phe-trp-23 | -1371.44004049445 ----> QCA ID: 98206913
phe-trp-24 | -1337.65454105432
phe-trp-25 | -1337.77778321702
phe-trp-26 | -1337.77539937729
phe-trp-27 | -1337.77290333888
phe-trp-28 | -1337.76526136228
phe-trp-29 | -1337.76651259807
phe-trp-30 | -1337.75974479219
phe-trp-31 | -1337.75287746101
phe-trp-32 | -1337.76256598646
phe-trp-33 | -1337.75641817085
phe-trp-34 | -1337.76857958872
phe-trp-35 | -1399.83478834261 ----> QCA ID: 98206893
phe-trp-36 | -1337.74971070527
phe-trp-37 | -1337.76903581675
phe-trp-38 | -1337.77298529498
phe-trp-39 | -1337.76970880138
phe-trp-40 | -1337.77273898459
phe-trp-41 | -1337.77049690548
phe-trp-42 | -1337.77190082592
phe-trp-43 | -1337.7683213354
phe-trp-44 | -1337.75164635041
*** phe-trp-45 | -1416.8074365725 ---> QCA ID: 98206917
phe-trp-46 | -1337.76538614168
phe-trp-47 | -1337.75930154735
phe-trp-48 | -1337.77119499015
phe-trp-49 | -1337.77045070311

I confirm your alarming finding of differences in energies. It seems there were some sporadic erroneous runs from psi4, which might be related to an ongoing issue. I could not reproduce the errored calculation, so a short-term solution would be to weed out these outliers when using the dataset. I may have to do some follow-up on a long-term solution.

I attached the QCA stdout for this calculation and for a rerun with the same input as the SPICE set; they show the difference in energies you mentioned. The provenance of the output stored on QCA shows:

'Psi4', version='1.4a3.dev63', routine='psi4.schema_runner.run_qcschema', memory=120.0, wall_time=7291.645591020584, cpu='Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz'

qca_psi4_stdout_98206917.txt rerun_of_98206917.txt input_for_rerun.py.txt

There were two other conformers for this molecule that show erratic energies and they were run on 'Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz', and 'AMD EPYC 7601 32-Core Processor' with 8 threads/120GB memory config. The outputs for these:
qca_psi4_stdout_98206913.txt qca_psi4_stdout_98206893.txt

The only good thing about this is that our processing of results is not the source of the error.

CC: @peastman @jchodera @raimis FYI, I think filtering out these inconsistent data points is necessary for end usage. Out of 33,850 conformations, 40 were bad in the dipeptides dataset according to this issue. This is unexpected random behaviour, and since it is not reproducible I cannot pinpoint whether it is a compute-node issue (all three calculations were run on different processors) or due to psi4 parallelism/the MKL library as their GitHub issue mentions, and I am a bit time-constrained for further debugging/follow-up.

jchodera commented 1 year ago

@pavankum : Thanks for digging into this. Is it possible that this also impacts other OpenFF datasets?

@bennybp: Would you be able to help us figure out what happened here, how we might be able to systematically filter out the problematic snapshots for the SPICE dataset (and OpenFF datasets, if necessary), and what we might be able to do to fix the problem more permanently in QCFractal?

peastman commented 1 year ago

When I was fitting models to the data, I found I had to filter out samples with unusually large forces. I assumed that was an artifact of the generation process: conformations were generated based on a classical force field, and every now and then a conformation that seemed reasonable according to the force field would be highly strained according to DFT. But looking at phe-trp, I see the three bad samples were all filtered out based on forces. So perhaps it actually indicated an error in psi4?

We could try using that as a way to identify bad samples: look for anything with an unreasonably large force and try recalculating it. I believe around 2% of all samples got filtered out, so it wouldn't be that expensive to recalculate all of them.
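A force-based filter like the one described here could be sketched as follows. This is a minimal sketch: the gradient array layout and the 1 hartree/bohr cutoff are illustrative assumptions, not the exact filter used for SPICE.

```python
# Sketch of a force-based outlier filter: flag conformations whose largest
# force component exceeds a cutoff. The (n_conformations, n_atoms, 3) layout
# and the cutoff value are assumptions for illustration.
import numpy as np

FORCE_CUTOFF = 1.0  # hartree/bohr; conformations above this are treated as suspect

def suspect_conformations(gradients: np.ndarray, cutoff: float = FORCE_CUTOFF):
    """Return indices of conformations with any force component above cutoff."""
    max_force = np.abs(gradients).reshape(len(gradients), -1).max(axis=1)
    return np.nonzero(max_force > cutoff)[0]

# Synthetic example: one clearly bad conformation out of three.
grads = np.zeros((3, 5, 3))
grads[1, 2, 0] = 27.66  # absurdly large force component
print(suspect_conformations(grads))  # -> [1]
```

In practice the gradients would come from the dataset records rather than a synthetic array.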

trevorgokey commented 1 year ago

Hi, I was the person who seemed to be producing some of these bogus calculations. It appears to be the psi4 version that is causing the issue. @pavankum provided me with an input for phe-trp-45 that I ran a few tests on. When I supplied enough memory for the in-core algorithm to be used, I got an energy of -1416. When I reduced the memory so that the disk algorithm was used, I got an energy of -1337. I am using psi4 1.4a3.dev63+afa0c21, and the issue seems to have gone away by 1.4.1, since his in-core calculation produced the correct energy.

peastman commented 1 year ago

Would it be possible to 1) identify every sample run with the bad version of psi4, 2) create a new subset containing those samples so we can rerun them, and 3) remove the bad samples from the original subsets?

jchodera commented 1 year ago

@trevorgokey : Does the psi4 version info end up in the QCArchive result entries? If so, can we build a filtered version of the dataset that eliminates erroneous entries and post a new HDF5 file until we can regenerate the erroneous entries?

trevorgokey commented 1 year ago

Yes each record has psi4 provenance. Here is a script to pull the provenance from each of the datasets safely and about as fast as possible. It took about 30 minutes to go through the list. I skipped the optimization set as it had a low completion rate and assumed it had bigger issues.

#!/usr/bin/env python

import more_itertools
import qcfractal.interface as ptl
import tqdm
import collections
import json

ds_names = [
    "SPICE DES Monomers Single Points Dataset v1.1",
    "SPICE DES370K Single Points Dataset Supplement v1.0",
    "SPICE DES370K Single Points Dataset v1.0",
    "SPICE Dipeptides Single Points Dataset v1.2",
    "SPICE Ion Pairs Single Points Dataset v1.1",
    "SPICE PubChem Set 1 Single Points Dataset v1.2",
    "SPICE PubChem Set 2 Single Points Dataset v1.2",
    "SPICE PubChem Set 3 Single Points Dataset v1.2",
    "SPICE PubChem Set 5 Single Points Dataset v1.2",
    "SPICE PubChem Set 6 Single Points Dataset v1.2",
    "SPICE Pubchem Set 4 Single Points Dataset v1.2",
    "SPICE Solvated Amino Acids Single Points Dataset v1.1",
]

client = ptl.FractalClient()

method = "wb97m-d3bj"
versions = {}

for name in tqdm.tqdm(ds_names, ncols=80):
    ds = client.get_collection("Dataset", name)

    entries = ds.get_entries().name.tolist()
    batches = more_itertools.chunked(entries, 1000)

    total = len(entries) // 1000 + 1
    versions_ds = collections.defaultdict(list)

    for subset in tqdm.tqdm(batches, total=total, desc=name, ncols=80):
        tries = 0
        while True:
            # if we query too much and/or the server is overloaded, the
            # server may drop us, so retry a couple of times
            try:
                result = ds.get_records(
                    method=method,
                    include=["provenance", "status"],
                    subset=subset,
                ).to_dict()
                break
            except Exception:
                tries += 1
                assert tries < 3

        record_provenance = result["provenance"]
        record_status = result["status"]

        for index, status in record_status.items():
            if status != "COMPLETE":
                continue
            provenance = record_provenance[index]
            if provenance:
                ver = provenance["version"]
            else:
                ver = "None"
            versions_ds[ver].append(index)
    versions[name] = dict(versions_ds)

with open("versions.json", "w") as f:
    json.dump({"Dataset": versions}, f)

# TODO:
# OptimizationDataset SPICE Dipeptides Optimization Dataset v1.0

The total counts per version are:

SPICE DES Monomers Single Points Dataset v1.1 {'1.4.1': 18700}
SPICE DES370K Single Points Dataset Supplement v1.0 {'1.4.1': 1984, '1.4a3.dev63': 1608}
SPICE DES370K Single Points Dataset v1.0 {'1.4.1': 343998, '1.4a3.dev63': 1579}
SPICE Dipeptides Single Points Dataset v1.2 {'1.4.1': 32743, '1.4a3.dev63': 1107}
SPICE Ion Pairs Single Points Dataset v1.1 {'1.4.1': 1399, '1.4a3.dev63': 27}
SPICE PubChem Set 1 Single Points Dataset v1.2 {'1.4.1': 100291, '1.4a3.dev63': 18192}
SPICE PubChem Set 2 Single Points Dataset v1.2 {'1.4.1': 99561, '1.4a3.dev63': 21963}
SPICE PubChem Set 3 Single Points Dataset v1.2 {'1.4a3.dev63': 8618, '1.4.1': 113578}
SPICE PubChem Set 5 Single Points Dataset v1.2 {'1.4a3.dev63': 18286, '1.4.1': 104843}
SPICE PubChem Set 6 Single Points Dataset v1.2 {'1.4.1': 123794}
SPICE Pubchem Set 4 Single Points Dataset v1.2 {'1.4.1': 109101, '1.4a3.dev63': 13629}
SPICE Solvated Amino Acids Single Points Dataset v1.1 {'1.4.1': 1300}

Quite a few used the earlier version. However, we can probably find a smaller subset if one goes into the stdout of each record and parses whether it used the in-core or disk algorithm. What is preferred? If we want to clear out this version completely, I can attach the resulting JSON file that maps the version to the dataset indices. I was not part of the submission process, so I am not sure what is needed to identify the subset for resubmission.
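For reference, per-version totals can be tallied from the mapping the script builds (dataset name → psi4 version → list of record indices); a minimal sketch with made-up indices:

```python
# Sketch: sum record counts per psi4 version across all datasets, given the
# {dataset: {version: [record indices]}} mapping built by the script above.
import collections

def total_counts(versions):
    totals = collections.Counter()
    for by_version in versions.values():
        for ver, indices in by_version.items():
            totals[ver] += len(indices)
    return dict(totals)

# Toy example with made-up indices; real data comes from versions.json.
example = {
    "ds-a": {"1.4.1": ["a-0", "a-1"], "1.4a3.dev63": ["a-2"]},
    "ds-b": {"1.4.1": ["b-0"]},
}
print(total_counts(example))  # -> {'1.4.1': 3, '1.4a3.dev63': 1}
```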

peastman commented 1 year ago

Thanks. Those add up to 1051292 for 1.4.1 and 85009 for 1.4a3.dev63. If we can further refine the list based on whether it used in-core or disk, that would be great. In total it was used for about 7.5% of the conformations, but I don't think that many had errors. Large forces only occurred in about 2%.

A related question: is there something we can do on the QCFractal side so in the future we can guarantee all workers are using the correct version of psi4?

trevorgokey commented 1 year ago

It looks like only ~1800 were on disk, which were presumably the largest molecules. I'll randomly sample some from the possible offending set today and see if I get different energies with 1.4.1.

As far as version constraints, I am not the best person to ask but I know it's been a discussion topic. @bennybp may offer some input here. This case would definitely be a +1 for putting versions into the QC spec.

trevorgokey commented 1 year ago

Here is the result of the scan. It looks like most that used the bad settings/version are OK, but there are some scattered errors that are not exclusive to dipeptides. I used Psi4 1.6.1 due to some difficulty getting 1.4.1. Pavan is working on 1.4.1 for the same set but I'd expect similar results.

Name                                                  |  Index               |  Psi4 1.6.1    |  Psi4 1.4a     |  DE         
------------------------------------------------------------------------------------------------------------------------------
SPICE Dipeptides Single Points Dataset v1.2           |  phe-met-3           |  -1.45173e+03  |  -1.45173e+03  |  -1.19758e-09
SPICE Dipeptides Single Points Dataset v1.2           |  phe-lys-4           |  -1.14857e+03  |  -1.14857e+03  |  -3.99914e-06
SPICE Dipeptides Single Points Dataset v1.2           |  phe-met-20          |  -1.45173e+03  |  -1.45173e+03  |  -4.30891e-07
SPICE DES370K Single Points Dataset v1.0              |  cc(=o)n [li+]-13    |  -2.16701e+02  |  -2.16701e+02  |  -5.47287e-08
SPICE DES370K Single Points Dataset v1.0              |  cc(=o)o [li+]-17    |  -2.36554e+02  |  -2.36554e+02  |  -2.67762e-08
SPICE DES370K Single Points Dataset v1.0              |  cc(c)c [na+]-1      |  -3.20642e+02  |  -3.20642e+02  |  -2.27618e-08
SPICE DES370K Single Points Dataset Supplement v1.0   |  [Li+].CSSC-25       |  -8.83683e+02  |  -8.83683e+02  |   2.59091e-08
SPICE DES370K Single Points Dataset Supplement v1.0   |  [Li+].CC(=O)N-11    |  -2.16708e+02  |  -2.16708e+02  |  -4.88297e-08
SPICE PubChem Set 1 Single Points Dataset v1.2        |  125084655-10        |  -6.89832e+02  |  -6.89832e+02  |  -2.39495e-08
SPICE PubChem Set 1 Single Points Dataset v1.2        |  123110178-24        |  -1.48504e+03  |  -1.48504e+03  |   1.29959e-06
SPICE PubChem Set 1 Single Points Dataset v1.2        |  104083449-11        |  -2.90711e+03  |  -2.90711e+03  |  -3.57471e-07
SPICE PubChem Set 2 Single Points Dataset v1.2        |  103917654-34        |  -9.36038e+02  |  -9.36038e+02  |  -1.24026e-06
SPICE PubChem Set 2 Single Points Dataset v1.2        |  103919895-1         |  -1.01222e+03  |  -1.01222e+03  |  -1.41133e-06
SPICE PubChem Set 2 Single Points Dataset v1.2        |  103926559-22        |  -9.01019e+02  |  -9.01019e+02  |   1.88787e-07
SPICE PubChem Set 3 Single Points Dataset v1.2        |  103925154-40        |  -2.30982e+03  |  -2.30982e+03  |  -9.04750e-07
!!! SPICE PubChem Set 3 Single Points Dataset v1.2        |  103925364-8         |  -1.23762e+03  |  -1.30025e+03  |  -6.26254e+01
SPICE PubChem Set 3 Single Points Dataset v1.2        |  103941102-30        |  -4.04861e+03  |  -4.04861e+03  |  -8.48815e-07
SPICE Pubchem Set 4 Single Points Dataset v1.2        |  103925531-31        |  -1.15973e+03  |  -1.15973e+03  |  -2.43947e-07
SPICE Pubchem Set 4 Single Points Dataset v1.2        |  103925870-29        |  -1.27892e+03  |  -1.27892e+03  |  -7.29444e-06
SPICE Pubchem Set 4 Single Points Dataset v1.2        |  103925870-43        |  -1.27891e+03  |  -1.27891e+03  |  -8.10549e-06
SPICE PubChem Set 5 Single Points Dataset v1.2        |  103915235-19        |  -9.72691e+02  |  -9.72691e+02  |   2.65117e-07
SPICE PubChem Set 5 Single Points Dataset v1.2        |  103935718-16        |  -7.62440e+02  |  -7.62440e+02  |  -2.37496e-07
!!! SPICE PubChem Set 5 Single Points Dataset v1.2        |  103923869-17        |  -1.35836e+03  |  -1.38187e+03  |  -2.35105e+01
SPICE Ion Pairs Single Points Dataset v1.1            |  [li+:1].[cl-:2]-32  |  -4.67717e+02  |  -4.67717e+02  |   2.84717e-09
SPICE Ion Pairs Single Points Dataset v1.1            |  [li+:1].[i-:2]-43   |  -3.05250e+02  |  -3.05250e+02  |  -5.96856e-09
SPICE Ion Pairs Single Points Dataset v1.1            |  [li+:1].[cl-:2]-2   |  -4.67833e+02  |  -4.67833e+02  |  -4.81748e-10
peastman commented 1 year ago

Thanks! If there's only 1800 we need to rerun that should be really easy. The other thing we should check is whether the other suspect samples are really correct. There were about 20,000 samples that I excluded from fitting due to very large forces. If most of them did not use the incorrect settings, does that mean the forces really were that large? Let's test a few of them with the newer psi4 to make sure.

trevorgokey commented 1 year ago

The 1,800 used the disk algorithm, which should have given the correct result, so I did not include those in the sample. I sampled from the other 85k, which used the in-core algorithm.

peastman commented 1 year ago

Ok. So we have about 83,000 that used the bad settings, most of which are correct but an unknown subset are not. I checked the two bad samples listed in your table above, and they also are ones that got excluded due to absurdly large forces. So that probably is a reliable way of identifying the bad samples, although we can't guarantee it. So we have two options.

  1. Rerun all samples that used the bad settings (about 83,000).
  2. Rerun only the ones with large forces (about 20,000).
trevorgokey commented 1 year ago

@pavankum @dotsdl I think we have a course of action to resubmit which should in turn fix this issue.

pren commented 1 year ago

The bad ones are quite obvious: the errors are many standard deviations out (the unit is hartree, roughly 600 kcal/mol).

peastman commented 1 year ago

Do we have a plan for moving ahead with this?

davidlmobley commented 1 year ago

@peastman this is probably a good issue to discuss with @jchodera to figure out a plan; perhaps he can also help orchestrate with MolSSI, since the issue of "can we delete bad samples" is a MolSSI thing, and the "who" issue is also nontrivial.

On the OpenFF end, we don't have someone to devote to this at present. My postdoc Pavan Behara lent a hand with running it, and our understanding was that this was a one-off effort to support ongoing research on your end. He's now focused on force field fitting for a new point release and our next major release, Rosemary. He could have time after that, but that would be some months out. The other obvious candidates on our end are either similarly tied up or supported primarily by industry funds for stuff which is on our short-term roadmap (rather than this dataset) or both.

peastman commented 1 year ago

How about scheduling a meeting to discuss it? My schedule for this week is mostly open.

davidlmobley commented 1 year ago

If you and John schedule a meeting, I'm happy to jump in on it, though (a) I'm fully booked this week, and (b) I don't have any way I can lend a hand with it at present, so you don't want to meet with just me because I don't know the answers to your questions.

jchodera commented 1 year ago

@peastman : I'm meeting with @dotsdl tomorrow to see if we can work out a plan for tackling this. Will update you afterwards.

peastman commented 1 year ago

Sounds good, thanks.

jchodera commented 1 year ago

@peastman : Given our personnel resource constraints, here's the best we could come up with:

  1. In the very short term, @dotsdl will work with you to modify the HDF5 retrieval script to drop out the ~83K snapshots generated with the faulty psi4 release. We can release a spice-1.2.1 bugfix release that omits any data that has a chance of being erroneous. It should not impact the resulting artifact too much.
  2. On a slightly longer term, @dotsdl is charged with creating a standard operating procedure for archival of OpenFF datasets from QCArchive to Zenodo. The SPICE dataset at the OpenFF default spec level of theory will also be archived there.
  3. When resourcing permits, @dotsdl or a new developer charged with supporting QCFractal will prepare a spice-1.3 submission where the ~83K snapshots that must be regenerated are slightly displaced in space so the QCFractal infrastructure will just recompute those displaced snapshots. The other snapshots will be pulled from the older QCArchive datasets. This is going to require a substantial number of hours to do it carefully and correctly. We might also consider combining these into a single dataset at this point to allow archival on the QCArchival ML dataset platform as a single monolithic dataset if desired.

Separately, we can discuss with @giadefa whether it makes sense to continue to use QCFractal for new datasets. It was not quick for SPICE, it cannot easily support active learning, and we are very limited in personnel time able to support the generation of new ML datasets at this time. Other options you have direct control over---such as Stanford clusters or potentially supercomputers Tom has access to---may be better options.

peastman commented 1 year ago

drop out the ~83K snapshots generated with the faulty psi4 release

Based on the above results, most of that data is actually correct. It's probably only a small fraction that has problems: the ones that fail the force check.

I can help with preparing the new submission. I just need instructions on how to do it.

Other options you have direct control over---such as Stanford clusters or potentially supercomputers Tom has access to---may be better options.

Rerunning those samples just on the Stanford clusters will take a few months. During the initial generation, they represented about 15% of the computation.

In any case, the data needs to get into QCArchive one way or another. That's what the Zenodo releases get built from.

jchodera commented 1 year ago

We can skip step (1), but step (3) will require a significant amount of training and direct coordination / engagement with stakeholders. Are you sure you are up for doing this? If so, I'll see what might be possible since this will still require training time from the OpenFF side.

peastman commented 1 year ago

Is there documentation describing what needs to be done? I can take a look and see how much work it involves.

jchodera commented 1 year ago

It would be useful to have a quick chat with @dotsdl first to have him fill you in on who would need to be involved, what would need to be done, and what documentation is extant for your review.

peastman commented 1 year ago

Shall we have a call with the three of us? My schedule is very flexible. Send an invite for whenever is good for you.

peastman commented 1 year ago

It looks like this is going to take a while, so I'm regenerating the HDF5 file with the force filter enabled. That should remove the bad samples. I'll post it to Zenodo as version 1.1.3.

@dotsdl it would be really helpful if we could meet to talk about this. I'd like to get started on regenerating the data. Can you email me a list of times when you're available?

dotsdl commented 1 year ago

@peastman if you are comfortable cutting a new release to Zenodo using the force filter for the problematic samples, then I can take on making new submissions to QCArchive for correcting those samples. I don't believe a meeting will help us here, frankly; I know what needs to be done, it's just time-consuming and careful work, and I have several other projects at higher priority at this time.

I will select all calculations run with psi4 < 1.4.1, translate their molecules by an Angstrom, then submit these along with the originals as v1.3 versions for each of the SPICE single point sets.

At present we are running with psi4 == 1.6.1 in production. If this is a problem, please offer reasons why before we proceed.

peastman commented 1 year ago

If you have the time to help, that's great. Part of the goal though is for me to learn how to do it!

I will select all calculations run with psi4 < 1.4.1

Sounds good.

translate their molecules by an Angstrom

Why translate them?

submit these along with the originals as v1.3 versions for each of the SPICE single point sets.

How exactly will that work out in practice? If someone downloads the corresponding subset will they see a mix of the old calculations for samples we didn't rerun, and new calculations for ones we did? Or will this appear as a completely new dataset, and we need to figure out how to mix and match from the two? And if you have both an original and a translated copy of each sample, does that mean it will show up as two samples?

At present we are running with psi4 == 1.6.1 in production. If this is a problem, please offer reasons why before we proceed.

Mixing psi4 versions is what got us into this position in the first place. We need to be very careful about this. If it's necessary to switch to a newer version, we first need to thoroughly validate that it produces identical results for a decent size collection of samples.

dotsdl commented 1 year ago

translate their molecules by an Angstrom

Apologies, I should have clarified this. QCFractal deduplicates based on a hash of molecule inputs. So, to trigger recompute of only the molecules we wish to recompute, the least invasive way to break the hash is to translate the coordinates of those input molecules. In this way, we need only recompute the records we actually want to.
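A minimal sketch of that hash-breaking translation (the array shapes and x-axis choice are illustrative; the key point is that a rigid translation leaves the QM energy unchanged while producing a different molecule hash):

```python
# Sketch: shift every atom of a conformation by 1 Angstrom along x so that
# QCFractal's molecule hash no longer matches the existing record.
import numpy as np

def translate_conformation(coords: np.ndarray, shift_angstrom: float = 1.0) -> np.ndarray:
    """Return a copy of (n_atoms, 3) coordinates translated along x."""
    shifted = coords.copy()
    shifted[:, 0] += shift_angstrom
    return shifted

coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
print(translate_conformation(coords))  # x column increased by 1.0
```

Because the translation is rigid, the recomputed single-point energy is directly comparable to the original record.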

dotsdl commented 1 year ago

So, the v1.3 versions of the datasets would feature both the existing records that we didn't find problematic and new calculations in place of the ones that were problematic. They would be complete datasets just like the v1.2 versions.

peastman commented 1 year ago

Ok, that sounds good. If we're going to use a different version of psi4, we need to validate that it produces identical results to 1.4.1. Perhaps take 100 randomly chosen samples from the set, compute them with both, and check that all outputs match to the expected precision?
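The check proposed here could look something like this (a sketch; the 1e-6 hartree tolerance and the sample energies are illustrative assumptions):

```python
# Sketch: compare per-conformation energies from two psi4 versions and flag
# any pair that disagrees beyond a tolerance. Tolerance is an assumption.
def mismatched(energies_a, energies_b, tol=1e-6):
    """Return indices where the two energy lists disagree by more than tol (hartree)."""
    return [i for i, (a, b) in enumerate(zip(energies_a, energies_b)) if abs(a - b) > tol]

# Illustrative values in the spirit of the phe-trp records discussed above.
e_old = [-1337.65454105, -1337.68759143, -1416.80743657]
e_new = [-1337.65454105, -1337.68759143, -1337.65177393]
print(mismatched(e_old, e_new))  # -> [2]
```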

trevorgokey commented 1 year ago

That is essentially the test I did, but with 3 random samples from each set (N=26; there was one failure I didn't pursue). I used 1.6.1, and most were close to 1.4 except those few egregious cases.

peastman commented 1 year ago

I thought you compared 1.4a3.dev63+afa0c21 to 1.4.1, and all of the errors were fixed in 1.4.1?

trevorgokey commented 1 year ago

I pulled the 1.4a3.dev63+afa0c21 jobs, reran them with 1.6.1, and posted those results in the table above. Pavan was running some with 1.4.1 using the same script; he can chime in with any updates, but the table showed that 1.6.1 performed the same as 1.4a in all cases except those 2 errors, so 1.4.1 should give similar results if it indeed fixes the original bug.

peastman commented 1 year ago

I ran some more tests to understand what's happening. I selected 50 samples that have very large forces (>1 hartree/bohr) and recomputed them with both 1.4.1 and 1.6.1. One sample failed to converge with 1.4.1. Here are the maximum force components (in hartree/bohr) from all three versions for the other 49 samples.

Sample SPICE 1.4.1 1.6.1
103958910-47 27.66 0.08885 0.08885
135081216-3 18.31 18.35 18.39
135025755-8 3142 3142 7.987
135354304-12 2.81 0.08594 0.08594
136965498-1 3.007 0.06942 0.06942
135143569-6 5.285 0.1252 0.1252
135236286-5 13.79 0.1179 0.1179
104115701-20 2.198e+05 2.198e+05 81.33
135161177-1 6.094 0.1003 0.1003
134993094-36 1.092e+05 1.092e+05 22.66
104090306-46 2.733 0.04137 0.04137
11108569-18 4.093 0.07104 0.07104
135175367-35 5.691 0.08434 0.08434
135261556-8 2.277 0.0683 0.0683
135184924-2 1.487 0.04522 0.04522
135135588-4 1.377 0.08939 0.08939
135030147-16 5.439 0.0832 0.0832
103932730-47 7.032 0.05266 0.05266
135120119-4 6.857 0.1104 0.1104
135231149-23 2.036 0.1216 0.1216
135018521-21 2.706 2.696 0.09297
11109496-11 2.639 0.1199 0.1199
135007168-33 4.982 0.05493 0.05493
104048049-6 15.24 0.08228 0.08228
135045604-2 2.406 0.09543 0.09543
125086940-21 9.124e+04 9.124e+04 25.72
135093763-49 12.36 0.03258 0.03258
134978476-0 2.355 2.355 0.04331
103934044-29 27.87 0.02362 0.02362
135239870-0 2.48 0.05822 0.05822
103956336-32 36.23 0.2003 0.2003
135213822-49 1.167e+04 1.167e+04 28.06
135263365-2 4.327 0.142 0.142
104101928-3 13.5 0.09252 0.09252
103941373-14 1.107 1.107 1.107
135101715-38 1.007e+05 1.007e+05 29.47
136350598-17 2.277 0.07219 0.07219
104025843-48 2.162 0.06097 0.06096
135084333-13 8.421e+05 8.421e+05 148.2
131342172-14 2.069 0.07299 0.07299
135085710-39 6.387 6.387 6.466
135136967-22 4.61 0.07233 0.07233
104122545-9 1.053 1.053 1.053
135275844-37 2.653 0.0721 0.0721
134976068-13 5.799 0.06748 0.06748
125087597-16 4.138 0.05918 0.05918
135133198-17 1.636 1.636 0.05963
104149025-3 2.759e+05 2.759e+05 115.1
135234569-2 5.507 0.1416 0.1416

For 34 of them, 1.4.1 and 1.6.1 agree with each other and now produce much more reasonable results.

For 11 of them, 1.4.1 agrees with the value reported in SPICE, but 1.6.1 produces a much lower force. For three of those the force is now below the cutoff. For the other 8 the force is still extremely large, just not as insanely large as it was before.

There were four samples for which all three versions agree. I don't know if that means they're really physically accurate, but if the large forces in those samples are due to a bug, it still hasn't been fixed in 1.6.1.

So it looks like there was an improvement from 1.4a3.dev63+afa0c21 to 1.4.1 and a further improvement from 1.4.1 to 1.6.1. But it also looks like many of these samples either really do have huge forces, or are very difficult to converge. They still have forces that would be way above the cutoff even with 1.6.1.

peastman commented 10 months ago

We're getting ready to run the new calculations, which means we need to pick what Psi4 version to use. I repeated the above calculations with 1.8.1, which is the most recent release. The results are identical to 1.6.1. So I think it makes sense to use 1.8.1 for all the SPICE 2.0 calculations.

As for which samples to rerun, I suggest any that either were done with the prerelease 1.4 version, or that have a force component above 1.0. That should be somewhere in the range of about 20,000 samples, which won't take long to run.
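The selection rule suggested here could be sketched as follows (the version-string check and the field names are illustrative assumptions, not the exact filter applied to the datasets):

```python
# Sketch of the proposed rerun criterion: rerun a conformation if it was
# computed with the prerelease 1.4 psi4, or if any force component exceeds
# 1.0 hartree/bohr. The version-string test is an illustrative assumption.
def needs_rerun(psi4_version: str, max_force_component: float) -> bool:
    ran_with_prerelease = psi4_version.startswith("1.4a")
    return ran_with_prerelease or max_force_component > 1.0

print(needs_rerun("1.4a3.dev63", 0.05))  # True: prerelease version
print(needs_rerun("1.4.1", 27.66))       # True: absurdly large force
print(needs_rerun("1.4.1", 0.08))        # False
```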

kavi9030 commented 5 months ago

@peastman According to the previous discussion, the plan was to rerun all samples that used the bad settings (about 83,000) and the ones with large forces (about 20,000). May I know when this fix would become available?

peastman commented 5 months ago

It will be in SPICE 2.0. I don't have an ETA for that yet. Probably a few months.

peastman commented 3 months ago

SPICE 2 is now released, so this can be closed.