openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Mayer and Wiberg-Lowdin bond-indices missing from some subsets #76

IgnacioJPickering closed this issue 1 year ago

IgnacioJPickering commented 1 year ago

After parsing the dataset I found that some or all of the Wiberg-Lowdin and Mayer indices are missing for some subsets, specifically for:

MBIS seems to be missing from DES370K Supplement and Ion-Pairs too, but from issue #48 I gather that this is to be expected since most conformations could not converge MBIS in those subsets.

I wanted to double check that it is indeed intended that the bond indices are missing from these subsets, and if so what is the reason for this (I found it strange that they are present in PubChem Set 6 but not in the rest).

I haven't checked if the bond-indices are missing for all conformations or just some of them.

(This is mostly to double-check that I'm parsing the datasets correctly, I don't really have a use for the bond-indices currently)

IgnacioJPickering commented 1 year ago

https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2022-06-08-QMDataset-ion-pairs#metadata

this link seems to imply that the bond indices should be available for ion-pairs

peastman commented 1 year ago

No idea what's up with that. It seems to be by molecule, not subset. I just ran a count of how many molecules in each subset do or don't have Wiberg bond orders.

| Subset | Have | Don't Have |
|---|---|---|
| SPICE DES Monomers Single Points Dataset v1.1 | 374 | 0 |
| SPICE DES370K Single Points Dataset Supplement v1.0 | 6 | 87 |
| SPICE DES370K Single Points Dataset v1.0 | 3397 | 0 |
| SPICE Dipeptides Single Points Dataset v1.2 | 567 | 110 |
| SPICE Ion Pairs Single Points Dataset v1.1 | 12 | 16 |
| SPICE PubChem Set 1 Single Points Dataset v1.2 | 453 | 1919 |
| SPICE PubChem Set 2 Single Points Dataset v1.2 | 411 | 2020 |
| SPICE PubChem Set 3 Single Points Dataset v1.2 | 1447 | 999 |
| SPICE PubChem Set 4 Single Points Dataset v1.2 | 568 | 1887 |
| SPICE PubChem Set 5 Single Points Dataset v1.2 | 434 | 2029 |
| SPICE PubChem Set 6 Single Points Dataset v1.2 | 2476 | 0 |
| SPICE Solvated Amino Acids Single Points Dataset v1.1 | 26 | 0 |

There are a few subsets for which every molecule has bond orders, but in most cases some molecules do and some don't.
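A per-subset tally like the one above can be sketched in a few lines. This is not the script that produced the table: `count_bond_order_coverage` is a hypothetical helper, and both the field name `wiberg_lowdin_indices` and the (subset, field names) input shape are assumptions about how the parsed data is laid out.

```python
from collections import Counter


def count_bond_order_coverage(molecules, key="wiberg_lowdin_indices"):
    """Count, per subset, how many molecules have or lack a given field.

    `molecules` is an iterable of (subset_name, field_names) pairs. The
    default `key` is an assumed field name; adjust it to match your file.
    """
    have, missing = Counter(), Counter()
    for subset, fields in molecules:
        if key in fields:
            have[subset] += 1
        else:
            missing[subset] += 1
    # Report every subset seen, including those with a zero in either column.
    return {s: (have[s], missing[s]) for s in sorted(set(have) | set(missing))}


# Toy input (made up, not SPICE data).
example = [
    ("Subset A", {"conformations", "wiberg_lowdin_indices"}),
    ("Subset A", {"conformations"}),
    ("Subset B", {"conformations", "wiberg_lowdin_indices"}),
]
coverage = count_bond_order_coverage(example)
```

If the HDF5 file is opened with h5py, the pairs could be built from each molecule group's dataset names, assuming each group stores its subset name alongside its arrays.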

I queried the ion pairs dataset from QCArchive to see whether the data is missing there, or whether it's a problem in the downloader script. For about half the records, not only are the Wiberg bond orders missing, but the whole extras section of the record is completely empty.

@pavankum any idea what's going on?

pavankum commented 1 year ago

I tried to dig into it, but I am getting None when I try to retrieve records. It might have something to do with the server migration. I will ping @bennybp on Slack.

IgnacioJPickering commented 1 year ago

@peastman Thanks for the response. I suppose I'm parsing the data correctly then; I just missed the issue in Dipeptides for some reason. FWIW, I downloaded the dataset from Zenodo and did not use the downloader script.

pavankum commented 1 year ago

@peastman: @bennybp helped me debug this. The key "WIBERG_LOWDIN_INDICES" is populated for all the completed calculations, whereas a redundant key with spaces, "WIBERG LOWDIN INDICES", is not present in all of them. I checked the Ion Pairs dataset and could see 1426 records with Wiberg indices when I used the right key and 1389 when I used the second one with spaces.

I checked another small dataset, DES370K supplement, and I see 3631/3631 with the right key and 2004/3631 with the second one with spaces.
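The mismatch between the two spellings can be handled by preferring the underscore key and falling back to the spaced one. The following is a minimal sketch on plain dicts: `get_wiberg_indices` and `count_with_key` are hypothetical helpers, and exactly where these keys sit inside a record is an assumption to check against your own data.

```python
PREFERRED_KEY = "WIBERG_LOWDIN_INDICES"
SPACED_KEY = "WIBERG LOWDIN INDICES"  # redundant spelling with spaces


def get_wiberg_indices(values):
    """Return the Wiberg-Lowdin indices from a dict-like, or None if absent."""
    for key in (PREFERRED_KEY, SPACED_KEY):
        found = values.get(key)
        if found is not None:
            return found
    return None


def count_with_key(records, key):
    """Count how many dict-like records hold a non-None value under `key`."""
    return sum(1 for rec in records if rec.get(key) is not None)


# Toy records (made-up values) showing how the two counts can differ.
records = [
    {PREFERRED_KEY: [1.0, 0.9], SPACED_KEY: [1.0, 0.9]},
    {PREFERRED_KEY: [0.8, 0.7]},
    {},
]
```

Counting with each key separately, as in the numbers quoted above, makes it obvious when a downloader is reading the less complete spelling.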

On a side note, Ben gave me a conda env for accessing the legacy server; I was getting None before with 0.15.6:

name: qcportal_legacy
channels:
  - conda-forge
  - defaults
dependencies:
  - qcportal=0.15.8
  - msgpack-python=1.0.2=py39hff7bd54_1
  - pandas=1.3.5=py39h8c16a72_0
  - pydantic=1.9.0=py39h7f8727e_0
  - python=3.9.7=h12debd9_1
  - qcelemental=0.24.0=pyhd8ed1ab_0
  - nglview

peastman commented 1 year ago

Can you show how you're accessing it? I retrieve the records from the dataset with ds.get_records(). Then I look up the data from them with [recs.iloc[i].record.dict()['extras'] for i in range(len(recs))]. For about half the records in the ion pairs dataset, it's empty.

pavankum commented 1 year ago

I think I was doing almost the same

import qcportal as qcp

client = qcp.FractalClient()
ds = client.get_collection('Dataset', 'SPICE Ion Pairs Single Points Dataset v1.1')

# Pick the wb97m-d3bj spec explicitly instead of relying on row order.
for _, row in ds.list_records().iterrows():
    spec = row.to_dict()
    if spec['method'] == 'wb97m-d3bj':
        recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
        break

# Print the extras of the first record only.
for _, r in recs.iterrows():
    print(r.record.extras)
    break

peastman commented 1 year ago

Here's what I do:

from qcportal import FractalClient
fc = FractalClient()
ds = fc.get_collection('Dataset', 'SPICE Ion Pairs Single Points Dataset v1.1')
spec = ds.list_records().iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
print([recs.iloc[i].record.extras.keys() for i in range(len(recs))])

For about half the records there are two keys: dict_keys(['_qcfractal_tags', 'qcvars']). And for the other half it's empty: dict_keys([]).
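To see at a glance how many records fall into each case, the key sets can be tallied instead of printed one by one. This is a minimal sketch on plain dicts; `tally_key_sets` is a hypothetical helper, and the toy input only mimics the two shapes described above.

```python
from collections import Counter


def tally_key_sets(extras_list):
    """Map each distinct (sorted) tuple of keys to the number of records with it."""
    counts = Counter(tuple(sorted(extras)) for extras in extras_list)
    return dict(counts)


# Toy extras dicts (made up) in the two shapes seen in the dataset.
extras_list = [
    {"_qcfractal_tags": {}, "qcvars": {}},
    {},
    {"qcvars": {}, "_qcfractal_tags": {}},
    {},
]
tally = tally_key_sets(extras_list)
```

With real records, the input could be built as `[recs.iloc[i].record.extras for i in range(len(recs))]`, following the access pattern in the snippet above.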

pavankum commented 1 year ago

Your call is accessing a different spec: iloc[0] picks the first row, which is B3LYP/dzvp rather than wb97m-d3bj.

spec = ds.list_records().iloc[0].to_dict()

output:
{'driver': 'gradient',
 'program': 'psi4',
 'method': 'b3lyp',
 'basis': 'dzvp',
 'keywords': 'openff-default',
 'name': 'B3LYP/dzvp-openff-default'}
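One way to avoid depending on row order is to select the spec by method name. The following is a minimal sketch on a list of dicts: `select_spec` is a hypothetical helper, and the second entry in `specs` uses placeholder values rather than fields copied from the dataset.

```python
def select_spec(specs, method):
    """Return the first spec dict whose 'method' matches, ignoring case."""
    for spec in specs:
        if spec["method"].lower() == method.lower():
            return spec
    raise LookupError(f"no spec with method {method!r}")


specs = [
    {"driver": "gradient", "program": "psi4", "method": "b3lyp",
     "basis": "dzvp", "keywords": "openff-default",
     "name": "B3LYP/dzvp-openff-default"},
    # Placeholder values below; only the method name comes from the thread.
    {"driver": "gradient", "program": "psi4", "method": "wb97m-d3bj",
     "basis": "placeholder-basis", "keywords": "placeholder-keywords",
     "name": "placeholder"},
]
chosen = select_spec(specs, "wb97m-d3bj")
```

Given the DataFrame returned by `ds.list_records()`, a list of this shape can be obtained with `to_dict("records")`, and the chosen spec's fields can then be passed to `ds.get_records()` as in the snippets above.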

peastman commented 1 year ago

The updated file is now available on Zenodo. Thanks for reporting this!