Store pKa datasets as JSON files

The pKa datasets are currently stored in .txt files with the following structure:

# pKa-values from Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991.
{"D": {"pka_value": 3.65, "acidity": "acidic"}, "E": {"pka_value": 4.25, "acidity": "acidic"}, ...}

To improve both machine-readability and human-readability, the format could be redesigned like so:

{
  "metadata": {
    "summary": "pKa-values from CRC 72nd edition",
    "source": "Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991.",
    "isbn": "0-8493-0565-9"
  },
  "data": {
    "D": {"pka_value": 3.65, "acidity": "acidic"},
    "E": {"pka_value": 4.25, "acidity": "acidic"}
  }
}

The proposed format is fully JSON-compliant and improves several aspects of FAIR data[^fair-data-wikipedia][^fair-data-wilkinson], in particular:

F2. Data are described with rich metadata
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I3. (Meta)data include qualified references to other (meta)data

Regarding an accessible language for knowledge representation: the current files use a JSON-like format, but a JSON parser cannot read them, because the JSON standard does not permit comment lines. Competing formats such as JSON5 ("JSON5 Data Interchange Format") and Microsoft JSONC ("JSON with comments") do allow comments, but both use C-style comments (// or /* */) rather than the hash symbol (#).
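The parsing problem can be reproduced with the standard library alone. The snippet below is a minimal sketch: `legacy_text` is an illustrative string standing in for the contents of one of the current .txt files, and the comment-stripping step is just one possible workaround, not necessarily what pyMBE does today.

```python
import json

# Illustrative stand-in for a file in the current layout:
# a '#' comment line followed by a single-line JSON object.
legacy_text = (
    "# pKa-values from Handbook of Chemistry and Physics, 72nd Edition\n"
    '{"D": {"pka_value": 3.65, "acidity": "acidic"}}'
)

# A strict JSON parser rejects the comment line.
try:
    json.loads(legacy_text)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print(f"json.loads failed: {err}")

# Possible workaround: drop '#' lines by hand before parsing.
json_body = "\n".join(
    line for line in legacy_text.splitlines()
    if not line.lstrip().startswith("#")
)
pka_set = json.loads(json_body)
print(pka_set["D"]["pka_value"])
```

Any tool that reads these files needs the same custom pre-processing, which is what the proposed format avoids.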

Regarding metadata: the comment line records the source of the dataset, which can be highly relevant to users. The CRC Handbook of Chemistry and Physics has about 100 editions, so it would be useful to query from the Python interface which edition was loaded into pyMBE, for example to generate a BibTeX file that properly cites the source of the dataset, or to make sure we are not using an edition that contains a typo in the pKa values. The metadata block introduced in the proposed format would enable such a feature.
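As a rough sketch of what such a feature could look like, the snippet below reads a document in the proposed format and builds a minimal BibTeX entry from its metadata block. The helper names `describe_dataset` and `to_bibtex`, the citation key, and the choice of a `@misc` entry are assumptions made for illustration; they are not part of pyMBE.

```python
import json

# Text of a file in the proposed layout (values taken from the example in this issue).
proposed_text = """
{
  "metadata": {
    "summary": "pKa-values from CRC 72nd edition",
    "source": "Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991.",
    "isbn": "0-8493-0565-9"
  },
  "data": {
    "D": {"pka_value": 3.65, "acidity": "acidic"},
    "E": {"pka_value": 4.25, "acidity": "acidic"}
  }
}
"""


def describe_dataset(text):
    """Hypothetical helper: split a proposed-format document into (metadata, data)."""
    document = json.loads(text)
    return document["metadata"], document["data"]


def to_bibtex(metadata, key="pka_dataset"):
    """Hypothetical helper: build a minimal BibTeX @misc entry from the metadata."""
    return (
        f"@misc{{{key},\n"
        f"  note = {{{metadata['source']}}},\n"
        f"  isbn = {{{metadata['isbn']}}}\n"
        f"}}"
    )


metadata, data = describe_dataset(proposed_text)
bibtex = to_bibtex(metadata)
print(metadata["summary"])
print(bibtex)
```

Because the metadata lives in ordinary JSON keys, no custom comment parsing is needed to recover the edition or the ISBN.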

The proposed format is also more human-readable, as it can be split over multiple lines or opened in Firefox, which has a built-in JSON viewer. Pretty-printing existing JSON files can be done automatically in Python:

import json
pka_set = json.loads(
  '{"D": {"pka_value": 3.65, "acidity": "acidic"}, "E": {"pka_value": 4.25, "acidity": "acidic"}}'
)
print(json.dumps(pka_set, indent=2))

Output:

{
  "D": {
    "pka_value": 3.65,
    "acidity": "acidic"
  },
  "E": {
    "pka_value": 4.25,
    "acidity": "acidic"
  }
}
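Converting the existing files could be automated in a similar way. The following is a minimal sketch, assuming each .txt file consists of one or more `#` comment lines followed by a JSON object; `convert_legacy` is a hypothetical helper, and it only fills the `source` field, so `summary` and `isbn` would still have to be added by hand.

```python
import json


def convert_legacy(text):
    """Hypothetical converter from the '#'-commented layout to the proposed one."""
    comments, body = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Keep the comment text without the leading '#'.
            comments.append(stripped.lstrip("#").strip())
        elif stripped:
            body.append(line)
    return {
        "metadata": {"source": " ".join(comments)},
        "data": json.loads("\n".join(body)),
    }


# Illustrative stand-in for the contents of a current .txt file.
legacy_text = (
    "# pKa-values from Handbook of Chemistry and Physics, 72nd Edition, "
    "CRC Press, Boca Raton, FL, 1991.\n"
    '{"D": {"pka_value": 3.65, "acidity": "acidic"}, '
    '"E": {"pka_value": 4.25, "acidity": "acidic"}}'
)
converted = convert_legacy(legacy_text)
print(json.dumps(converted, indent=2))
```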

References:

[^fair-data-wikipedia]: Concise summary of FAIR data on Wikipedia: https://en.wikipedia.org/wiki/FAIR_data

[^fair-data-wilkinson]: Wilkinson et al. 2016. "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data. 3(1): 160018. doi:10.1038/SDATA.2016.18

Original issue by @jngrad, migrated from our original remote repository on GitLab; see here for further discussion on the topic.
