pyMBE-dev / pyMBE

pyMBE provides tools to facilitate building up molecules with complex architectures in the Molecular Dynamics software ESPResSo. For up-to-date API documentation, please check our website:
https://pymbe-dev.github.io/pyMBE/pyMBE.html
GNU General Public License v3.0

Store pKa datasets as JSON files #5

Closed pm-blanco closed 3 months ago

pm-blanco commented 6 months ago

The pKa datasets are currently stored in .txt files with the following structure:

# pKa-values from Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991.
{"D": {"pka_value": 3.65, "acidity": "acidic"}, "E": {"pka_value": 4.25, "acidity": "acidic"}, ...}

To improve both machine-readability and human-readability, the format could be redesigned like so:

{
  "metadata": {
    "summary": "pKa-values from CRC 72nd edition",
    "source": "Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991.",
    "isbn": "0-8493-0565-9"
  },
  "data": {
    "D": {"pka_value": 3.65, "acidity": "acidic"},
    "E": {"pka_value": 4.25, "acidity": "acidic"}
  }
}

The proposed format is fully JSON-compliant and improves several aspects of FAIR data[^fair-data-wikipedia],[^fair-data-wilkinson], in particular:

Regarding accessible language for knowledge representation: the current files use a JSON-like format, but a JSON parser cannot read them, because the JSON standard does not allow comment lines. There are competing standards such as JSON5 ("JSON5 Data Interchange Format") or Microsoft JSONC ("JSON with comments"), but both use C-style comments (// or /* */) rather than the pound symbol (#).

Regarding metadata: the comment line provides valuable information about the source of the dataset, which can be highly relevant to users. For example, the CRC Handbook of Chemistry and Physics has about 100 editions. Being able to query from the Python interface which edition was loaded into pyMBE could prove useful, for instance to generate a BibTeX file that properly cites the source of the dataset, or to make sure we are not using an edition that contains a typo in the pKa values. The metadata block introduced in the proposed format would enable such a feature.
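The two points above can be illustrated with a minimal sketch that uses only the Python standard library. The helper names `is_valid_json` and `metadata_to_bibtex`, the BibTeX key `pka_dataset` and the choice of a `@misc` entry are assumptions made for illustration; none of them is part of the pyMBE API.

```python
import json

# Current layout: a '#' comment line followed by the JSON payload.
legacy_text = (
    "# pKa-values from Handbook of Chemistry and Physics, 72nd Edition, "
    "CRC Press, Boca Raton, FL, 1991.\n"
    '{"D": {"pka_value": 3.65, "acidity": "acidic"}, '
    '"E": {"pka_value": 4.25, "acidity": "acidic"}}'
)

# Proposed layout: the source information lives in a "metadata" block.
proposed_text = """
{
  "metadata": {
    "summary": "pKa-values from CRC 72nd edition",
    "source": "Handbook of Chemistry and Physics, 72nd Edition, CRC Press, Boca Raton, FL, 1991.",
    "isbn": "0-8493-0565-9"
  },
  "data": {
    "D": {"pka_value": 3.65, "acidity": "acidic"},
    "E": {"pka_value": 4.25, "acidity": "acidic"}
  }
}
"""


def is_valid_json(text):
    """Return True if the standard-library JSON parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def metadata_to_bibtex(metadata, key="pka_dataset"):
    """Format the metadata block as a minimal BibTeX @misc entry.

    Hypothetical helper: the entry type and field names are illustrative only.
    """
    return (
        f"@misc{{{key},\n"
        f"  note = {{{metadata['source']}}},\n"
        f"  isbn = {{{metadata['isbn']}}}\n"
        f"}}"
    )


print(is_valid_json(legacy_text))    # the '#' line is rejected by the parser
print(is_valid_json(proposed_text))  # the proposed layout parses as-is

dataset = json.loads(proposed_text)
print(metadata_to_bibtex(dataset["metadata"]))
```

With the metadata stored as ordinary JSON keys, the edition that was loaded can be queried like any other dictionary entry, without parsing free-form comment text.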

The proposed format is also more human-readable, as it can be split over multiple lines or opened in Firefox, which has a built-in JSON viewer. Pretty-printing existing JSON files can be done automatically in Python:

import json
pka_set = json.loads(
  '{"D": {"pka_value": 3.65, "acidity": "acidic"}, "E": {"pka_value": 4.25, "acidity": "acidic"}}'
)
print(json.dumps(pka_set, indent=2))

Output:

{
  "D": {
    "pka_value": 3.65,
    "acidity": "acidic"
  },
  "E": {
    "pka_value": 4.25,
    "acidity": "acidic"
  }
}
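A one-off migration of the existing .txt files could follow the same idea. The sketch below assumes that every line starting with `#` is a comment describing the source and that the remaining lines form one JSON object; the function name `convert_legacy` is hypothetical, and only the `source` field is filled in because the other metadata fields cannot be derived from the comment line automatically.

```python
import json


def convert_legacy(text):
    """Split '#' comment lines from the JSON payload and wrap both in the
    proposed {"metadata": ..., "data": ...} layout (illustrative sketch)."""
    comments, payload = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Keep the comment text without the leading '#' marker.
            comments.append(stripped.lstrip("#").strip())
        elif stripped:
            payload.append(stripped)
    return {
        "metadata": {"source": " ".join(comments)},
        "data": json.loads("\n".join(payload)),
    }


legacy_text = (
    "# pKa-values from Handbook of Chemistry and Physics, 72nd Edition, "
    "CRC Press, Boca Raton, FL, 1991.\n"
    '{"D": {"pka_value": 3.65, "acidity": "acidic"}, '
    '"E": {"pka_value": 4.25, "acidity": "acidic"}}'
)

converted = convert_legacy(legacy_text)
print(json.dumps(converted, indent=2))
```

Fields such as `summary` and `isbn` would still have to be added by hand after the conversion.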

References:

[^fair-data-wikipedia]: Concise summary of FAIR data on Wikipedia: https://en.wikipedia.org/wiki/FAIR_data

[^fair-data-wilkinson]: Wilkinson et al. 2016. "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data. 3(1): 160018. doi:10.1038/SDATA.2016.18

Original issue by @jngrad, migrated from our original remote repository on GitLab; see here for further discussion on the topic.