petosa / mongo_qcdb

MongoDB backend for storing quantum chemical databases

Hashing strategy, looking forward #4

Open petosa opened 7 years ago

petosa commented 7 years ago

Although the current hashing strategy works, it has a drawback: any time you add a new field to the molecule JSON, the hash for every molecule entry effectively changes. This means that any time you add/remove a field in all molecules, either you:

I'm not sure how often we will need to change the JSON structure for a molecule once we go into production, but it's a good idea to take some proactive measures.

I'm wondering, @dgasmith, what are the minimal fields that define a unique molecule? You mentioned that geometry is not enough, but would hashing symbol + geometry, or maybe name + geometry, be sufficient? If we base our hashes on 2 or 3 essential fields, then adding or removing peripheral fields in all molecule JSONs won't have an avalanche effect on our hashes.
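To make the trade-off concrete, here is a minimal sketch comparing a whole-document hash with a hash over a few essential fields. Everything in it is an assumption for illustration: the helper name `hash_fields`, the choice of SHA-1 over sorted-key JSON, the `ESSENTIAL` field pair, and the placeholder geometry values are hypothetical and not taken from the repo.

```python
import hashlib
import json


def hash_fields(doc, fields=None):
    """Return a SHA-1 hex digest of `doc`, optionally restricted to `fields`.

    Hypothetical helper: the real project may serialize and hash differently.
    """
    if fields is not None:
        # Keep only the fields that are meant to define identity.
        doc = {key: doc[key] for key in fields}
    # sort_keys makes the serialization independent of key insertion order.
    payload = json.dumps(doc, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Placeholder molecule; the geometry numbers are made up for the example.
molecule = {
    "symbols": ["C", "O", "O"],
    "name": "Carbon Dioxide",
    "geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 2.2, 0.0, 0.0, -2.2],
}

# Simulate a schema change that adds a peripheral field to the document.
extended = dict(molecule, comment="added after the first release")

# Assumed candidate for the essential fields, per the question above.
ESSENTIAL = ("symbols", "geometry")

whole_changed = hash_fields(molecule) != hash_fields(extended)
subset_stable = hash_fields(molecule, ESSENTIAL) == hash_fields(extended, ESSENTIAL)
print(whole_changed, subset_stable)
```

With the whole document hashed, the added `comment` field changes the digest; with only the essential fields hashed, the digest stays the same.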

dgasmith commented 7 years ago

From our current list:

  "symbols": ["C", "O", "O"],
  "masses": [16.0, 18.0, 18.0],
  "name": "Carbon Dioxide",
  "charge": 0.0,
  "multiplicity": 1,
  "ghost": [false, false, false],
  "geometry": [ .. ]

This should be enough. We also need one more field called "fragments", a list of lists, which will need to be included in the hash as well.
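As a rough sketch of what hashing over exactly this field list could look like (the function name `molecule_hash`, the SHA-1 digest, the compact JSON serialization, and the example geometry and fragments values are all assumptions, not the repo's actual code):

```python
import hashlib
import json

# Fields listed above, plus "fragments". Only these feed the hash.
HASH_FIELDS = (
    "symbols", "masses", "name", "charge",
    "multiplicity", "ghost", "geometry", "fragments",
)


def molecule_hash(molecule):
    """Hash a molecule dict using only HASH_FIELDS (hypothetical helper)."""
    missing = [field for field in HASH_FIELDS if field not in molecule]
    if missing:
        # Refuse to hash an incomplete molecule rather than hash partial data.
        raise KeyError("molecule is missing hash fields: %s" % ", ".join(missing))
    identity = {field: molecule[field] for field in HASH_FIELDS}
    # Sorted keys and fixed separators give one canonical string per molecule.
    payload = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Values copied from the listing above; geometry and fragments are placeholders.
co2 = {
    "symbols": ["C", "O", "O"],
    "masses": [16.0, 18.0, 18.0],
    "name": "Carbon Dioxide",
    "charge": 0.0,
    "multiplicity": 1,
    "ghost": [False, False, False],
    "geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 2.2, 0.0, 0.0, -2.2],
    "fragments": [[0, 1, 2]],
}

print(molecule_hash(co2))
```

One caveat with this approach: floating-point values such as geometry are hashed through their JSON text form, so two geometries that differ only by rounding noise would produce different hashes unless they are normalized first.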

I hope that once we build a molecule there will be no reason to rebuild it. However, I cannot definitively say this.

petosa commented 7 years ago

@dgasmith Similar question for pages and databases: what fields define the uniqueness of a page and database document?

For example, is it possible to have multiple databases with the same name? Or multiple pages with the same molecule and method?

petosa commented 7 years ago

In the meantime, as of 357f2803556739a2745e312424904dd0910de143 the hashes are computed from the fields described in the README. The fragments field has also been added to molecules.