theochem / AtomDB

An Extended Periodic Table of Neutral and Charged Atomic Species
http://atomdb.qcdevs.org/
GNU General Public License v3.0
[Doc] Discoverability of datasets' properties #15

Open gabrielasd opened 5 months ago

gabrielasd commented 5 months ago

We should have a table of which property is available for each dataset, both in the code, and in the published documentation.

In the library, we should also gracefully handle error cases where the user attempts to access a property that is unavailable in the current dataset.
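One way the graceful handling could look is sketched below. This is only an illustration under stated assumptions, not AtomDB's implementation: the `PropertyUnavailableError` class, the `Record` container, the dataset name, and the property names and values are all hypothetical.

```python
class PropertyUnavailableError(AttributeError):
    """Raised when the selected dataset does not provide a property (hypothetical name)."""


class Record:
    """Hypothetical stand-in for a loaded species; not AtomDB's actual class."""

    def __init__(self, dataset, **values):
        self.dataset = dataset
        self._values = values

    def get(self, name):
        # A missing key and an explicit None both mean "this dataset has no such data".
        value = self._values.get(name)
        if value is None:
            raise PropertyUnavailableError(
                f"Property {name!r} is not available in dataset {self.dataset!r}."
            )
        return value


# Illustrative values only.
record = Record("example_dataset", energy=-0.5, cov_radius=None)
try:
    record.get("cov_radius")
except PropertyUnavailableError as err:
    message = str(err)
```

Subclassing `AttributeError` keeps existing `hasattr`/`getattr`-style checks working while still giving the user a message that names both the property and the dataset.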

Ansh-Sarkar commented 4 months ago

Hi @gabrielasd. I would love to work on this issue. I am new to AtomDB and am currently exploring the codebase so that I can make some meaningful contributions.

By the way, I was reading the Getting Started notebook, and its link to the list of all available datasets seems to point back to the same notebook.

gabrielasd commented 4 months ago

Hi @Ansh-Sarkar, glad to hear you are interested in contributing to our package! And thanks for bringing this to our attention. You are correct, some of the links in that notebook are broken. Since this package is still under development and has not been released yet, some documentation may be incomplete or missing.

For now I can point you to the folder where all datasets will be found once compiled: https://github.com/theochem/AtomDB/tree/master/atomdb/datasets. There, the folders nist and slater contain some compiled atomic data (inside a db folder) so that one can explore atomdb's features.

I'd also suggest you look at the hello_atomdb notebook, which is a more recent version of the one you are looking at.

Regarding this issue, I should mention that it is also related to #16. Both refer to the same problem: how to document, and make available to the user, the properties stored from each source of atomic data. For an idea of the properties we aim to support, see issue #4.

Please, don't hesitate to ask if you have any questions or need further assistance.

Ansh-Sarkar commented 4 months ago

Thank you @gabrielasd for directing me to the aforementioned resources. They were very helpful! I've gone through the entire codebase and tried to analyze how everything comes together. I am briefly summarizing what I learned here before moving on to a few proposed solutions. Please do let me know if there are any flaws in my understanding. Hopefully this also helps new contributors get started faster.

Summarized Learnings

All the datasets can be found in this folder: /atomdb/datasets/. Under some of the folders corresponding to a dataset (/gaussian, /nist, /slater as of this writing) we have a folder named db consisting of certain MessagePack files that act as the source of data.

Whenever the load() method is called with specific parameters (element, charge, multiplicity, ...optionals), data is fetched from the MessagePack files and an object of the Species class is returned. The actual data lives in an instance of the SpeciesData class, which is initialized internally and used by the Species class; Species acts as a wrapper around the SpeciesData object and provides a few other utility functions. The __init__.py file under each available dataset folder acts as a script that converts the data from the dataset's own format and standardizes it by creating the corresponding Species class instances. These instances form the fundamental data format on which AtomDB works.
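The wrapper relationship described above can be sketched roughly as follows. This is a simplified illustration, not AtomDB's actual code: the field names, the `summary()` helper, and the in-memory `_FAKE_DB` dictionary (standing in for the MessagePack files) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the compiled MessagePack files; keys and values are illustrative only.
_FAKE_DB = {
    ("H", 0, 2): {"elem": "H", "charge": 0, "mult": 2, "energy": -0.5},
}


@dataclass
class SpeciesData:
    """Plain container for the raw fields (simplified, hypothetical field names)."""
    elem: str
    charge: int
    mult: int
    energy: Optional[float] = None


class Species:
    """Wrapper that exposes the stored fields and adds utility methods."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Called only when normal lookup fails; forward to the data container.
        if name == "_data":  # avoid infinite recursion if _data is not set yet
            raise AttributeError(name)
        return getattr(self._data, name)

    def summary(self):
        # Hypothetical utility method, shown only to illustrate the wrapper's role.
        return f"{self.elem} charge={self.charge} mult={self.mult}"


def load(elem, charge, mult):
    """Look up a record and return it wrapped in a Species instance."""
    record = _FAKE_DB[(elem, charge, mult)]
    return Species(SpeciesData(**record))


species = load("H", 0, 2)
```

Here `species.energy` is read from the wrapped `SpeciesData` object, while `species.summary()` is provided by the wrapper itself.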

Not every property is available in each of the standard datasets under the atomdb/datasets/ directory. Only certain fields of the Species object can be populated (depending on the dataset being used), while the others are set to default values (usually None). Hence, as mentioned in this issue (#15), it is important both to enable discovery of properties and to gracefully handle attempts to access properties that are unavailable for a given dataset. The properties being targeted have been broadly classified into Scalars and Vectors and are listed in #4.
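Since unavailable fields default to None, a property table could in principle be derived from the loaded objects themselves. The sketch below illustrates that idea under stated assumptions: the `Record` class, its field names, the dataset names `dataset_a` and `dataset_b`, and all values are hypothetical and say nothing about what the real datasets contain.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Record:
    """Hypothetical record; fields a dataset cannot fill keep the default None."""
    energy: Optional[float] = None
    ip: Optional[float] = None
    cov_radius: Optional[float] = None


def available_properties(record):
    """Return the names of the fields that were actually populated."""
    return [f.name for f in fields(record) if getattr(record, f.name) is not None]


def availability_table(samples):
    """Build a plain-text table: one row per dataset, one column per property."""
    names = [f.name for f in fields(Record)]
    header = f"{'dataset':<12}" + "".join(f"{n:<12}" for n in names)
    rows = [header.rstrip()]
    for dataset, record in samples.items():
        present = set(available_properties(record))
        marks = "".join(f"{'yes' if n in present else 'no':<12}" for n in names)
        rows.append((f"{dataset:<12}" + marks).rstrip())
    return "\n".join(rows)


# One sample record per (made-up) dataset; values are illustrative only.
samples = {
    "dataset_a": Record(energy=-0.5, ip=0.5),
    "dataset_b": Record(energy=-0.5),
}
table = availability_table(samples)
```

A table generated this way could be printed from the library and also pasted into the published documentation, so the two stay in sync.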

Proposed Solutions

The following points aim to address potential solutions for issues #15 and #16

Questions

msricher commented 4 months ago

I like these solutions.

Ansh-Sarkar commented 4 months ago

Thanks @msricher! I have started working on these solutions.

Ansh-Sarkar commented 3 months ago

Hi! A quick update on the progress. I will raise a PR soon if this is going in the right direction (currently I am only making changes to the gaussian dataset).

I would love to hear any feedback so that I can make the necessary corrections. Thanks :smile: and have a great day ahead!

Ansh-Sarkar commented 3 months ago

Hey @gabrielasd @msricher, I just wanted to ask whether I am good to go with the above changes and can open a PR for them. Do let me know if you have any feedback on this. Thanks!

gabrielasd commented 3 months ago

Hi @Ansh-Sarkar, please do submit your changes as a PR. That way we can take a better look at the code and give you feedback.