Open gabrielasd opened 5 months ago
Hi @gabrielasd . Would love to work on this issue. Am new to AtomDB and currently exploring the codebase in order to be able to help with some meaningful contributions.
By the way, I was referring to the Getting Started notebook, and the following link (for a list of all the available datasets) seems to be pointing back to the same notebook.
Hi @Ansh-Sarkar, glad to hear you are interested in contributing to our package! And thanks for bringing this to our attention. You are correct, some of the links in that notebook are broken. Since this package is still under development ahead of its release, there may be some incomplete or missing documentation.
For now I can point you to the folder where all datasets will be found once compiled:
https://github.com/theochem/AtomDB/tree/master/atomdb/datasets
From there, the folders `nist` and `slater` contain some compiled atomic data (inside a `db` folder) so that one can explore AtomDB's features.
I'd also suggest you look at the `hello_atomdb` notebook, which is a more recent version of the one you are looking at.
Regarding this issue, I should mention it is also related to #16. Both refer to the same problem of how to document the properties stored from each source of atomic data and make them available to the user. For an idea of the properties we aim to support, you can see issue #4.
Please, don't hesitate to ask if you have any questions or need further assistance.
Thank you @gabrielasd for directing me to the aforementioned resources. They were very helpful! I've gone through the entire codebase and tried to analyze how everything comes together. I'll briefly mention my learnings here before moving on to a few proposed solutions. Please do let me know if there are any flaws in my understanding. Hopefully this also helps new contributors get started with their contributions faster.
All the datasets can be found in this folder: `/atomdb/datasets/`. Under some of the dataset folders (`/gaussian`, `/nist`, `/slater` as of this writing) there is a folder named `db` consisting of certain MessagePack files that act as the source of data.
Whenever the `load()` method is called with specific parameters (`element`, `charge`, `multiplicity`, ...optionals), data is fetched from the MessagePack files and an object of the `Species` class is returned. The actual data is held in an instance of the `SpeciesData` class, which is initialized internally and used by the `Species` class; the latter provides a few other utility functions while acting as a wrapper for the `SpeciesData` object. The `__init__.py` file present under each available dataset folder acts as a script that converts the data from the different dataset formats and standardizes it by creating corresponding `Species` class instances. These instances form the fundamental format of data on which AtomDB works.
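The wrapper pattern described above can be sketched in miniature as follows. This is an illustrative mock, not the actual AtomDB API: the field names, the element table, and the dummy `load()` below are all placeholders standing in for the real MessagePack-backed loader.

```python
# Hypothetical sketch of the SpeciesData/Species wrapper pattern; all names
# and values here are illustrative, not AtomDB's real implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeciesData:
    # Raw fields as they might be deserialized from a MessagePack record.
    elem: str
    charge: int
    mult: int
    energy: Optional[float] = None

class Species:
    """Thin wrapper adding utilities on top of a SpeciesData record."""

    def __init__(self, data: SpeciesData):
        self._data = data

    @property
    def energy(self):
        return self._data.energy

    @property
    def nelec(self):
        # Illustrative utility: electron count = atomic number - charge.
        atnum = {"H": 1, "He": 2, "Be": 4}[self._data.elem]
        return atnum - self._data.charge

def load(elem, charge, mult):
    # Stand-in for fetching a record from atomdb/datasets/<name>/db.
    record = SpeciesData(elem=elem, charge=charge, mult=mult, energy=-14.57)
    return Species(record)

be = load("Be", 0, 1)
print(be.nelec)   # 4
print(be.energy)  # -14.57
```

The point of the split is that `SpeciesData` stays a dumb serialization record while `Species` carries the user-facing behavior.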
Not every property is available in each of the standard datasets under the `atomdb/datasets/` directory. Only certain fields of the `Species` object can be populated (depending on the dataset being used), while the others are set to default values (usually `None`). Hence, as mentioned in this issue (#15), it is important to enable discovery of properties, as well as to gracefully handle attempts at accessing properties that are unavailable for a given dataset. The various properties being targeted have been broadly classified into Scalars and Vectors and have been listed in #4.
The following points outline potential solutions for issues #15 and #16:
[ ] The return values of the `__init__.py` scripts present in every dataset folder give us a list of the properties that are populated in the `Species` class object. The proposed solution under #16 would allow the dataset-related documentation to be kept in the same file. Apart from this, we may also specify the available properties for a given dataset as part of the `DOCSTRING` (in the form of a table of properties), as mentioned in the first discussion under #15.
[ ] To handle cases where a user tries to access a property that is not present or available in the `Species` instance returned for the atomic species in the current dataset, we can override the `__getattr__` method. A proposed solution is to have a `PROPERTIES` list in the `{dataset}/__init__.py` file which, as the name suggests, would consist of the various properties available upon loading the dataset. The `Species` instance can then be initialized with this list. Accessing any property outside of this list would cause an attribute lookup failure, which can be gracefully handled by the overridden `__getattr__` method.
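A minimal sketch of that `__getattr__` proposal might look like the following. The `PROPERTIES` list is hard-coded here for illustration, whereas in the proposal it would come from the dataset's `__init__.py`; the property names are made up.

```python
# Sketch of the proposed __getattr__ override; PROPERTIES and the property
# names are illustrative placeholders, not AtomDB's actual attributes.
class Species:
    PROPERTIES = ["energy", "mass"]  # would come from {dataset}/__init__.py

    def __init__(self, **fields):
        for name in self.PROPERTIES:
            # Unpopulated properties default to None, as described above.
            setattr(self, name, fields.get(name))

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, i.e. for
        # properties the current dataset does not provide at all.
        raise AttributeError(
            f"property {name!r} is not available in this dataset; "
            f"available properties are: {self.PROPERTIES}"
        )

s = Species(energy=-0.5)
print(s.energy)  # -0.5
try:
    s.dipole_moment
except AttributeError as err:
    print(err)
```

Because `__getattr__` (unlike `__getattribute__`) is only called after normal lookup fails, the populated properties are served at full speed and only the error path pays for the custom message.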
The Sphinx documentation also needs to be updated, right? Also, I noticed that the README currently does not include the steps required to build the Sphinx documentation. Would that be a good addition for developers who might want to build the documentation locally?
Rather than having separate variables in the `{dataset}/__init__.py` file, why not create a dictionary called `METADATA` which could hold all such information related to the source dataset? In such an implementation, `DOCSTRING` and `PROPERTIES` become keys of the `METADATA` dictionary, allowing easier addition of more such information as development progresses.
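To make the shape of that idea concrete, here is a hedged sketch of such a `METADATA` dictionary; the keys follow the proposal above, while the values and the `describe()` helper are invented purely for illustration.

```python
# Sketch of a single per-dataset METADATA dictionary; values and the
# describe() helper are illustrative placeholders.
METADATA = {
    "DOCSTRING": "Slater-type orbital atomic dataset.",
    "PROPERTIES": ["energy", "dens_tot"],
    # New keys can be added later without changing any loader signatures.
}

def describe(metadata):
    # Hypothetical helper turning METADATA into a one-line summary.
    props = ", ".join(metadata["PROPERTIES"])
    return f"{metadata['DOCSTRING']} Available properties: {props}."

print(describe(METADATA))
# Slater-type orbital atomic dataset. Available properties: energy, dens_tot.
```

Consumers then depend on a single name per dataset module, which is easier to validate and extend than a growing set of loose module-level variables.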
I like these solutions.
Thanks @msricher ! Have started working on these solutions.
Hi! A quick update on the progress. I will raise a PR soon if this is going in the right direction (currently I am only making changes to the `gaussian` dataset).
[x] The docstring now contains all available properties for that particular dataset (#16). All the property information is stored in a separate file and shared across all datasets for generating the docstrings as well as the Sphinx documentation.
This is what the current docstring looks like when printed.
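The screenshot is not reproduced here, but a table like the one described could be generated along these lines. This is a sketch under assumptions: the property names, descriptions, and the `properties_table()` helper are all hypothetical.

```python
# Hedged sketch of generating a properties table for a docstring; the
# property names and descriptions are made up for illustration.
PROPERTIES = {
    "energy": "Total electronic energy",
    "dens_tot": "Total electron density on a radial grid",
}

def properties_table(props):
    # Pad the name column to the widest property name.
    width = max(len(name) for name in props)
    header = f"{'Property'.ljust(width)}  Description"
    rule = f"{'-' * width}  {'-' * 11}"
    rows = [f"{name.ljust(width)}  {desc}" for name, desc in props.items()]
    return "\n".join([header, rule, *rows])

print(properties_table(PROPERTIES))
```

Keeping the table generation in one shared helper means the printed docstring and the Sphinx pages can never drift apart.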
[x] The properties are grouped into three types: `GENERAL`, `SCALAR`, and `VECTOR`. The `SCALAR` and `VECTOR` types were proposed in #4. Any other property that cannot be categorized under these two is categorized as `GENERAL`. Any property that is not defined in the properties file is categorized as `MISCELLANEOUS`. Would love to hear feedback, if any, so that I can make the necessary corrections. Thanks :smile: and have a great day ahead!
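The category lookup with the MISCELLANEOUS fallback can be sketched as below; the shared properties file is mocked here as a plain dictionary, and the property names are illustrative.

```python
# Sketch of the category lookup; the shared properties file is mocked as a
# dictionary and the entries are illustrative placeholders.
PROPERTY_CATEGORIES = {
    "atnum": "GENERAL",
    "energy": "SCALAR",
    "dens_tot": "VECTOR",
}

def categorize(prop_name):
    # Properties absent from the shared file fall back to MISCELLANEOUS.
    return PROPERTY_CATEGORIES.get(prop_name, "MISCELLANEOUS")

print(categorize("energy"))         # SCALAR
print(categorize("some_new_prop"))  # MISCELLANEOUS
```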
Hey @gabrielasd @msricher, just wanted to ask if I am good to go with the above changes and open a PR for them? Do let me know if you have any feedback. Thanks!
Hi @Ansh-Sarkar, please do submit your changes as a PR; that way we can better look at the code and give you feedback.
We should have a table of which property is available for each dataset, both in the code, and in the published documentation.
In the library, we should also gracefully handle error cases where the user attempts to access a property that is unavailable in the current dataset.