Open gabrielasd opened 5 months ago
Hi @gabrielasd . Would love to work on this issue. Am new to AtomDB and currently exploring the codebase in order to be able to help with some meaningful contributions.
By the way, I was referring to the Getting Started notebook, and the following link (for a list of all the available datasets) seems to be pointing back to the same notebook.
Hi @Ansh-Sarkar, glad to hear you are interested in contributing to our package! And thanks for bringing this to our attention. You are correct, some of the links in that notebook are broken. Since this package is still under development ahead of its release, there may be some incomplete or missing documentation.
For now I can point you to the folder where all datasets will be found once compiled:
https://github.com/theochem/AtomDB/tree/master/atomdb/datasets
From there, the folders `nist` and `slater` contain some compiled atomic data (inside a `db` folder) so that one can explore AtomDB's features.
I'd also suggest you look at the `hello_atomdb` notebook, which is a more recent version of the one you are looking at.
Regarding this issue, I should mention it is also related to #16. Both refer to the same problem of how to document the properties stored from each source of atomic data and make them available to the user. For an idea of the properties we aim to support, you can see issue #4.
Please, don't hesitate to ask if you have any questions or need further assistance.
Thank you @gabrielasd for directing me to the aforementioned resources. They were very helpful! I've gone through the entire codebase and tried to analyze how everything comes together. I'll briefly mention my learnings here before moving on to a few proposed solutions. Please do let me know if there are any flaws in my understanding. Hopefully this also helps new contributors get started with their contributions faster.
All the datasets can be found in this folder: `/atomdb/datasets/`. Under some of the dataset folders (`/gaussian`, `/nist`, `/slater` as of this writing) there is a folder named `db` consisting of certain MessagePack files that act as the source of data.
Whenever the `load()` method is called with specific parameters (`element`, `charge`, `multiplicity`, ...optionals), data is fetched from the MessagePack files and an object of the `Species` class is returned. The actual data is held in an instance of the `SpeciesData` class, which is initialized internally and used by the `Species` class; the latter provides a few other utility functions while acting as a wrapper for the `SpeciesData` object. The `__init__.py` file present under each available dataset folder acts as a script that converts the data from the different dataset formats and standardizes it by creating corresponding `Species` class instances. These instances form the fundamental format of data on which AtomDB works.
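The wrapper pattern described above can be sketched in miniature as follows. This is an illustrative mock, not the actual AtomDB API: the field names, the element table, and the dummy `load()` below are all placeholders standing in for the real MessagePack-backed loader.

```python
# Hypothetical sketch of the SpeciesData/Species wrapper pattern; all names
# and values here are illustrative, not AtomDB's real implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeciesData:
    # Raw fields as they might be deserialized from a MessagePack record.
    elem: str
    charge: int
    mult: int
    energy: Optional[float] = None

class Species:
    """Thin wrapper adding utilities on top of a SpeciesData record."""

    def __init__(self, data: SpeciesData):
        self._data = data

    @property
    def energy(self):
        return self._data.energy

    @property
    def nelec(self):
        # Illustrative utility: electron count = atomic number - charge.
        atnum = {"H": 1, "He": 2, "Be": 4}[self._data.elem]
        return atnum - self._data.charge

def load(elem, charge, mult):
    # Stand-in for fetching a record from atomdb/datasets/<name>/db.
    record = SpeciesData(elem=elem, charge=charge, mult=mult, energy=-14.57)
    return Species(record)

be = load("Be", 0, 1)
print(be.nelec)   # 4
print(be.energy)  # -14.57
```

The point of the split is that `SpeciesData` stays a dumb serialization record while `Species` carries the user-facing behavior.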
Not every property is available in each of the standard datasets under the `atomdb/datasets/` directory. Only certain fields of the `Species` object can be populated (depending on the dataset being used), while the others are set to default values (usually `None`). Hence, as mentioned in this issue (#15), it is important to enable discovery of properties, as well as to gracefully handle attempts at accessing properties that are unavailable for a given dataset. The various properties being targeted have been broadly classified into Scalars and Vectors and have been listed in #4.
The following points outline potential solutions for issues #15 and #16:
[ ] The return values of the `__init__.py` scripts present in every dataset folder give us a list of the properties that are populated in the `Species` class object. The proposed solution under #16 would allow the dataset-related documentation to be kept in the same file. Apart from this, we may also specify the available properties for a given dataset as part of the `DOCSTRING` (in the form of a table of properties), as mentioned in the first discussion under #15.
[ ] To handle cases where a user tries to access a property that is not present or available in the `Species` instance returned for the atomic species in the current dataset, we can override the `__getattr__` method. A proposed solution is to have a `PROPERTIES` list in the `{dataset}/__init__.py` file which, as the name suggests, would consist of the various properties available upon loading the dataset. The `Species` instance can then be initialized with this list. Accessing any property outside of this list would cause an attribute lookup failure, which can be gracefully handled by the overridden `__getattr__` method.
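A minimal sketch of that `__getattr__` proposal might look like the following. The `PROPERTIES` list is hard-coded here for illustration, whereas in the proposal it would come from the dataset's `__init__.py`; the property names are made up.

```python
# Sketch of the proposed __getattr__ override; PROPERTIES and the property
# names are illustrative placeholders, not AtomDB's actual attributes.
class Species:
    PROPERTIES = ["energy", "mass"]  # would come from {dataset}/__init__.py

    def __init__(self, **fields):
        for name in self.PROPERTIES:
            # Unpopulated properties default to None, as described above.
            setattr(self, name, fields.get(name))

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, i.e. for
        # properties the current dataset does not provide at all.
        raise AttributeError(
            f"property {name!r} is not available in this dataset; "
            f"available properties are: {self.PROPERTIES}"
        )

s = Species(energy=-0.5)
print(s.energy)  # -0.5
try:
    s.dipole_moment
except AttributeError as err:
    print(err)
```

Because `__getattr__` (unlike `__getattribute__`) is only called after normal lookup fails, the populated properties are served at full speed and only the error path pays for the custom message.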
The Sphinx documentation also needs to be updated, right? Also, I noticed that the README currently does not include the steps required to build the Sphinx documentation. Would that be a good addition for developers who might want to build the documentation locally?
Rather than having separate variables in the `{dataset}/__init__.py` file, why not create a dictionary called `METADATA` which could hold all such information related to the source dataset? In such an implementation, `DOCSTRING` and `PROPERTIES` become keys of the `METADATA` dictionary, allowing easier addition of more such information as development progresses.
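To make the shape of that idea concrete, here is a hedged sketch of such a `METADATA` dictionary; the keys follow the proposal above, while the values and the `describe()` helper are invented purely for illustration.

```python
# Sketch of a single per-dataset METADATA dictionary; values and the
# describe() helper are illustrative placeholders.
METADATA = {
    "DOCSTRING": "Slater-type orbital atomic dataset.",
    "PROPERTIES": ["energy", "dens_tot"],
    # New keys can be added later without changing any loader signatures.
}

def describe(metadata):
    # Hypothetical helper turning METADATA into a one-line summary.
    props = ", ".join(metadata["PROPERTIES"])
    return f"{metadata['DOCSTRING']} Available properties: {props}."

print(describe(METADATA))
# Slater-type orbital atomic dataset. Available properties: energy, dens_tot.
```

Consumers then depend on a single name per dataset module, which is easier to validate and extend than a growing set of loose module-level variables.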
I like these solutions.
Thanks @msricher ! Have started working on these solutions.
Hi! A quick update on the progress. I will raise a PR soon if this is going in the right direction (currently I am only making changes to the `gaussian` dataset).
[x] The docstring now contains all available properties for that particular dataset (#16). All the property information is stored in a separate file and shared across all datasets for generating the docstrings as well as the Sphinx documentation.
This is what the current docstring looks like when printed.
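The screenshot is not reproduced here, but a table like the one described could be generated along these lines. This is a sketch under assumptions: the property names, descriptions, and the `properties_table()` helper are all hypothetical.

```python
# Hedged sketch of generating a properties table for a docstring; the
# property names and descriptions are made up for illustration.
PROPERTIES = {
    "energy": "Total electronic energy",
    "dens_tot": "Total electron density on a radial grid",
}

def properties_table(props):
    # Pad the name column to the widest property name.
    width = max(len(name) for name in props)
    header = f"{'Property'.ljust(width)}  Description"
    rule = f"{'-' * width}  {'-' * 11}"
    rows = [f"{name.ljust(width)}  {desc}" for name, desc in props.items()]
    return "\n".join([header, rule, *rows])

print(properties_table(PROPERTIES))
```

Keeping the table generation in one shared helper means the printed docstring and the Sphinx pages can never drift apart.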
[x] The properties are grouped into three types: `GENERAL`, `SCALAR`, and `VECTOR`. The `SCALAR` and `VECTOR` types were proposed in #4. Any other property that cannot be categorized under these two is categorized as `GENERAL`. Any property that is not defined in the properties file is categorized as `MISCELLANEOUS`. Would love to hear feedback, if any, so that I can make the necessary corrections. Thanks :smile: and have a great day ahead!
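The category lookup with the MISCELLANEOUS fallback can be sketched as below; the shared properties file is mocked here as a plain dictionary, and the property names are illustrative.

```python
# Sketch of the category lookup; the shared properties file is mocked as a
# dictionary and the entries are illustrative placeholders.
PROPERTY_CATEGORIES = {
    "atnum": "GENERAL",
    "energy": "SCALAR",
    "dens_tot": "VECTOR",
}

def categorize(prop_name):
    # Properties absent from the shared file fall back to MISCELLANEOUS.
    return PROPERTY_CATEGORIES.get(prop_name, "MISCELLANEOUS")

print(categorize("energy"))         # SCALAR
print(categorize("some_new_prop"))  # MISCELLANEOUS
```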
Hey @gabrielasd @msricher, just wanted to ask if I am good to go with the above changes and open a PR for them? Do let me know if you have any feedback. Thanks!
Hi @Ansh-Sarkar, please do submit your changes as a PR; that way we can better look at the code and give you feedback.
We should have a table of which property is available for each dataset, both in the code, and in the published documentation.
In the library, we should also gracefully handle error cases where the user attempts to access a property that is unavailable in the current dataset.