GSoC 2024: Refactor database structure

msricher commented 9 months ago

Description

Update the AtomDB API to use a better (de)serialization method based on a Python database library, such as ZODB.

:books: Package Description and Impact

AtomDB is a database of chemical and physical properties for atomic and ionic species. It includes a Python library for submitting computations to generate database entries, accessing entries, and interpolating their properties at points in space. AtomDB currently uses MsgPack for (de)serializing database entries (instances of dataclasses), but the deserialization is slow, complicated, and uses poor Python practices. This project will involve updating the AtomDB API to use a better (de)serialization method based on a proper database library, such as ZODB, which has seamless interoperability with Python classes and objects. This is a key milestone on the AtomDB release schedule.

:construction_worker: What will you do?

You will update the AtomDB API to replace the MsgPack-based (de)serialization functions for database entry files with the ZODB database library. You will port the atomic/ionic species class to be a standalone class (instead of dataclass + wrapper), which will provide transparent (de)serialization with ZODB. Finally, you will port the existing AtomDB entry files to the new database, and modify the build files (pyproject.toml) so that the new database is included with user installations of AtomDB.

:checkered_flag: Expected Outcomes

(De)serialization works transparently with instances of Species, and is done to and from a ZODB database.
Species is made a standalone class (not a dataclass), by subclassing the ZODB persistent object base class.
Old MsgPack database entries are ported to the new database.
Build files are updated to reflect the change in database files shipped with AtomDB.
The AtomDB API is tested after the previous changes are made.


Required skills	Python, OOP, Linux
Preferred skills	Database experience
Project size	175 hours, Medium
Difficulty	Medium

:raising_hand: Mentors


Michelle Richer	richer.m_at_queensu_dot_ca	@msricher
Gabriela Sánchez-Díaz	sanchezg_at_mcmaster_dot_ca	@gabrielasd
Farnaz Heidar-Zadeh	farnaz.heidarzadeh_at_queensu_dot_ca	@FarnazH

9401adarsh commented 8 months ago

Hello there, this is Adarsh. I am a software dev with less than one year of experience and a beginner looking to get into open source. I am fairly well-versed in Python and OOP concepts, although I have to brush up on my Linux knowledge. However, I think this problem is right up my alley.

Is there any documentation, or are there any videos, that give an overview of the codebase? I would also appreciate it if you could suggest some warm-up tasks to get me started.

Aditish51 commented 8 months ago

Hello, this is Aditi, a 3rd-year CSE undergrad. I am skilled in Python and interested in contributing to this project.

msricher commented 8 months ago

To applicants:

We just found out we are approved for 2024, so we will probably get 1-3 students. (We do not know how many students we will get until later.) This is the first time we've applied as an independent organization.

We will need you to write a proposal when the application period begins (March 18th), based on the template on our website. Until then, applicants often start working, either on this project or on a "good first issue" in another repo, to become familiar with the QC-Devs ecosystem. (That's always helpful, too, as the internships can be quite competitive.)

Right now, documentation is sparse (in the docs/ folder), but you can see the Jupyter notebooks here for examples of how the code is used.

I tried writing a small example of how I would like the database code to be implemented, using ZODB, although I am open to other options (suggestions are welcome, I'm not a database expert, I just want to be able to have databases which can be kept in a correct state and be version-controlled). I am attaching these files to a Gist here. They contain my notes for how it would be integrated into the existing code. I will work together with the successful applicants to make sure we arrive at a good solution.

I hope this information helps! Let me know if you have any more questions.

Best, Michelle

msricher commented 8 months ago

Other database options would include jsonpickle, and there are other simpler options. The main goal is just to replace the old (de)serialization to/from msgpack files with a more structured database that is a single file, which can be loaded and then queried, iterated through, etc.

pysondb, tinydb, pickledb

9401adarsh commented 8 months ago

Thanks for the input, Michelle. Will start looking into the resources you have shared here. Will reach out in the same thread, in case I hit any snags.

harshnayangithub commented 8 months ago

Hello, I'm Harsh, a second-year undergraduate student pursuing a B.E. in Computer Science. I am a full-stack developer with a strong grasp of Python and OOP concepts, and I am highly experienced with the Linux operating system. I'm looking forward to contributing to this project.

Aditish51 commented 8 months ago

@gabrielasd I am facing problems during the installation of AtomDB.

msricher commented 8 months ago

@harshnayangithub Thank you for your interest. See my previous comments starting here.

@Aditish51 see this discussion here.

msricher commented 7 months ago

Hi all. I got time to play with database libraries in Python, and I think I found the best solution for our project.

TinyDB is a very simple database library that has all the saving/loading/querying features we need. PyYAML allows me to make a backend to TinyDB that saves things in (gzip-compressed) YAML format. This is best because it is human readable, and it supports binary blocks, allowing us to store NumPy arrays.

This lets me do something like:

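import base64
import gzip
import io

import numpy
import yaml

from tinydb import TinyDB, Storage, where
from tinydb_serialization import Serializer, SerializationMiddleware


class YAMLStorage(Storage):
    """TinyDB storage backend that keeps the database as gzip-compressed YAML."""

    def __init__(self, filename):
        self.filename = filename

    def read(self):
        # TinyDB expects None when there is no data yet,
        # e.g. on first use (no file) or if the file cannot be parsed.
        try:
            with gzip.open(self.filename, "rt") as handle:
                return yaml.safe_load(handle.read())
        except (FileNotFoundError, yaml.YAMLError):
            return None

    def write(self, data):
        with gzip.open(self.filename, "wt") as handle:
            yaml.dump(data, handle)

    def close(self):
        pass


class NDArraySerializer(Serializer):
    """Store NumPy arrays as base64 text in .npy format, which keeps dtype and shape."""

    OBJ_CLASS = numpy.ndarray

    def encode(self, obj):
        # Raw bytes alone would lose the dtype and shape, so write the full
        # .npy representation and return it as a plain ASCII string.
        buf = io.BytesIO()
        numpy.save(buf, obj, allow_pickle=False)
        return base64.b64encode(buf.getvalue()).decode("ascii")

    def decode(self, s):
        return numpy.load(io.BytesIO(base64.b64decode(s)), allow_pickle=False)


SERIALIZATION = SerializationMiddleware(YAMLStorage)
SERIALIZATION.register_serializer(NDArraySerializer(), "TinyNDArray")

DB = TinyDB("db.gz", storage=SERIALIZATION)

if __name__ == "__main__":

    DB.insert({"a": [1, 2, 3], "b": numpy.ones((3, 3)), "c": 2, "d": 3.14159})

    # Query by field value and get the stored array back.
    arr = DB.search(where("c") == 2)[0]["b"]
    print(arr)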

I think this beats out ZODB in terms of writability, readability, and speed.

I'm still open to more ideas, but I'd be happy to have someone implement a better database into AtomDB based on this. Let me know what you think.

Aditish51 commented 7 months ago

Hello @msricher, thanks for sharing your perspective. The idea of using TinyDB in AtomDB for storing, reading, etc. is great, as TinyDB is a lightweight, file-based database written in Python. It offers a Pythonic API, which can be advantageous as we are already working within a Python environment. TinyDB is designed to support different storage backends, allowing developers to use alternative formats or databases for data storage, e.g. YAML, JSON, etc.

But in my opinion ZODB is great too, and in fact better than small, lightweight databases when it comes to scalability, ACID properties, transactions, etc. ZODB is an object-oriented database that stores Python objects directly, so if our dataset grows to a large scale in the future, ZODB might handle it easily. I apologize if I missed some point or was incorrect anywhere. I'm open to further discussion and to hearing more of your thoughts on this matter. I am fine with both options; kindly advise me further on this. Ref: https://www.opensourceforu.com/2017/05/three-python-databases-pickledb-tinydb-zodb/ https://zodb.org/en/latest/guide/transactions-and-threading.html

msricher commented 7 months ago

Thank you for the feedback.

Ultimately it's a matter of opinion. I suggested TinyDB here because I can store the data as compressed YAML (still human-readable, vs. ZODB which I think is not human-readable and would be harder to export to different formats, for example in the future, as I think YAML will outlive ZODB). It's also much easier for programmers with less DB experience (like myself, I'm not entirely comfortable with ZODB).

On the other hand, you're right, the ACID guarantees of ZODB are really important! I would say that if the successful applicant wants to build (and is capable of building) the database structure in ZODB (or another ACID object DB), then they should do so. They can then just put in a function to export the DB to YAML.

We will discuss this when I go over applications, conduct interviews, etc.

Aditish51 commented 7 months ago

Hello @msricher, before submitting the proposal officially, can I submit it for feedback? It would be valuable and helpful to me.

msricher commented 7 months ago

You can just submit it, I'll read all of them at the same time, it seems most fair that way. Just follow the guidelines.

Aditish51 commented 7 months ago

@msricher Thanks for your response. I sincerely apologize. I will follow the guidelines strictly.

Ansh-Sarkar commented 7 months ago

Had a small input regarding the choice of database.

As mentioned above it is definitely important for a database to conform to the ACID properties; something that ZODB clearly does. ZODB also has web based integrations available that allow us to easily check stored objects. It also seems to have a higher number of contributors and therefore might just be the ideal choice.

On the other hand, TinyDB, though lacking in some really important aspects, does provide a higher degree of freedom and modifiability where required by the project. TinyDB also has a ton of extensions targeting specific shortcomings of the original database (whether an extension exists that takes care of the ACID properties needs further research). It also has a larger active community and I 100% agree that YAML is a much easier format to work with for humans.

It might just be possible to develop our own extension of TinyDB that is suited for scientific applications (which could also be helpful for other open source projects), but that is up to the selected candidate and the scope that is defined for the project under the program.
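
As a rough illustration of what such an extension might need for the atomicity part of ACID, a single-file store can write to a temporary file and then rename it over the old one, so a crash mid-write never leaves a half-written database. This is only a standard-library sketch under assumptions: gzip-compressed JSON stands in for the compressed YAML discussed above, and `atomic_write` and `load` are hypothetical helper names, not part of TinyDB or ZODB. It does not address isolation between concurrent writers.

```python
import gzip
import json
import os
import tempfile


def atomic_write(path, data):
    """Write `data` to `path` as gzip-compressed JSON, replacing the file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target
    # so that os.replace() is an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw:
            with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                gz.write(json.dumps(data).encode("utf-8"))
            # Push the bytes to disk before the rename makes them visible.
            raw.flush()
            os.fsync(raw.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the previous file untouched and clean up the partial one.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load(path):
    """Read back a file written by atomic_write()."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

If serialization fails partway (for example, because an entry is not JSON-serializable), the exception propagates and the previously written file is still readable with `load`. A custom TinyDB storage class could, in principle, call a helper like this from its `write` method.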

Aditish51 commented 6 months ago

Hello @msricher, @FarnazH @gabrielasd ,

It was sad to learn that the project itself was probably not accepted by Google. I learnt a lot while preparing the proposal and understood the demands of the project well. If you allow me, I would still like to work on this project. It would be a privilege for me. I request that you please consider me.

Thank You

PaulWAyers commented 6 months ago

Hi @Aditish51 , the project was one of the 7 accepted as "fundable" by Google but we only got 2 of the 7 projects funded in the end. (The projects have to be rank-ordered, and within each project the applicants have to be rank-ordered, and in the end Google funded 2 of the projects, which was already above normal for a 1st-time organization.)

We always welcome contributions to QC-Devs, so you are always welcome to contribute. @msricher may have some time to mentor you; @gabrielasd is finalizing her Ph.D. thesis and will be pretty busy the next couple months I think.

I'd encourage you to make contributions using, to the greatest extent possible, the standard QC-Devs workflow/contributing guidelines. I.e., you may want to make issues (or sub-issues) for the tasks you would like to work on, then use those as an organizing principle for your work.

msricher commented 6 months ago

@Aditish51 Yes, as @PaulWAyers says this project was not accepted by Google for funding. If you would still like to contribute, we could always use your help. Right now, we're in the process of submitting a journal article about AtomDB, and then I may be able to guide you through making some of the contributions that were discussed here as part of the GSoC applications. Thank you for your interest in the project!!

Aditish51 commented 5 months ago

Hello @PaulWAyers, @msricher, thank you for considering my request and giving me the opportunity to complete this project. Looking forward to your guidance and further steps.

Aditish51 commented 3 months ago

Hello @msricher @PaulWAyers @gabrielasd, if this project is still open, I would like to work on it.

PaulWAyers commented 3 months ago

@msricher is on vacation this week, but if she has capacity starting next week, certainly it would be good to have you contribute.

Aditish51 commented 3 months ago

Okay @PaulWAyers thank you. Till then I will try to look into some issues.
