rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.
Other
197 stars 55 forks source link

Use SQLAlchemy to work with the database #38

Open adalke opened 2 years ago

adalke commented 2 years ago

In 2019 kzfm gave an example of using SQLAlchemy to work with the database.

In my work to replace the JSON-Lines fragment format with a SQLite-based format I quickly discovered that I am hand-writing an ORM. Poorly.

I propose switching using SQLAlchemy in this updated version I am working on.

At this point I don't know how much work that requires. kzfm showed that defining the structure and querying the generated mmpdb database was simple.

adalke commented 2 years ago

I developed a prototype using SQLAlchemy to generate the new fragment database instead of using hand-built SQL queries

My test set was 100 SMILES. With low-level SQL queries and 4 processors it takes ~28.6 seconds. Going through SQLAlchemy's ORM took ~37.5 seconds -- 30% slower.

Profiling shows that most of the additional time is in session.commit(). However, even if I comment out the session.add() code (which tells SQLAlchemy to add the new objects to the database when appropriate), the overall time was still slower than the hand-written code.

Further profiling shows about 15% of the additional time was in _initialize_instance. This is part of SQLAlchemy's ORM. I replaced mmpdb's existing fragment types with slightly modified versions that inherit from registry.generate_base() base class. This has its own __init__, with a higher overhead than the basic Python-class-with-slot-definition I used.

There were some improvements to the API. I got to strip out some boilerplate code to return a FragmentRecord from the cache file. On the other hand, I had to lean how to "detach" objects to move them from one database to the other, and I had to learn the special query syntax.

When we get to things like merging multiple datasets into one, I think it would be easier to work directly to the SQLite connection object, attach databases, in the instance, and do cross-data INSERT with SELECT. I don't want to have to figure it out in SQLAlchemy.

Long-term, I think there's still a place for SQLAlchemy, as with kzfm's example.

The most likely is to define the database schema(s) in a better way than my shaky template system, and use it to manage schema creation, and to replace the vendored peewee code we use now, and use the more low-level SQLAlchemy calls to write the tables, rather than SQLAlchemy's ORM.