Open adalke opened 3 years ago
I developed a prototype using SQLAlchemy to generate the new fragment database instead of using hand-built SQL queries
My test set was 100 SMILES. With low-level SQL queries and 4 processors it takes ~28.6 seconds. Going through SQLAlchemy's ORM took ~37.5 seconds -- 30% slower.
Profiling shows that most of the additional time is in session.commit()
. However, even if I comment out the session.add()
code (which tells SQLAlchemy to add the new objects to the database when appropriate), the overall time was still slower than the hand-written code.
Further profiling shows about 15% of the additional time was in _initialize_instance
. This is part of SQLAlchemy's ORM. I replaced mmpdb's existing fragment types with slightly modified versions that inherit from registry.generate_base() base class. This has its own __init__
, with a higher overhead than the basic Python-class-with-slot-definition I used.
There were some improvements to the API. I got to strip out some boilerplate code to return a FragmentRecord from the cache file. On the other hand, I had to lean how to "detach" objects to move them from one database to the other, and I had to learn the special query syntax.
When we get to things like merging multiple datasets into one, I think it would be easier to work directly to the SQLite connection object, attach databases, in the instance, and do cross-data INSERT with SELECT. I don't want to have to figure it out in SQLAlchemy.
Long-term, I think there's still a place for SQLAlchemy, as with kzfm's example.
The most likely is to define the database schema(s) in a better way than my shaky template system, and use it to manage schema creation, and to replace the vendored peewee code we use now, and use the more low-level SQLAlchemy calls to write the tables, rather than SQLAlchemy's ORM.
In 2019 kzfm gave an example of using SQLAlchemy to work with the database.
In my work to replace the JSON-Lines fragment format with a SQLite-based format I quickly discovered that I am hand-writing an ORM. Poorly.
I propose switching using SQLAlchemy in this updated version I am working on.
At this point I don't know how much work that requires. kzfm showed that defining the structure and querying the generated mmpdb database was simple.