reimandlab / ActiveDriverDB

GNU Lesser General Public License v2.1

Migrate from BerkeleyDB to Lightning DB #157

Closed krassowski closed 5 years ago

krassowski commented 5 years ago

This PR enables the import of mappings for the v2019 version, which uses a newer Python version.

Rationale for the migration: BDB is a good hash key-value database based on a B+ tree, but:

Fortunately, I found out about Lightning DB before wasting more weekends on this; LMDB:

on the negative side:

Benchmark: the new solution turned out to be 18% faster when tested against a VCF file with 5000 ClinVar mutation records (2.2 MB):

However, this result may be influenced by many factors, as there were many more changes between the versions. Importantly, we do not have a performance regression.

Here is an external benchmark comparing LMDB to other solutions (unfortunately, BDB is not included). It shows that LMDB is slow at writing data but very fast at random reads, which is perfect for our use case of retrieving genome-proteome mappings (these are generated only once, upfront).

PS. I was also able to simplify the mappings import and reduce its memory usage (which was necessary to proceed anyway) in this PR; the memory usage is now O(1) with regard to the genome size.
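To illustrate the constant-memory idea, here is a minimal sketch of a streaming import; it is not the code from this PR. The tab-separated line format, the function names `read_mappings` and `import_in_batches`, and the plain `dict` standing in for the on-disk store are all assumptions made for the example.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, MutableMapping, Tuple

Pair = Tuple[bytes, bytes]


def read_mappings(lines: Iterable[str]) -> Iterator[Pair]:
    """Lazily yield (key, value) pairs from tab-separated lines.

    The "genomic key <TAB> protein value" layout is hypothetical.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        yield key.encode(), value.encode()


def import_in_batches(
    pairs: Iterable[Pair],
    store: MutableMapping[bytes, bytes],
    batch_size: int = 1000,
) -> int:
    """Write pairs to the store in fixed-size batches.

    At most `batch_size` pairs are held in memory at a time, so the
    importer's own memory use does not grow with the input size.
    """
    iterator = iter(pairs)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return total
        store.update(batch)  # one bulk write per batch
        total += len(batch)


# Demo with made-up records; a dict stands in for the real database.
source = io.StringIO("chr1:100:A:G\tTP53:R10H\nchr1:200:C:T\tTP53:P20L\n")
store: dict = {}
count = import_in_batches(read_mappings(source), store, batch_size=1)
```

In this demo the `dict` itself still holds every record; with an on-disk key-value store in its place, only the current batch would stay in memory, and each batch could presumably be committed in a single write transaction.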