This PR enables import of mappings for the v2019 version, which uses a newer Python version.
Rationale for the migration: BDB is a good key-value database based on a B+ tree, but:
in 2013 Oracle changed the licensing of BDB, which caused it to be excluded from almost all major Linux distributions; it was also dropped from the Python standard library in version 3.0 (previously there was built-in support for it)
while the Python bindings were maintained in a community fork for years (we were using these for a long time), it got increasingly difficult to compile and install BDB for newer versions of Python and on newer Linux distributions; last time it took me three weeks to figure out how to resolve all the issues
creating large databases with limited server resources was not that easy with BDB in the first place
I was not able to quickly (< 2 days) understand why BDB did not compile properly this time
Fortunately, I found out about Lightning DB (LMDB) before wasting more weekends on this; LMDB:
has already been widely adopted as a drop-in substitute for BDB
is also based on a B+ tree
has certain features making it more appealing for our use case (minimalistic, memory-mapped, etc.)
was able to import all the mappings in < 6 hours (BDB: ~ 24 hours)
on the negative side:
it uses a bit more disk space (4 GB + 6 GB, compared to 2 GB + 4 GB for BDB)
the LMDB Python bindings do not seem to be well supported right now
Benchmark: the new solution turned out to be 18% faster when tested against a VCF file with 5000 ClinVar mutation records (2.2 MB):
v2017 with BerkeleyDB: 127s
v2019 with Lightning DB: 104s
However, this may be influenced by many factors, as there were many more changes between the versions. Importantly, we do not have a performance regression.
Here is an external benchmark comparing LMDB to other solutions (unfortunately no BDB); it shows that LMDB is slow at writing data but very fast at random reads, which is perfect for our use case of retrieving genome-proteome mappings (which are generated only once, upfront).
PS. I was also able to simplify the mappings imports and reduce memory usage (which was necessary to proceed anyway) in this PR; the memory usage is now O(1) with respect to the genome size.
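For reference, the 18% figure follows directly from the two timings quoted above. The short calculation below only restates that arithmetic; the variable names are made up for illustration.

```python
# Wall-clock timings quoted in the benchmark above (seconds).
bdb_seconds = 127   # v2017 with BerkeleyDB
lmdb_seconds = 104  # v2019 with Lightning DB

# Reduction in run time, as a percentage of the old run time.
reduction_percent = (bdb_seconds - lmdb_seconds) / bdb_seconds * 100

# The same comparison expressed as a ratio of old to new run time.
speedup = bdb_seconds / lmdb_seconds

print(f"run time reduced by {reduction_percent:.1f}%")
print(f"old/new run time ratio: {speedup:.2f}")
```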
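One common way to keep import memory constant is to stream records through a generator and write them in fixed-size batches, so that only one batch is held in memory at a time. The sketch below illustrates that idea only and is not the code from this PR: the tab-separated input format, the function names, the key format, and the batch size are all hypothetical, and a plain `dict` stands in for the on-disk store (with LMDB, each batch would presumably go into one write transaction).

```python
from itertools import islice
from typing import Iterable, Iterator, List, MutableMapping, Tuple

Record = Tuple[bytes, bytes]


def parse_mappings(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (key, value) pairs one at a time instead of building a list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Hypothetical format: "<genomic key>\t<protein mapping>"
        key, value = line.split("\t", 1)
        yield key.encode(), value.encode()


def batches(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group a stream of records into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_mappings(
    lines: Iterable[str],
    store: MutableMapping[bytes, bytes],
    batch_size: int = 10_000,
) -> int:
    """Write all records to `store`, holding at most one batch in memory."""
    written = 0
    for batch in batches(parse_mappings(lines), batch_size):
        # Assumption: with an on-disk store, this loop would run inside
        # a single write transaction per batch.
        for key, value in batch:
            store[key] = value
        written += len(batch)
    return written
```

Because `lines` can be an open file object, the input is never fully loaded; peak memory in the importer depends on `batch_size`, not on the size of the genome. The `dict` used here as a stand-in does grow with the data, which an on-disk store would avoid.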