metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

Database creation without Spark/PyWren #73

LachlanStuart closed 4 years ago

LachlanStuart commented 4 years ago

build_database_local is a drop-in replacement for build_database, except that it doesn't use Spark or PyWren.

With input_config_big it takes 90s on my 4-core machine, approximately the same time as build_database. With larger database configs (e.g. 10 databases, 4 adducts, 4 modifiers) it is noticeably slower: 13 minutes with build_database_local vs 7 minutes with build_database.

However, the big benefit is that it has only half as much code. The data is kept in a single unsegmented dataframe, which makes it much easier to modify.
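
To illustrate why a single dataframe is easier to work with, here is a minimal sketch, not the code in this PR. It assumes pandas is available; build_ion_table, the column names, and the string-concatenated "ion" column are all hypothetical stand-ins for the real formula generation.

```python
import itertools

import pandas as pd


def build_ion_table(formulas, modifiers, adducts):
    """Return one DataFrame with a row per (formula, modifier, adduct) combination.

    Hypothetical helper: the names and columns are illustrative only.
    """
    rows = list(itertools.product(formulas, modifiers, adducts))
    df = pd.DataFrame(rows, columns=["formula", "modifier", "adduct"])
    # Plain string concatenation stands in for real ion-formula generation.
    df["ion"] = df["formula"] + df["modifier"] + df["adduct"]
    # With everything in one frame, deduplication is a single call rather
    # than a merge step across segments.
    return df.drop_duplicates("ion").reset_index(drop=True)


ions = build_ion_table(["C6H12O6", "C2H6O"], ["", "-H2O"], ["+H", "+Na"])
print(ions)
```

Adding a new column or filter here is a one-line change on the frame, whereas a segmented layout would need the same change applied per chunk.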

When comparing the results to the old output, I found that store_formula_to_id_chunk had a bug that caused it to skip the last formulas_chunk. This PR includes a fix for that bug.
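
The description above does not say how the last chunk came to be skipped. As a purely illustrative sketch of one common way this class of bug arises (an assumption, not the actual cause in store_formula_to_id_chunk), consider chunk boundaries computed with floor division. Both function names below are hypothetical.

```python
def chunk_bounds_buggy(n_items, chunk_size):
    """Floor division silently drops a final partial chunk."""
    n_chunks = n_items // chunk_size
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_items))
        for i in range(n_chunks)
    ]


def chunk_bounds_fixed(n_items, chunk_size):
    """Ceiling division keeps the final partial chunk."""
    n_chunks = -(-n_items // chunk_size)  # integer ceiling of n_items / chunk_size
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_items))
        for i in range(n_chunks)
    ]


print(chunk_bounds_buggy(10, 4))
print(chunk_bounds_fixed(10, 4))
```

Comparing the total number of rows covered by the chunks against the input size is a cheap check that catches this kind of omission.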