omerb01 closed this issue 5 years ago.
@omerb01 per my email, I think the current implementation will load all DBs into memory, which might not be optimal. I think it would be better to loop over the DBs one at a time instead of loading them all into mols_list,
something like:
```python
import logging
import pickle
from itertools import repeat

for db in db_list:
    # db is a "bucket/key" string; split it into bucket name and object key
    bucket = db[0:db.index("/")]
    db_path = db[db.index("/") + 1:]
    db_name = db_path.split('/')[-1].split('.')[0]
    logging.debug("{} {} {}".format(bucket, db_path, db_name))

    # fetch and deserialize one DB at a time, so only one is in memory at once
    res = ibm_cos.get_object(Bucket=bucket, Key=db_path)
    mols = pickle.loads(res['Body'].read())
    for modifier in modifiers:
        formulas.update(map(safe_generate_ion_formula, mols, repeat(modifier), repeat(adduct)))
```
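The point of this version is that only one deserialized DB is resident at a time, so peak memory is bounded by the largest single DB rather than the sum of all of them.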
@gilv I made a comparison between the two approaches and found that it's faster to first load all mols databases into mols_list in parallel. If you are worried about memory issues, note that each mols database is roughly 2KB, so we have no issues in terms of memory.
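A minimal sketch of that parallel variant, assuming the same `db_list` and `ibm_cos` client as in the loop above (the `ThreadPoolExecutor` and `load_db` helper here are illustrative, not the actual implementation):

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def load_db(db):
    # same "bucket/key" parsing as in the sequential loop above
    bucket, db_path = db.split("/", 1)
    res = ibm_cos.get_object(Bucket=bucket, Key=db_path)
    return pickle.loads(res['Body'].read())

# download and deserialize all DBs concurrently; all of them stay in memory
with ThreadPoolExecutor() as pool:
    mols_list = list(pool.map(load_db, db_list))
```

This trades peak memory (every DB resident at once) for wall-clock time.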
@omerb01 of course it's faster to load all mols in parallel rather than one by one. But I think the 2KB mols are temporary... in production they may have very large mols databases, so if we load them all into memory at once we will likely run out of memory. @LachlanStuart correct me if I am wrong.
@omerb01 Do you mean that each mols database chunk is 2KB? Because each DB .csv/.pickle file should be ~100-250KB, containing ~10,000 to ~25,000 formulas each.
The databases we've provided are representative of what we use in production, but Theo is still working on getting the significantly bigger database we provided.
The total size of the database after the formula generation in build_databases should be approximately: 29,000 formulas * 84 adducts * 25 modifiers * ~30 bytes per dataframe row = 1.7GB before deduplication. However, I would expect it to drop significantly after deduplication, possibly to around 500MB. Are these close to the sizes you're seeing @omerb01?
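A quick arithmetic check of that estimate, using only the numbers quoted above:

```python
rows = 29_000 * 84 * 25   # formulas * adducts * modifiers
size = rows * 30          # ~30 bytes per dataframe row
print(rows, size / 2**30) # 60,900,000 rows, ~1.70 GiB before deduplication
```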
@LachlanStuart yep, with huge2.json I get ~24M formulas after deduplication and ~93M centroids.
Some general changes include:
- build_database()