omerb01 closed this issue 5 years ago.
@omerb01 per my email, I think the current implementation will load all DBs into memory, which might not be optimal. I think it would be better to loop over the DBs one at a time instead of loading them all into mols_list,
something like:
```python
import logging
import pickle
from itertools import repeat

for db in db_list:
    # db is a "bucket/key" string; split it into bucket name and object key
    bucket = db[0:db.index("/")]
    db_path = db[db.index("/") + 1:]
    db_name = db_path.split('/')[-1].split('.')[0]
    logging.debug("{} {} {}".format(bucket, db_path, db_name))

    # fetch and deserialize one DB at a time, so only one is in memory at once
    res = ibm_cos.get_object(Bucket=bucket, Key=db_path)
    mols = pickle.loads(res['Body'].read())
    for modifier in modifiers:
        formulas.update(map(safe_generate_ion_formula, mols, repeat(modifier), repeat(adduct)))
```
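The point of this version is that only one deserialized DB is resident at a time, so peak memory is bounded by the largest single DB rather than the sum of all of them.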
@gilv I made a comparison between the two approaches and found that it's faster to first load all mols databases into mols_list in parallel. If you are worried about memory issues, note that each mols database is roughly 2KB, so we have no issues in terms of memory.
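A minimal sketch of that parallel variant, assuming the same `db_list` and `ibm_cos` client as in the loop above (the `ThreadPoolExecutor` and `load_db` helper here are illustrative, not the actual implementation):

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def load_db(db):
    # same "bucket/key" parsing as in the sequential loop above
    bucket, db_path = db.split("/", 1)
    res = ibm_cos.get_object(Bucket=bucket, Key=db_path)
    return pickle.loads(res['Body'].read())

# download and deserialize all DBs concurrently; all of them stay in memory
with ThreadPoolExecutor() as pool:
    mols_list = list(pool.map(load_db, db_list))
```

This trades peak memory (every DB resident at once) for wall-clock time.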
@omerb01 of course it's faster to load all mols in parallel rather than one by one. But I think the 2KB mols are temporary... in production they may have very large mols databases, so if we load them all into memory at once we will likely run out of memory. @LachlanStuart correct me if I am wrong.
@omerb01 Do you mean that each mols database chunk is 2KB? Because each DB .csv/.pickle file should be ~100-250KB, containing ~10,000 to ~25,000 formulas each.
The databases we've provided are representative of what we use in production, but Theo is still working on getting the significantly bigger database we provided.
The total size of the database after the formula generation in build_databases should be approximately: 29,000 formulas * 84 adducts * 25 modifiers * ~30 bytes per dataframe row = 1.7GB before deduplication. However, I would expect it to drop significantly after deduplication, possibly to around 500MB. Are these close to the sizes you're seeing @omerb01?
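A quick arithmetic check of that estimate, using only the numbers quoted above:

```python
rows = 29_000 * 84 * 25   # formulas * adducts * modifiers
size = rows * 30          # ~30 bytes per dataframe row
print(rows, size / 2**30) # 60,900,000 rows, ~1.70 GiB before deduplication
```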
@LachlanStuart yep, with huge2.json I get ~24M formulas after deduplication and ~93M centroids.
Some general changes include:
- build_database()