rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases
https://rformassspectrometry.github.io/CompoundDb/index.html

Support multiple SDF files #16

Open stanstrup opened 7 years ago

stanstrup commented 7 years ago

For example, for PubChem.

Multithreading with pbapply would be nice.

See also https://github.com/EuracBiomedicalResearch/CompoundDb/issues/1#issuecomment-340341955

stanstrup commented 7 years ago

If compound_tbl_sdf were internal to createCompDb (so you'd always call createCompDb directly), you could append to the SQLite file one input at a time instead of holding everything in memory, which would avoid the memory requirements. This is what I did in my approach for PubChem.

jorainer commented 7 years ago

Note: createCompDb already supports generating a CompDb from multiple input files. The man page also states that you can provide the name(s) of the file(s). I will make this clearer in the help page. So far I have used lapply to process multiple files; I'll switch to bplapply.

jorainer commented 7 years ago

OK, I have extended the documentation a little. I've also tried to enable parallel processing, but that's not possible because SQLite/RSQLite does not support concurrent write operations. I've also tried: https://stackoverflow.com/questions/36831302/parallel-query-of-sqlite-database-in-r and https://www.r-bloggers.com/synchronization-for-r-with-the-flock-package/ but that didn't help either. So, presently it's not possible.

stanstrup commented 7 years ago

Ah yes, I tried the exact same things. That's why I ended up creating a separate SQLite database for each SDF file and then building the final SQLite database from those once the parallel runs had finished.
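
A rough sketch of that workaround, for illustration only. It is not CompoundDb code and not the script used for PubChem. It is written in Python with the standard sqlite3 module just to show the pattern: each worker writes to its own SQLite file (so no file ever has two writers), and a single connection then merges the parts serially via ATTACH DATABASE. The table layout, the function names (write_part, merge_parts, build_db) and the rows standing in for parsed SDF records are all assumptions made up for this example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical table layout; a real compound database has more columns.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS compound "
    "(compound_id TEXT, name TEXT, formula TEXT)"
)


def write_part(job):
    """Write one batch of rows to its own SQLite file (one writer per file)."""
    part_path, rows = job
    con = sqlite3.connect(part_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(SCHEMA)
            con.executemany("INSERT INTO compound VALUES (?, ?, ?)", rows)
    finally:
        con.close()
    return part_path


def merge_parts(final_path, part_paths):
    """Serially copy every per-file database into the final database."""
    # isolation_level=None: autocommit, so ATTACH/DETACH never run
    # inside an open transaction (SQLite rejects that).
    con = sqlite3.connect(final_path, isolation_level=None)
    try:
        con.execute(SCHEMA)
        for part in part_paths:
            con.execute("ATTACH DATABASE ? AS src", (str(part),))
            con.execute("INSERT INTO compound SELECT * FROM src.compound")
            con.execute("DETACH DATABASE src")
    finally:
        con.close()


def build_db(final_path, batches, workdir, max_workers=4):
    """Write each batch concurrently to its own file, then merge."""
    workdir = Path(workdir)
    jobs = [
        (str(workdir / f"part_{i}.sqlite"), rows)
        for i, rows in enumerate(batches)
    ]
    # Threads keep the sketch simple; for CPU-bound parsing a process
    # pool could be swapped in.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(write_part, jobs))
    merge_parts(final_path, parts)
    return parts
```

In this sketch, each element of batches stands in for the records parsed from one SDF file. Only the merge step touches the final database, so the single-writer limitation mentioned above is never hit.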