rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.

Memory error with mmpdb fragment for large dataset #27

chengthefang closed this issue 3 years ago

chengthefang commented 3 years ago

Hi all,

I am trying to build an MMP database from 10M compounds, but I get an error at the first step, fragmentation.

The command I used is as follows: python mmpdb fragment first10M.smi --num-jobs 8 -o first10M.fragments.gz

The error I got is:

Traceback (most recent call last):
  File "/home/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/anaconda2/lib/python2.7/multiprocessing/pool.py", line 328, in _handle_workers
    pool._maintain_pool()
  File "/home/anaconda2/lib/python2.7/multiprocessing/pool.py", line 232, in _maintain_pool
    self._repopulate_pool()
  File "/home/anaconda2/lib/python2.7/multiprocessing/pool.py", line 225, in _repopulate_pool
    w.start()
  File "/home/anaconda2/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/home/anaconda2/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Does anybody have comments or suggestions on that? Also, can I run the command on distributed nodes on the cluster?

PS: I have a similar concern about the second step, indexing, since it usually takes longer and needs more memory than fragmentation. Can I run the indexing command in parallel or distribute it across the cluster?

Thanks, Cheng

KramerChristian commented 3 years ago

Dear Cheng,

I suppose that the error you observe comes from running out of memory. 10M compounds is a very large dataset for mmpdb. Could you try to run the mmpdb fragmentation with just the first 10K compounds and check whether you get the same error?
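For example, a minimal way to make such a 10K-compound test set (a sketch only; the file names are placeholders, and the fragment options are copied from the command above):

```sh
# Take the first 10K SMILES as a test set and fragment only those.
head -n 10000 first10M.smi > first10K.smi
python mmpdb fragment first10K.smi --num-jobs 8 -o first10K.fragments.gz
```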

Overall, mmpdb is not made for such large datasets. The largest set for which I successfully created a DB using standard mmpdb was roughly in the range of 1M compounds (on a machine with 256 GB of RAM). However, at that DB size, queries within the DB take quite a long time.

If you really want to fragment 10M compounds, you could just cut your input .smi file into smaller chunks, distribute the fragmentation for each of the chunks on a cluster, and cat the results back together (after removing the header). Indexing in the current version of mmpdb is not parallelized, and your workflow for 10M compounds will likely fail here due to running out of memory.
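A minimal sketch of that chunk-and-distribute idea (the chunk size, file names, and use of GNU split are illustrative assumptions, and the number of header lines in a .fragments file depends on the mmpdb version, so check your output before concatenating):

```sh
# 1. Split the input SMILES file into 1M-compound chunks (assumes GNU split).
split -l 1000000 --additional-suffix=.smi first10M.smi chunk_

# 2. Fragment each chunk; in practice, submit one of these per cluster node.
#    Writing uncompressed .fragments files keeps the later concatenation simple.
for f in chunk_*.smi; do
    python mmpdb fragment "$f" --num-jobs 8 -o "${f%.smi}.fragments"
done

# 3. Cat the per-chunk results back together, keeping the header from the
#    first file only (assumes a one-line header; adjust to your version).
first=1
for f in chunk_*.fragments; do
    if [ "$first" = 1 ]; then
        cat "$f" > first10M.fragments
        first=0
    else
        tail -n +2 "$f" >> first10M.fragments
    fi
done
```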

The indexing step can be parallelized, but this is very involved and requires a rewrite of the code.

I hope this clarifies the situation. Please let me know whether fragmentation works for you with smaller datasets.

Bests, Christian

chengthefang commented 3 years ago

Hi Christian,

Thank you for your prompt response. Yes, I have run the mmpdb commands on 1M compounds and everything worked fine.

So I think I can split the 10M compounds into 10 chunks, fragment each one, and combine the results. That sounds like a good approach. Thanks!

Regarding the indexing step, I am planning to add some constraints to reduce the DB size, as you suggested in an old post (https://github.com/rdkit/mmpdb/issues/6): for example, reduce --max-radius to 3, set --min-heavies-per-const-frag to 3, and turn on the --smallest-transformation-only flag. Are there any other suggestions for controlling the size of the generated DB without losing meaningful transforms?
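For reference, a hedged sketch of how those options could be placed across the two steps (the placement follows the discussion in issue #6 and is an illustration, not a verified recipe; double-check each flag against the --help output of your mmpdb version):

```sh
# --min-heavies-per-const-frag is applied at fragmentation time (assumption
# based on issue #6); --max-radius and --smallest-transformation-only are
# indexing-time options.
python mmpdb fragment first10M.smi --num-jobs 8 \
    --min-heavies-per-const-frag 3 -o first10M.fragments.gz
python mmpdb index first10M.fragments.gz --max-radius 3 \
    --smallest-transformation-only -o first10M.mmpdb
```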

Thanks, Cheng

KramerChristian commented 3 years ago

Dear Cheng,

I do not want to be discouraging, and I do not know how much memory you have available, but I have strong doubts that you will be able to index a DB with 10M compounds, even with the most restrictive settings. If you want to reduce the DB size, you can also restrict the fragment size to very small fragments. However, you will always have a tradeoff between creating a manageable DB size and losing interesting pairs.

Bests, Christian

chengthefang commented 3 years ago

Dear Christian,

Thank you for your comments. I agree that there is always a tradeoff between the DB size and the pair information retained. I will run some tests to see how it works.

Thanks, Cheng