materialsvirtuallab / maml

Python for Materials Machine Learning, Materials Descriptors, Machine Learning Force Fields, Deep Learning, etc.
BSD 3-Clause "New" or "Revised" License

[Bug]: `Parallel`/`multiprocessing` do not work for `Describer`s #637

Open kavanase opened 5 months ago

kavanase commented 5 months ago


Version

v2023.9.9

What happened?

Firstly, thanks for developing a really nice package! When I try to parse a list of Structure objects with M3GNetStructure(n_jobs=4).transform (to then perform DIRECT sampling), using the n_jobs argument to run the featurization in parallel (to speed up parsing, as suggested in the example notebook), I get the following error, which states that the transform functions cannot be pickled and so cannot be sent to the parallel workers:

joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/externals/loky/backend/queues.py", line 159, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "M3GNet_Structure_DIRECT_generation.py", line 66, in <module>
    m3gnet_struct.transform(collated_data["structures"][:1000])
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/maml/base/_describer.py", line 122, in transform
    features = Parallel(n_jobs=self.n_jobs)(delayed(cached_transform_one)(self, obj) for obj in objs)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1952, in __call__
    return output if self.return_generator else list(output)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1595, in _get_outputs
    yield from self._retrieve()
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1699, in _retrieve
    self._raise_error_fast()
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1734, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 736, in get_result
    return self._return_or_raise()
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 754, in _return_or_raise
    raise self._result
_pickle.PicklingError: Could not pickle the task to send it to the workers.
Exception ignored in: <function _CheckpointRestoreCoordinatorDeleter.__del__ at 0x36761e290>
Traceback (most recent call last):
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/tensorflow/python/checkpoint/checkpoint.py", line 197, in __del__
TypeError: 'NoneType' object is not callable
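
For reference, a minimal script along these lines is enough to trigger the error; the toy rocksalt structures below are stand-ins for my real dataset, and the import path follows the maml examples:

from maml.describers import M3GNetStructure
from pymatgen.core import Lattice, Structure

# Toy rocksalt structures standing in for the real dataset
structures = [
    Structure(Lattice.cubic(4.0), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
    for _ in range(100)
]

# n_jobs=1 runs fine (serially); any n_jobs > 1 raises the PicklingError above
features = M3GNetStructure(n_jobs=4).transform(structures)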

When I instead tried to run the transform function in parallel using multiprocessing (Pool.imap_unordered()) rather than joblib's Parallel, I got a similar pickling error:

  File "M3GNet_Structure_DIRECT_generation.py", line 77, in <module>
    results = list(tqdm(
  File "/Users/kavanase/miniconda3/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/Users/kavanase/miniconda3/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/Users/kavanase/miniconda3/lib/python3.10/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/Users/kavanase/miniconda3/lib/python3.10/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/Users/kavanase/miniconda3/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function _lambdifygenerated at 0x36e5f5000>: attribute lookup _lambdifygenerated on __main__ failed
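
That attempt was roughly of the following form (a sketch reconstructed from the traceback above; the exact variable names and chunking are assumptions):

from multiprocessing import Pool

from tqdm import tqdm

from maml.describers import M3GNetStructure

describer = M3GNetStructure()
structures = collated_data["structures"]  # as in the traceback above

with Pool(processes=4) as pool:
    # Pickling the bound method pulls the whole describer (and its
    # underlying model) into the pickle, which fails as shown above
    results = list(tqdm(
        pool.imap_unordered(describer.transform_one, structures),
        total=len(structures),
    ))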

For now, I can work around this by manually splitting my dataset into chunks and running a separate serial Python job for each chunk (see the sketch below), but it would be much easier for users if parallel processing were possible, as featurization can take quite a while.
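
A sketch of that chunking workaround, assuming transform returns a pandas DataFrame (as the sklearn set_output wrapper in the traceback suggests); featurize_chunk and the file names are illustrative:

import pandas as pd

from maml.describers import M3GNetStructure

def featurize_chunk(structures, out_path):
    """Featurize one chunk serially; each chunk runs as its own python job."""
    features = M3GNetStructure(n_jobs=1).transform(structures)
    features.to_pickle(out_path)

# In job i of n_chunks (submitted separately, e.g. from a shell loop):
#     featurize_chunk(all_structures[i::n_chunks], f"features_{i}.pkl")
# Afterwards, stitch the chunks back together:
#     features = pd.concat(pd.read_pickle(f"features_{i}.pkl") for i in range(n_chunks))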

Code snippet

M3GNetStructure(n_jobs=4).transform(list_of_structures)



zz11ss11zz commented 2 months ago

Hi Kavanase, I suggest using M3GNetStructure().transform_one() instead. You can then generate the features for each structure in parallel yourself and combine them for analysis.
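
A minimal sketch of that approach: build the describer inside each worker process so the unpicklable model never crosses the process boundary, and call transform_one per structure. The featurize_one helper and pool setup are illustrative, not part of maml's API:

from concurrent.futures import ProcessPoolExecutor

_DESCRIBER = None  # lazily created once per worker process

def featurize_one(structure):
    global _DESCRIBER
    if _DESCRIBER is None:
        # Construct inside the worker so the describer is never pickled,
        # and reuse it across calls to avoid repeated model loading
        from maml.describers import M3GNetStructure
        _DESCRIBER = M3GNetStructure()
    return _DESCRIBER.transform_one(structure)

if __name__ == "__main__":
    structures = [...]  # your list of pymatgen Structure objects
    with ProcessPoolExecutor(max_workers=4) as pool:
        features = list(pool.map(featurize_one, structures))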