tskit-dev / tskit

Population-scale genomics
MIT License

Cannot pickle '_tskit.Tree' object #2971

Closed Silloky closed 1 month ago

Silloky commented 1 month ago

I am trying to multiprocess a function that takes a Tree object as a parameter. I understand that multiprocessing requires serializing any data exchanged with other cores or even processors.

Thus, the Tree object needs to be pickled to be passed to the function running on another core. Here is my code:

from pathos.multiprocessing import ProcessingPool as Pool
...
pool = Pool(n_threads)
percentages = np.array(pool.map(functools.partial(f, pd_sequence), range(n_samples)))

pd_sequence is a pandas DataFrame containing tskit Tree objects.

This code throws:

TypeError                                 Traceback (most recent call last)
Cell In[186], line 11
      8     return bootstrap[bootstrap['monophyletic'] == True]['span'].sum() / bootstrap['span'].sum() # Gets the percentage of the sample trees that is monophyletic
     10 pool = Pool(n_threads) # Create a pool of 2 workers
---> 11 percentages = np.array(pool.map(functools.partial(f, pd_sequence), range(n_samples))) # Run bootstrap twice in parallel

File /<redacted>/.conda/lib/python3.11/site-packages/pathos/multiprocessing.py:135, in ProcessPool.map(self, f, *args, **kwds)
    133 AbstractWorkerPool._AbstractWorkerPool__map(self, f, *args, **kwds)
    134 _pool = self._serve()
--> 135 return _pool.map(star(f), zip(*args))

File /<redacted>/.conda/lib/python3.11/site-packages/multiprocess/pool.py:367, in Pool.map(self, func, iterable, chunksize)
    362 def map(self, func, iterable, chunksize=None):
    363     '''
    364     Apply `func` to each element in `iterable`, collecting the results
    365     in a list that is returned.
    366     '''
--> 367     return self._map_async(func, iterable, mapstar, chunksize).get()

File /<redacted>/.conda/lib/python3.11/site-packages/multiprocess/pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value
...
--> 578     rv = reduce(self.proto)
    579 else:
    580     reduce = getattr(obj, "__reduce__", None)

TypeError: cannot pickle '_tskit.Tree' object

I've tried debugging a bit using just dill and I can confirm a simple dill.dumps throws the same error.

I thought of transforming the Tree object to a dict using Tree.as_dict_of_dicts and then serializing. This works, but I then realised that I couldn't re-transform the dict into a valid Tree object, which is what I need...

I hope this is enough information for you to understand and reproduce the issue.

Thanks !

benjeffery commented 1 month ago

Hi @Silloky! Because of the way the tskit C extension works, you can't pickle a Tree. You can, however, pickle a TreeSequence, so if you really need to do this, I suggest passing the TreeSequence together with the index of the tree you want.

Silloky commented 1 month ago

Hello @benjeffery, thanks for your answer. Right, OK, I didn't know; tskit is fairly new to me tbh. I'll try that then.

I'd just like your opinion, though. In the code above, the f function only needs one single tree. Is it still efficient if I pass the entire TreeSequence instead of the one Tree it needs?

Silloky commented 1 month ago

Finally, it turns out that the issue I was encountering is insignificant as I can easily bypass this whole problem. Thanks for your help and I am now closing this issue.