minaskar / pocomc

pocoMC: A Python implementation of Preconditioned Monte Carlo for accelerated Bayesian Computation
https://pocomc.readthedocs.io
GNU General Public License v3.0

Pickling errors after checkpointing (with MPI) #43

Closed. ajdittmann closed this issue 4 months ago

ajdittmann commented 4 months ago

I cannot restart checkpointed jobs (on certain systems) because of the following error: _pickle.PicklingError: Can't pickle <function logl at 0x7f9f4464bb80>: it's not the same object as __main__.logl. I first encountered this when submitting jobs through slurm on a cluster, but I can reproduce it with the following script on a login node, executed as described in the documentation (I have also reproduced the error using other MPI pools).

import numpy as np
import pocomc as pc
from scipy.stats import uniform
from mpi4py.futures import MPIPoolExecutor

prior = pc.Prior([uniform(loc=-10, scale=20)]*20)

def logl(x):
  return -0.5*np.sum(x**2)

if __name__ == '__main__':
    with MPIPoolExecutor(2) as pool:
        sampler = pc.Sampler(prior, logl, pool=pool, output_dir='logs')
        sampler.run(save_every=1)
        #sampler.run(save_every=1, resume_state_path = 'logs/pmc_4.state')

This script runs fine with MPI on my laptop, which has mpi4py version 3.0.3 and OpenMPI 4.0.3 (the cluster has mpi4py 3.1.6 and OpenMPI 4.1.1 -- they have the same version of dill). I have attached a text file with the full error message here.
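
For reference, the mpi4py documentation launches MPIPoolExecutor scripts through the futures runner; the exact command used here is not quoted in the thread, but it would look something like:

mpiexec -n 3 python -m mpi4py.futures script.py   # 1 master process + 2 workers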

minaskar commented 4 months ago

Hi @ajdittmann,

Can you confirm that the checkpoint file was created/saved and loaded on the same machine with the same versions of Python and pocoMC?

ajdittmann commented 4 months ago

Yes, I can confirm that the checkpoint file was created and loaded on the same machine, using the same versions of Python and pocoMC.

Your question also led me to check the file system types, since I originally ran that test on a BeeGFS partition, which has sometimes caused problems before. However, I could reproduce the problem on that cluster, using the same versions of Python and pocoMC, on an NFS4 partition instead. My laptop, where the code works, uses ext4.

I also confirmed that restarting works correctly on the cluster environment when running without MPI.

minaskar commented 4 months ago

Can you try to install the checkpoint branch and let me know if that works?

The issue you’re encountering likely stems from the differences in how these file systems handle file locking and caching, which can affect the serialization and deserialization process used by pickle/dill.
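
For background, pickle serializes a top-level function by reference (module plus qualified name), so unpickling requires that __main__.logl resolve to the very same object in the receiving process; dill can instead serialize the function by value, so no lookup in __main__ is needed. A minimal sketch of the difference (illustrative only, not part of pocoMC):

import pickle
import dill

def logl(x):
  return -0.5 * x**2

# pickle stores only a reference to __main__.logl; unpickling fails if the
# worker's __main__.logl is a different object (e.g. after re-executing the script)
by_reference = pickle.dumps(logl)

# dill serializes the function body itself, so the worker never has to
# resolve __main__.logl at all
by_value = dill.dumps(logl, recurse=True)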

ajdittmann commented 4 months ago

I think it might also be related to how the MPI pool interacts with pickle/dill. Specifically, the checkpoint branch seems to work with some, but not all, MPI pool implementations (the main branch actually seems to work as well, depending on the pool).

With the method suggested in the documentation, I get the same error as before. However, using the MPI pool provided by the schwimmbad package (but only with the use_dill option), it seems to work (on both my NFS4 home directory and the BeeGFS high-performance file system).

I think the solution was actually using a dill-friendly MPI pool. Previously, I had tried the method in the documentation and schwimmbad (but without setting use_dill=True). After setting use_dill=True, the main branch seems to work as well.

The working version of the script is

import sys

import numpy as np
import pocomc as pc
from scipy.stats import uniform
from schwimmbad import MPIPool

prior = pc.Prior([uniform(loc=-10, scale=20)]*20)

def logl(x):
  return -0.5*np.sum(x**2)

# use_dill=True makes the pool serialize the likelihood with dill instead of pickle
Pool = MPIPool(use_dill=True)

# worker processes wait for tasks; only the master runs the sampler
if not Pool.is_master():
  Pool.wait()
  sys.exit(0)

sampler = pc.Sampler(prior, logl, pool=Pool, output_dir='logs')
#sampler.run(save_every=1)
sampler.run(save_every=1, resume_state_path = 'logs/pmc_5.state')

which can be run with mpirun -np 2 python script.py or similar.

So for my purposes it turns out the current main branch is sufficient. If anything, it might be worth adding a little warning/note in the documentation. Feel free to close whenever, and thank you for looking into this.

minaskar commented 4 months ago

Thank you for your feedback. I have just added a dill-friendly MPI pool to the main branch (you can read about it in the docs). Let me know if that also works for you.
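
For completeness, a sketch of what switching to the built-in pool might look like. The import path and context-manager usage below are assumptions made for illustration; the pocoMC docs describe the actual interface.

import numpy as np
import pocomc as pc
from scipy.stats import uniform
from pocomc.parallel import MPIPool   # hypothetical import path; see the docs

prior = pc.Prior([uniform(loc=-10, scale=20)]*20)

def logl(x):
  return -0.5*np.sum(x**2)

# assumed to handle the master/worker split internally, like schwimmbad's pool
with MPIPool() as pool:
    sampler = pc.Sampler(prior, logl, pool=pool, output_dir='logs')
    sampler.run(save_every=1)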

ajdittmann commented 4 months ago

Thanks for the update; it is nice not to have to rely on external packages. The new pool works nicely.