vlimant / NNLO

GNU General Public License v3.0

coordinator checkpointing issue #11

Closed vlimant closed 5 years ago

vlimant commented 5 years ago
Traceback (most recent call last):
  File "OptimizationDriver.py", line 302, in <module>
    opt_coordinator.run(num_iterations=args.num_iterations)
  File "/lustre/atlas2/csc291/scratch/vlimant/homeDL/NNLO/coordinator.py", line 147, in run
    self.save()
  File "/lustre/atlas2/csc291/scratch/vlimant/homeDL/NNLO/coordinator.py", line 84, in save
    pickle.dump( self.__dict__, state )
TypeError: can't pickle mpi4py.MPI.Intracomm objects

@vloncar, is self.comm supposed to be pickled?

vloncar commented 5 years ago

No. It probably worked before because a different MPI implementation handled the communicator handle differently.

Can you check whether the following code in coordinator.py works:

def save(self, fn = None):
    if fn is None:
        fn = '{}-coordinator.state'.format(self.label)
    self.history.setdefault('save', fn)
    with open(fn, 'wb') as state:
        # Work on a shallow copy: popping from self.__dict__ itself would
        # delete self.comm and self.req_dict from the running coordinator.
        self_dict = dict(self.__dict__)
        self_dict.pop('comm') # Skip MPI objects (they are invalid)
        self_dict.pop('req_dict')
        pickle.dump( self_dict, state )

and in load() remove the two self_dict.pop() lines, since the saved state will no longer contain those keys.
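
For reference, here is a minimal, self-contained sketch of the same idea: leave the live handles out of the pickled state and keep them on the running object. It is not the actual NNLO code. The Coordinator class below is a hypothetical stand-in, and a threading.Lock plays the role of the MPI communicator (it is also unpicklable), so the sketch runs without mpi4py. Only the attribute names comm, req_dict, label and history come from this thread.

```python
import os
import pickle
import tempfile
import threading

# Attributes that hold live handles and must be left out of the pickle.
# The names come from the thread; everything else here is hypothetical.
UNPICKLABLE = ('comm', 'req_dict')


class Coordinator:
    """Hypothetical stand-in for the class in coordinator.py."""

    def __init__(self, comm, label='demo'):
        self.comm = comm      # stand-in for the MPI communicator
        self.req_dict = {}    # stand-in for pending MPI requests
        self.label = label
        self.history = {}

    def save(self, fn=None):
        if fn is None:
            fn = '{}-coordinator.state'.format(self.label)
        # Build a filtered copy so the live object keeps its attributes.
        state = {k: v for k, v in self.__dict__.items()
                 if k not in UNPICKLABLE}
        with open(fn, 'wb') as f:
            pickle.dump(state, f)
        return fn

    def load(self, fn):
        with open(fn, 'rb') as f:
            # update() leaves self.comm and self.req_dict untouched.
            self.__dict__.update(pickle.load(f))


# A lock cannot be pickled, much like the communicator in the traceback.
coord = Coordinator(comm=threading.Lock(), label='run1')
coord.history['iterations'] = 3
path = coord.save(os.path.join(tempfile.mkdtemp(), 'coordinator.state'))

# A fresh process would construct its own communicator, then load the rest.
restored = Coordinator(comm=threading.Lock(), label='fresh')
restored.load(path)
```

Because save() filters into a new dict, coord.comm is still usable after checkpointing, and load() needs no pop() calls since the excluded keys were never written.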

vlimant commented 5 years ago

fae5dc3057b25ea3256e141e93f48c51fd8edb2f