nnaisense / evotorch

Advanced evolutionary computation library built directly on top of PyTorch, created at NNAISENSE.
https://evotorch.ai
Apache License 2.0

Serialization error when fixing seed. #65

Closed mjrs33 closed 1 year ago

mjrs33 commented 1 year ago

Hello,

Is there a way to fix the seed when num_actors is greater than 1?

The following code seems to give an error because torch._C.Generator cannot be serialized.

from evotorch import Problem
from evotorch.algorithms import SNES
from evotorch.logging import StdOutLogger
import torch

def sphere(x: torch.Tensor) -> torch.Tensor:
    return torch.sum(x.pow(2.0))

p = Problem(
    "min", sphere, solution_length=10, initial_bounds=(-1, 1),
    num_actors=4, seed=42
)
searcher = SNES(p, stdev_init=5)
logger = StdOutLogger(searcher)
searcher.step()

Error messages:

TypeError                                 Traceback (most recent call last)
File python/ray/_raylet.pyx:441, in ray._raylet.prepare_args_internal()

File ~/python/evotorch/lib/python3.9/site-packages/ray/_private/serialization.py:450, in SerializationContext.serialize(self, value)
    449 else:
--> 450     return self._serialize_to_msgpack(value)

File ~/python/evotorch/lib/python3.9/site-packages/ray/_private/serialization.py:428, in SerializationContext._serialize_to_msgpack(self, value)
    427     metadata = ray_constants.OBJECT_METADATA_TYPE_PYTHON
--> 428     pickle5_serialized_object = self._serialize_to_pickle5(
    429         metadata, python_objects
    430     )
    431 else:

File ~/python/evotorch/lib/python3.9/site-packages/ray/_private/serialization.py:390, in SerializationContext._serialize_to_pickle5(self, metadata, value)
    389     self.get_and_clear_contained_object_refs()
--> 390     raise e
    391 finally:

File ~/python/evotorch/lib/python3.9/site-packages/ray/_private/serialization.py:385, in SerializationContext._serialize_to_pickle5(self, metadata, value)
    384     self.set_in_band_serialization()
--> 385     inband = pickle.dumps(
    386         value, protocol=5, buffer_callback=writer.buffer_callback
    387     )
    388 except Exception as e:

File ~/python/evotorch/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py:73, in dumps(obj, protocol, buffer_callback)
     70 cp = CloudPickler(
     71     file, protocol=protocol, buffer_callback=buffer_callback
     72 )
---> 73 cp.dump(obj)
     74 return file.getvalue()

File ~/python/evotorch/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle_fast.py:627, in CloudPickler.dump(self, obj)
    626 try:
--> 627     return Pickler.dump(self, obj)
    628 except RuntimeError as e:

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/tools/cloning.py:285, in Serializable.__getstate__(self)
    284 memo = {id(self): self}
--> 285 return self._get_cloned_state(memo=memo)

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/core.py:2565, in Problem._get_cloned_state(self, memo)
   2564             with _no_grad_if_basic_dtype(self.dtype):
-> 2565                 result[k] = deep_clone(
   2566                     v,
   2567                     otherwise_deepcopy=True,
   2568                     memo=memo,
   2569                 )
   2570 return result

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/tools/cloning.py:185, in deep_clone(x, otherwise_deepcopy, otherwise_return, otherwise_fail, memo)
    184 if otherwise_deepcopy:
--> 185     result = deepcopy(x, memo=memo)
    186 elif otherwise_return:

File ~/tools/lib/python3.9/copy.py:161, in deepcopy(x, memo, _nil)
    160 if reductor is not None:
--> 161     rv = reductor(4)
    162 else:

TypeError: cannot pickle 'torch._C.Generator' object

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[497], line 15
     13 searcher = SNES(p, stdev_init=5)
     14 logger = StdOutLogger(searcher)
---> 15 searcher.step()

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/algorithms/searchalgorithm.py:390, in SearchAlgorithm.step(self)
    387 if self._first_step_datetime is None:
    388     self._first_step_datetime = datetime.now()
--> 390 self._step()
    391 self._steps_count += 1
    392 self.update_status({"iter": self._steps_count})

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/algorithms/distributed/gaussian.py:355, in GaussianSearchAlgorithm._step_non_distributed(self)
    350         self._population = SolutionBatch.cat(populations)
    352 if self._first_iter:
    353     # If we are computing the first generation, we just sample from our distribution and evaluate
    354     # the solutions.
--> 355     fill_and_eval_pop()
    356     self._first_iter = False
    357 else:
    358     # If we are computing next generations, then we need to compute the gradients of the last
    359     # generation, sample a new population, and evaluate the new population's solutions.

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/algorithms/distributed/gaussian.py:296, in GaussianSearchAlgorithm._step_non_distributed.<locals>.fill_and_eval_pop()
    293     self._distribution.sample(out=self._population.access_values(), generator=self.problem)
    295     # Finally, here, the solutions are evaluated.
--> 296     self.problem.evaluate(self._population)
    297 else:
    298     # If num_interactions is not None, then this means that we have a threshold for the number
    299     # of simulator interactions to reach before declaring the phase of sampling complete.
   (...)
    304     # Therefore, to properly count the simulator interactions we made during this generation, we need
    305     # to get the interaction count before starting our sampling and evaluation operations.
    306     first_num_interactions = self.problem.status.get("total_interaction_count", 0)

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/core.py:2386, in Problem.evaluate(self, x)
   2380 else:
   2381     raise TypeError(
   2382         f"The method `evaluate(...)` expected a Solution or a SolutionBatch as its argument."
   2383         f" However, the received object is {repr(x)}, which is of type {repr(type(x))}."
   2384     )
-> 2386 self._parallelize()
   2388 if self.is_main:
   2389     self.before_eval_hook(batch)

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/core.py:1880, in Problem._parallelize(self)
   1878     actors = [EvaluationActor.remote(self, i, all_seeds[i], remote_states[i]) for i in range(number_of_actors)]
   1879 else:
-> 1880     actors = [
   1881         EvaluationActor.options(**config_per_actor).remote(self, i, all_seeds[i], remote_states[i])
   1882         for i in range(number_of_actors)
   1883     ]
   1885 self._actors = actors
   1886 self._actor_pool = ActorPool(self._actors)

File ~/python/evotorch/lib/python3.9/site-packages/evotorch/core.py:1881, in <listcomp>(.0)
   1878     actors = [EvaluationActor.remote(self, i, all_seeds[i], remote_states[i]) for i in range(number_of_actors)]
   1879 else:
   1880     actors = [
-> 1881         EvaluationActor.options(**config_per_actor).remote(self, i, all_seeds[i], remote_states[i])
   1882         for i in range(number_of_actors)
   1883     ]
   1885 self._actors = actors
   1886 self._actor_pool = ActorPool(self._actors)

File ~/python/evotorch/lib/python3.9/site-packages/ray/actor.py:639, in ActorClass.options.<locals>.ActorOptionWrapper.remote(self, *args, **kwargs)
    638 def remote(self, *args, **kwargs):
--> 639     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)

File ~/python/evotorch/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py:387, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
    385 if not _is_tracing_enabled():
    386     assert "_ray_trace_ctx" not in kwargs
--> 387     return method(self, args, kwargs, *_args, **_kwargs)
    389 class_name = self.__ray_metadata__.class_name
    390 method_name = "__init__"

File ~/python/evotorch/lib/python3.9/site-packages/ray/actor.py:968, in ActorClass._remote(self, args, kwargs, **actor_options)
    958         func_name = meta.actor_creation_function_descriptor.function_name
    959     meta.actor_creation_function_descriptor = (
    960         cross_language._get_function_descriptor_for_actor_method(
    961             meta.language,
   (...)
    965         )
    966     )
--> 968 actor_id = worker.core_worker.create_actor(
    969     meta.language,
    970     meta.actor_creation_function_descriptor,
    971     creation_args,
    972     max_restarts,
    973     max_task_retries,
    974     resources,
    975     actor_placement_resources,
    976     max_concurrency,
    977     detached,
    978     name if name is not None else "",
    979     namespace if namespace is not None else "",
    980     is_asyncio,
    981     # Store actor_method_cpu in actor handle's extension data.
    982     extension_data=str(actor_method_cpu),
    983     serialized_runtime_env_info=serialized_runtime_env_info or "{}",
    984     concurrency_groups_dict=concurrency_groups_dict or dict(),
    985     max_pending_calls=max_pending_calls,
    986     scheduling_strategy=scheduling_strategy,
    987 )
    989 if _actor_launch_hook:
    990     _actor_launch_hook(
    991         meta.actor_creation_function_descriptor, resources, scheduling_strategy
    992     )

File python/ray/_raylet.pyx:1964, in ray._raylet.CoreWorker.create_actor()

File python/ray/_raylet.pyx:1969, in ray._raylet.CoreWorker.create_actor()

File python/ray/_raylet.pyx:407, in ray._raylet.prepare_args_and_increment_put_refs()

File python/ray/_raylet.pyx:398, in ray._raylet.prepare_args_and_increment_put_refs()

File python/ray/_raylet.pyx:449, in ray._raylet.prepare_args_internal()

TypeError: Could not serialize the argument <evotorch.core.Problem object at 0x2af78f01d0a0> for a task or actor evotorch.core.EvaluationActor.__init__. Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
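
The first traceback bottoms out in `copy.deepcopy`, which calls the object's `__reduce_ex__` and fails when that raises. The mechanism can be reproduced without EvoTorch, Ray, or PyTorch. Below is a minimal stdlib-only sketch; `FakeGenerator` and `Holder` are hypothetical stand-ins (not EvoTorch classes) for `torch._C.Generator` and a seeded `Problem`, and the assumption that the generator is only created when a seed is given is mine, inferred from the issue title.

```python
import copy


class FakeGenerator:
    """Hypothetical stand-in for an object that refuses to be reduced/pickled."""

    def __reduce_ex__(self, protocol):
        raise TypeError("cannot pickle 'FakeGenerator' object")


class Holder:
    """Hypothetical stand-in for an object that stores such a generator."""

    def __init__(self, seed=None):
        self.seed = seed
        # Assumption: the generator only exists when a seed was given.
        self.generator = FakeGenerator() if seed is not None else None


def can_deepcopy(obj):
    """Return True if obj survives copy.deepcopy, False if a TypeError is raised."""
    try:
        copy.deepcopy(obj)
        return True
    except TypeError:
        return False


unseeded_ok = can_deepcopy(Holder())
seeded_ok = can_deepcopy(Holder(seed=42))
print(unseeded_ok, seeded_ok)
```

In this sketch only the seeded object fails to copy, which mirrors the report: the failure appears once a seed is fixed and the object has to be sent to remote actors.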

engintoklu commented 1 year ago

Hello @mjrs33,

Thank you for trying out EvoTorch, and for your feedback!

Unfortunately, at the time of writing, we support manual seeding only when there are no remote actors. Reproducibility with parallelization is tricky; we are planning to support it in a future version.
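
One part of the difficulty is that every remote actor needs its own random stream, derived deterministically from the single user-supplied seed. The traceback above shows a per-actor seed (`all_seeds[i]`) being passed to each `EvaluationActor`, but how EvoTorch computes those seeds is not shown in this thread. The following is only a generic stdlib sketch of the idea; `derive_actor_seeds` and `actor_noise` are hypothetical helpers, not EvoTorch functions.

```python
import random


def derive_actor_seeds(master_seed, num_actors):
    """Derive one distinct seed per actor from a single master seed (hypothetical helper)."""
    rng = random.Random(master_seed)
    # sample() draws without replacement, so the seeds are guaranteed to be distinct.
    return rng.sample(range(2**32), num_actors)


def actor_noise(seed, n=3):
    """Simulate the random numbers one actor would draw from its own stream."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Two independent "runs" with the same master seed.
seeds_a = derive_actor_seeds(42, 4)
seeds_b = derive_actor_seeds(42, 4)
noise_a = [actor_noise(s) for s in seeds_a]
noise_b = [actor_noise(s) for s in seeds_b]
print(seeds_a == seeds_b, noise_a == noise_b)
```

Deterministic per-actor seeds are not sufficient on their own: if the evaluation itself draws random numbers, the result also depends on which actor ends up handling which solution, and that assignment may vary between runs.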

mjrs33 commented 1 year ago

Hello @engintoklu,

Thanks for your reply. I understand. I'm looking forward to this being resolved in a future version.

NaturalGradient commented 1 year ago

Marking this as resolved: it is on our radar, but there is no trivial fix that we can implement right now. Thanks, @mjrs33, for bringing this to our attention!