ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] SlateQ: Checkpoint restore fails #24110

Open christy opened 2 years ago

christy commented 2 years ago

What happened + What you expected to happen

Description

I was trying to save the SlateQ policy so that I could serve it. The save worked, but the restore did not. Restoring the trained SlateQ policy from the checkpoint file with trainer.restore() failed with the following error (the full error message is under Output):

Error detected in node 'default_policy/timestep_1' defined at: File "/Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/framework.py", line 247, in get_variable

TypeError: <tf.Tensor 'default_policy/timestep_1:0' shape=() dtype=resource> is out of scope and cannot be used here. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.

Steps to reproduce

  1. Save a trained SlateQ model into a checkpoint file. This works.

    checkpoint_file = slateq_trainer.save()
    print("The checkpoint directory contains the following files:")
    os.listdir(os.path.dirname(checkpoint_file))

  2. Try to restore a Trainer from the checkpoint file.

    new_slateq_trainer = SlateQTrainer(config=slateq_config)
    print(f"Before restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
    new_slateq_trainer.restore(checkpoint_file)
    print(f"After restoring: Trainer is at iteration={new_slateq_trainer.iteration}")

What was expected:

new_slateq_trainer.restore(checkpoint_file) would fully restore the trained policy to the Trainer.

Output:

Trainer (at iteration 21 was saved in '/Users/christy/ray_results/SlateQTrainer_modified_lts_2022-04-21_17-27-329v5r_d2_/checkpoint_000021/checkpoint-21'!
The checkpoint directory contains the following files:
['checkpoint-21', 'checkpoint-21.tune_metadata', '.is_checkpoint']
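Before attempting a restore, it can help to confirm that the checkpoint directory is complete. The following is a minimal sketch, assuming the file layout shown in the listing above (checkpoint-N, checkpoint-N.tune_metadata, .is_checkpoint). The helper name describe_checkpoint_dir is hypothetical and not part of Ray.

```python
import os
import re

# Assumption: file names follow the listing printed above
# (checkpoint-N, checkpoint-N.tune_metadata, .is_checkpoint).
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def describe_checkpoint_dir(checkpoint_dir):
    """Return (checkpoint_path, iteration, missing_files) for a checkpoint directory.

    Hypothetical helper for illustration only; it inspects file names and
    never opens or unpickles the checkpoint itself.
    """
    names = set(os.listdir(checkpoint_dir))

    # Collect every "checkpoint-N" file together with its iteration number.
    matches = []
    for name in names:
        match = CHECKPOINT_RE.match(name)
        if match:
            matches.append((int(match.group(1)), name))
    if not matches:
        raise FileNotFoundError("no checkpoint-N file in " + checkpoint_dir)

    # Use the highest iteration if several checkpoint files are present.
    iteration, name = max(matches)
    expected = {name + ".tune_metadata", ".is_checkpoint"}
    missing = sorted(expected - names)
    return os.path.join(checkpoint_dir, name), iteration, missing
```

For the directory above, this would return the path ending in checkpoint-21, the iteration 21, and an empty list of missing files, which suggests the save side is intact and the failure is in the restore path.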

Before restoring: Trainer is at iteration=0


TypeError Traceback (most recent call last)
Input In [29], in <cell line: 3>()
      1 new_slateq_trainer = SlateQTrainer(config=slateq_config)
      2 print(f"Before restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
----> 3 new_slateq_trainer.restore(checkpoint_file)
      4 print(f"After restoring: Trainer is at iteration={new_slateq_trainer.iteration}")

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/tune/trainable.py:529, in Trainable.restore(self, checkpoint_path)
    527 self.load_checkpoint(checkpoint_dict)
    528 else:
--> 529 self.load_checkpoint(checkpoint_path)
    530 self._time_since_restore = 0.0
    531 self._timesteps_since_restore = 0

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/agents/trainer.py:2033, in Trainer.load_checkpoint(self, checkpoint_path)
   2030 @override(Trainable)
   2031 def load_checkpoint(self, checkpoint_path: str) -> None:
   2032 extra_data = pickle.load(open(checkpoint_path, "rb"))
-> 2033 self.__setstate__(extra_data)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/agents/trainer.py:2675, in Trainer.__setstate__(self, state)
   2673 def __setstate__(self, state: dict):
   2674 if hasattr(self, "workers") and "worker" in state:
-> 2675 self.workers.local_worker().restore(state["worker"])
   2676 remote_state = ray.put(state["worker"])
   2677 for r in self.workers.remote_workers():

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py:1496, in RolloutWorker.restore(self, objs)
   1488 self.add_policy(
   1489 policy_id=pid,
   1490 policy_cls=pol_spec.policy_class,
   (...)
   1493 config=pol_spec.config,
   1494 )
   1495 else:
-> 1496 self.policy_map[pid].set_state(state)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/policy/tf_policy.py:528, in TFPolicy.set_state(self, state)
    526 # Set exploration's state.
    527 if hasattr(self, "exploration") and "_exploration_state" in state:
--> 528 self.exploration.set_state(
    529 state=state["_exploration_state"], sess=self.get_session()
    530 )
    532 # Set the Policy's (NN) weights.
    533 super().set_state(state)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/exploration/epsilon_greedy.py:241, in EpsilonGreedy.set_state(self, state, sess)
    238 @override(Exploration)
    239 def set_state(self, state: dict, sess: Optional["tf.Session"] = None) -> None:
    240 if self.framework == "tf":
--> 241 self.last_timestep.load(state["last_timestep"], session=sess)
    242 elif isinstance(self.last_timestep, int):
    243 self.last_timestep = state["last_timestep"]

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:348, in deprecated.<locals>.deprecated_wrapper.<locals>.new_func(*args, **kwargs)
    340 _PRINTED_WARNING[func] = True
    341 logging.warning(
    342 'From %s: %s (from %s) is deprecated and will be removed %s.\n'
    343 'Instructions for updating:\n%s',
    (...)
    346 'in a future version' if date is None else ('after %s' % date),
    347 instructions)
--> 348 return func(*args, **kwargs)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variables.py:1025, in Variable.load(self, value, session)
    992 """Load new value into this variable.
    993
    994 Writes new value to variable's memory. Doesn't add ops to the graph.
    (...)
   1022 ValueError: Session is not passed and no default session
   1023 """
   1024 if context.executing_eagerly():
-> 1025 self.assign(value)
   1026 else:
   1027 session = session or ops.get_default_session()

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py:915, in BaseResourceVariable.assign(self, value, use_locking, name, read_value)
    910 tensor_name = " " + str(self.name)
    911 raise ValueError(
    912 (f"Cannot assign value to variable '{tensor_name}': Shape mismatch."
    913 f"The variable shape {self._shape}, and the "
    914 f"assigned value shape {value_tensor.shape} are incompatible."))
--> 915 assign_op = gen_resource_variable_ops.assign_variable_op(
    916 self.handle, value_tensor, name=name)
    917 if read_value:
    918 return self._lazy_read(assign_op)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/gen_resource_variable_ops.py:149, in assign_variable_op(resource, value, name)
    147 pass
    148 try:
--> 149 return assign_variable_op_eager_fallback(
    150 resource, value, name=name, ctx=_ctx)
    151 except _core._SymbolicException:
    152 pass # Add nodes to the TensorFlow graph.

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/gen_resource_variable_ops.py:165, in assign_variable_op_eager_fallback(resource, value, name, ctx)
    163 _inputs_flat = [resource, value]
    164 _attrs = ("dtype", _attr_dtype)
--> 165 _result = _execute.execute(b"AssignVariableOp", 0, inputs=_inputs_flat,
    166 attrs=_attrs, ctx=ctx, name=name)
    167 _result = None
    168 return _result

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:72, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     68 if keras_symbolic_tensors:
     69 raise core._SymbolicException(
     70 "Inputs to eager execution function cannot be Keras symbolic "
     71 "tensors, but found {}".format(keras_symbolic_tensors))
---> 72 raise e
     73 # pylint: enable=protected-access
     74 return tensors

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:58, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     56 try:
     57 ctx.ensure_initialized()
---> 58 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     59 inputs, attrs, num_outputs)
     60 except core._NotOkStatusException as e:
     61 if name is not None:

TypeError: Originated from a graph execution error.

The graph execution error is detected at a node built at (most recent call last):

File /Users/christy/mambaforge/envs/rllib/lib/python3.9/runpy.py, line 197, in _run_module_as_main
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/runpy.py, line 87, in _run_code
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel_launcher.py, line 17, in <module>
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/traitlets/config/application.py, line 846, in launch_instance
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/kernelapp.py, line 712, in start
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tornado/platform/asyncio.py, line 199, in start
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/asyncio/base_events.py, line 601, in run_forever
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/asyncio/base_events.py, line 1905, in _run_once
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/asyncio/events.py, line 80, in _run
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/kernelbase.py, line 504, in dispatch_queue
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/kernelbase.py, line 493, in process_one
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/kernelbase.py, line 400, in dispatch_shell
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/kernelbase.py, line 724, in execute_request
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/ipkernel.py, line 390, in do_execute
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ipykernel/zmqshell.py, line 528, in run_cell
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/IPython/core/interactiveshell.py, line 2863, in run_cell
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/IPython/core/interactiveshell.py, line 2909, in _run_cell
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/IPython/core/async_helpers.py, line 129, in _pseudo_sync_runner
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/IPython/core/interactiveshell.py, line 3106, in run_cell_async
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/IPython/core/interactiveshell.py, line 3309, in run_ast_nodes
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/IPython/core/interactiveshell.py, line 3369, in run_code
File /var/folders/0g/jfs_l_113_356_c0rfp4jd8c0000gn/T/ipykernel_68581/3893148471.py, line 1, in <cell line: 1>
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/agents/trainer.py, line 830, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/tune/trainable.py, line 149, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/agents/trainer.py, line 911, in setup
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py, line 162, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py, line 567, in _make_worker
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py, line 626, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py, line 1722, in _build_policy_map
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py, line 140, in create_policy
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/policy/tf_policy_template.py, line 256, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/policy/dynamic_tf_policy.py, line 291, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/policy/policy.py, line 811, in _create_exploration
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/from_config.py, line 195, in from_config
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/exploration/epsilon_greedy.py, line 73, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/framework.py, line 247, in get_variable
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variable_scope.py, line 1579, in get_variable
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variable_scope.py, line 1322, in get_variable
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variable_scope.py, line 578, in get_variable
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variable_scope.py, line 531, in _true_getter
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variable_scope.py, line 952, in _get_single_variable
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py, line 150, in error_handler
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variables.py, line 268, in __call__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variables.py, line 213, in _variable_v1_call
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variables.py, line 206, in <lambda>
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variable_scope.py, line 2612, in default_variable_creator
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py, line 150, in error_handler
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variables.py, line 272, in __call__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py, line 1630, in __init__
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py, line 1792, in _init_from_args
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py, line 238, in eager_safe_variable_handle
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py, line 162, in _variable_handle_from_shape_and_dtype
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/gen_resource_variable_ops.py, line 1203, in var_handle_op
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py, line 744, in _apply_op_helper
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/framework/ops.py, line 3697, in _create_op_internal
File /Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/framework/ops.py, line 2101, in __init__

Error detected in node 'default_policy/timestep_1' defined at: File "/Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/framework.py", line 247, in get_variable

TypeError: tf.Graph captured an external symbolic tensor. The symbolic tensor 'default_policy/timestep_1:0' created by node 'default_policy/timestep_1' is captured by the tf.Graph being executed as an input. But a tf.Graph is not allowed to take symbolic tensors from another graph as its inputs. Make sure all captured inputs of the executing tf.Graph are not symbolic tensors. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.
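One reading of the traceback: EpsilonGreedy.set_state passes a session to Variable.load, but Variable.load checks context.executing_eagerly() first and, when that is true, calls assign directly on a variable that was created inside a graph. The toy model below is not TensorFlow code; it only mimics that branching so the failure mode is easier to see. The classes ToySession and ToyGraphVariable are hypothetical, and the explicit executing_eagerly parameter stands in for what TensorFlow reads from its global context.

```python
class ToySession:
    """Stand-in for a TF1-style session that holds values for graph variables."""

    def __init__(self):
        self.values = {}

    def assign(self, variable, value):
        self.values[variable.name] = value


class ToyGraphVariable:
    """Stand-in for a variable that was created inside a graph (symbolic)."""

    def __init__(self, name):
        self.name = name

    def load(self, value, session=None, executing_eagerly=False):
        # Mirrors the order of checks visible in the traceback for
        # Variable.load: the eager check comes first, so the session
        # argument is never consulted on that path.
        if executing_eagerly:
            raise TypeError(
                f"{self.name} is out of scope and cannot be used here."
            )
        if session is None:
            raise ValueError("Session is not passed and no default session")
        session.assign(self, value)


if __name__ == "__main__":
    sess = ToySession()
    var = ToyGraphVariable("default_policy/timestep_1")

    # Graph-mode path: the session receives the value.
    var.load(21, session=sess, executing_eagerly=False)
    print(sess.values)

    # Eager path: fails even though a session was supplied.
    try:
        var.load(21, session=sess, executing_eagerly=True)
    except TypeError as exc:
        print("TypeError:", exc)
```

If this reading is right, the restore would need to run the load in the same graph context in which the policy was built; whether that is the actual cause in RLlib is an assumption here, not something the traceback alone confirms.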

Versions / Dependencies

On my Mac M1: Python 3.9.12, ray 1.12.0, tf 2.7.0

I also tried this in a Colab, which gave the same error message: Python 3.7.13, ray 1.12.0, tf 2.8.0
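For anyone adding their own environment to this thread, a small sketch for collecting the same version information follows. It assumes Python 3.8 or newer (importlib.metadata is not available in the standard library on 3.7), and the helper name installed_versions is hypothetical. Distribution names can differ by platform, so the list passed in is an assumption to adjust as needed.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", platform.python_version())
    # Assumption: these are the distribution names used in your environment.
    for name, version in installed_versions(["ray", "tensorflow"]).items():
        print(name, version or "not installed")
```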

Reproduction script

    checkpoint_file = slateq_trainer.save()
    print(f"Trainer (at iteration {slateq_trainer.iteration} was saved in '{checkpoint_file}'!")
    print("The checkpoint directory contains the following files:")
    print(os.listdir(os.path.dirname(checkpoint_file)))

    new_slateq_trainer = SlateQTrainer(config=slateq_config)
    print(f"Before restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
    new_slateq_trainer.restore(checkpoint_file)
    print(f"After restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
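Because the full traceback is very long, a thin wrapper around the restore call can make a reproduction easier to report. This is a sketch only: try_restore is a hypothetical helper, and it assumes nothing about the trainer beyond the restore() method and iteration attribute used in the script above.

```python
def try_restore(trainer, checkpoint_file):
    """Attempt a restore and summarize the outcome.

    Returns (ok, iteration, error_summary). On failure, iteration is the
    value read before the restore attempt, and error_summary is the
    exception type plus the first line of its message.
    """
    iteration_before = trainer.iteration
    try:
        trainer.restore(checkpoint_file)
    except Exception as exc:  # broad on purpose: we only report, never hide
        lines = str(exc).splitlines() or [""]
        summary = "{}: {}".format(type(exc).__name__, lines[0])
        return False, iteration_before, summary
    return True, trainer.iteration, None
```

With the script above, calling try_restore(new_slateq_trainer, checkpoint_file) in place of the bare restore call would return a short tuple instead of raising, which is convenient when comparing behavior across algorithms or versions.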

Issue Severity

High: It blocks me from completing my task.

christy commented 2 years ago

In the Colab, I saw some different error messages. Maybe these are helpful?

TypeError: <tf.Tensor 'default_policy/timestep_1:0' shape=() dtype=resource> is out of scope and cannot be used here. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.

The tensor <tf.Tensor 'default_policy/timestep_1:0' shape=() dtype=resource> cannot be accessed from here, because it was defined in <tensorflow.python.framework.ops.Graph object at 0x7fe8e179f5d0>, which is out of scope.

tmontana commented 1 year ago

I'm having the same issue. Any updates? Specifically, it occurs with a config created via from ray.rllib.algorithms.dqn import DQNConfig, but not with one created via from ray.rllib.algorithms.ppo import PPOConfig. I can save a checkpoint, but restore fails with the error messages above.

many thanks,
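The DQN versus PPO difference would be consistent with the traceback, which fails inside EpsilonGreedy.set_state: as far as I know, DQN's default exploration in RLlib is EpsilonGreedy while PPO's is StochasticSampling. That is a hypothesis rather than a confirmed diagnosis. The sketch below checks a plain dict form of a config for an EpsilonGreedy-style exploration; the function names are hypothetical, and the "exploration_config" and "type" keys are assumed from RLlib's dict-style configs.

```python
def exploration_type(config):
    """Return the exploration "type" entry of a dict-style config, or None."""
    exploration = config.get("exploration_config") or {}
    return exploration.get("type")


def goes_through_epsilon_greedy(config):
    """Guess whether restore would hit an EpsilonGreedy-style set_state.

    Assumption: the type is given either as a string name or as a class,
    and subclasses keep "EpsilonGreedy" in their name.
    """
    kind = exploration_type(config)
    if kind is None:
        return False
    name = kind if isinstance(kind, str) else getattr(kind, "__name__", "")
    return "EpsilonGreedy" in name


if __name__ == "__main__":
    dqn_like = {"exploration_config": {"type": "EpsilonGreedy"}}
    ppo_like = {"exploration_config": {"type": "StochasticSampling"}}
    print(goes_through_epsilon_greedy(dqn_like))
    print(goes_through_epsilon_greedy(ppo_like))
```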