Open christy opened 2 years ago
In the Colab, I saw some different error messages. Maybe these are helpful:
TypeError: <tf.Tensor 'default_policy/timestep_1:0' shape=() dtype=resource> is out of scope and cannot be used here. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.
The tensor <tf.Tensor 'default_policy/timestep_1:0' shape=() dtype=resource> cannot be accessed from here, because it was defined in <tensorflow.python.framework.ops.Graph object at 0x7fe8e179f5d0>, which is out of scope.
I'm having the same issue. Any updates? Specifically, it occurs with a config built via from ray.rllib.algorithms.dqn import DQNConfig, but not with from ray.rllib.algorithms.ppo import PPOConfig. I can save a checkpoint, but restore fails with the error messages above.
many thanks,
What happened + What you expected to happen
Description
I was trying to save the SlateQ policy so that I could serve it. The save worked, but the restore did not. Restoring the trained SlateQ policy from the checkpoint file with trainer.restore() raised the following error (the full traceback is under Output):
Error detected in node 'default_policy/timestep_1' defined at: File "/Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/framework.py", line 247, in get_variable
TypeError: <tf.Tensor 'default_policy/timestep_1:0' shape=() dtype=resource> is out of scope and cannot be used here. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.
Steps to reproduce
Save a trained SlateQ model into a checkpoint file. This works.
checkpoint_file = slateq_trainer.save()
print("The checkpoint directory contains the following files:")
os.listdir(os.path.dirname(checkpoint_file))
Try to restore a Trainer from the checkpoint file. This fails with the TypeError shown above.
new_slateq_trainer = SlateQTrainer(config=slateq_config)
print(f"Before restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
new_slateq_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
What was expected:
new_slateq_trainer.restore(checkpoint_file) would fully restore the trained policy to the Trainer.
Output:
Trainer (at iteration 21 was saved in '/Users/christy/ray_results/SlateQTrainer_modified_lts_2022-04-21_17-27-329v5r_d2_/checkpoint_000021/checkpoint-21'!
The checkpoint directory contains the following files:['checkpoint-21', 'checkpoint-21.tune_metadata', '.is_checkpoint']
Before restoring: Trainer is at iteration=0
TypeError Traceback (most recent call last)
Input In [29], in <cell line: 3>()
1 new_slateq_trainer = SlateQTrainer(config=slateq_config)
2 print(f"Before restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
----> 3 new_slateq_trainer.restore(checkpoint_file)
4 print(f"After restoring: Trainer is at iteration={new_slateq_trainer.iteration}")

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/tune/trainable.py:529, in Trainable.restore(self, checkpoint_path)
527 self.load_checkpoint(checkpoint_dict)
528 else:
--> 529 self.load_checkpoint(checkpoint_path)
530 self._time_since_restore = 0.0
531 self._timesteps_since_restore = 0

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/agents/trainer.py:2033, in Trainer.load_checkpoint(self, checkpoint_path)
2030 @override(Trainable)
2031 def load_checkpoint(self, checkpoint_path: str) -> None:
2032 extra_data = pickle.load(open(checkpoint_path, "rb"))
-> 2033 self.__setstate__(extra_data)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/agents/trainer.py:2675, in Trainer.__setstate__(self, state)
2673 def __setstate__(self, state: dict):
2674 if hasattr(self, "workers") and "worker" in state:
-> 2675 self.workers.local_worker().restore(state["worker"])
2676 remote_state = ray.put(state["worker"])
2677 for r in self.workers.remote_workers():

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py:1496, in RolloutWorker.restore(self, objs)
1488 self.add_policy(
1489 policy_id=pid,
1490 policy_cls=pol_spec.policy_class,
(...)
1493 config=pol_spec.config,
1494 )
1495 else:
-> 1496 self.policy_map[pid].set_state(state)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/policy/tf_policy.py:528, in TFPolicy.set_state(self, state)
526 # Set exploration's state.
527 if hasattr(self, "exploration") and "_exploration_state" in state:
--> 528 self.exploration.set_state(
529 state=state["_exploration_state"], sess=self.get_session()
530 )
532 # Set the Policy's (NN) weights.
533 super().set_state(state)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/exploration/epsilon_greedy.py:241, in EpsilonGreedy.set_state(self, state, sess)
238 @override(Exploration)
239 def set_state(self, state: dict, sess: Optional["tf.Session"] = None) -> None:
240 if self.framework == "tf":
--> 241 self.last_timestep.load(state["last_timestep"], session=sess)
242 elif isinstance(self.last_timestep, int):
243 self.last_timestep = state["last_timestep"]
File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:348, in deprecated.<locals>.deprecated_wrapper.<locals>.new_func(*args, **kwargs)
340 _PRINTED_WARNING[func] = True
341 logging.warning(
342 'From %s: %s (from %s) is deprecated and will be removed %s.\n'
343 'Instructions for updating:\n%s',
(...)
346 'in a future version' if date is None else ('after %s' % date),
347 instructions)
--> 348 return func(*args, **kwargs)
File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/variables.py:1025, in Variable.load(self, value, session)
992 """Load new value into this variable.
993
994 Writes new value to variable's memory. Doesn't add ops to the graph.
(...)
1022 ValueError: Session is not passed and no default session
1023 """
1024 if context.executing_eagerly():
-> 1025 self.assign(value)
1026 else:
1027 session = session or ops.get_default_session()

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/resource_variable_ops.py:915, in BaseResourceVariable.assign(self, value, use_locking, name, read_value)
910 tensor_name = " " + str(self.name)
911 raise ValueError(
912 (f"Cannot assign value to variable '{tensor_name}': Shape mismatch."
913 f"The variable shape {self._shape}, and the "
914 f"assigned value shape {value_tensor.shape} are incompatible."))
--> 915 assign_op = gen_resource_variable_ops.assign_variable_op(
916 self.handle, value_tensor, name=name)
917 if read_value:
918 return self._lazy_read(assign_op)

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/gen_resource_variable_ops.py:149, in assign_variable_op(resource, value, name)
147 pass
148 try:
--> 149 return assign_variable_op_eager_fallback(
150 resource, value, name=name, ctx=_ctx)
151 except _core._SymbolicException:
152 pass # Add nodes to the TensorFlow graph.

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/ops/gen_resource_variable_ops.py:165, in assign_variable_op_eager_fallback(resource, value, name, ctx)
163 _inputs_flat = [resource, value]
164 _attrs = ("dtype", _attr_dtype)
--> 165 _result = _execute.execute(b"AssignVariableOp", 0, inputs=_inputs_flat,
166 attrs=_attrs, ctx=ctx, name=name)
167 _result = None
168 return _result

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:72, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
68 if keras_symbolic_tensors:
69 raise core._SymbolicException(
70 "Inputs to eager execution function cannot be Keras symbolic "
71 "tensors, but found {}".format(keras_symbolic_tensors))
---> 72 raise e
73 # pylint: enable=protected-access
74 return tensors

File ~/mambaforge/envs/rllib/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:58, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
56 try:
57 ctx.ensure_initialized()
---> 58 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
59 inputs, attrs, num_outputs)
60 except core._NotOkStatusException as e:
61 if name is not None:
TypeError: Originated from a graph execution error.
The graph execution error is detected at a node built at (most recent call last):
Error detected in node 'default_policy/timestep_1' defined at: File "/Users/christy/mambaforge/envs/rllib/lib/python3.9/site-packages/ray/rllib/utils/framework.py", line 247, in get_variable
TypeError: tf.Graph captured an external symbolic tensor. The symbolic tensor 'default_policy/timestep_1:0' created by node 'default_policy/timestep_1' is captured by the tf.Graph being executed as an input. But a tf.Graph is not allowed to take symbolic tensors from another graph as its inputs. Make sure all captured inputs of the executing tf.Graph are not symbolic tensors. Use return values, explicit Python locals or TensorFlow collections to access it. Please see https://www.tensorflow.org/guide/function#all_outputs_of_a_tffunction_must_be_return_values for more information.
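For anyone triaging this: the traceback shows Trainer.load_checkpoint unpickling the checkpoint file and then passing state["worker"] on to the local worker, so the saved state can probably be inspected without calling restore() at all. Below is a minimal sketch, assuming the checkpoint file is a plain pickle of a dict, as the pickle.load(...) call in the traceback suggests. The helper name summarize_checkpoint is hypothetical (it is not part of RLlib), and unpickling a real checkpoint may require the same Ray/TF environment that wrote it.

```python
import pickle


def summarize_checkpoint(checkpoint_path):
    """Return {top-level key: type name} for a pickled checkpoint dict.

    Assumption: the file is a plain pickle of a dict, as the
    pickle.load(open(checkpoint_path, "rb")) call in the traceback suggests.
    """
    with open(checkpoint_path, "rb") as f:
        state = pickle.load(f)
    if not isinstance(state, dict):
        raise TypeError(f"expected a dict, got {type(state).__name__}")
    # Only report key names and value types; the values can be large.
    return {key: type(value).__name__ for key, value in state.items()}
```

Calling summarize_checkpoint(checkpoint_file) with the path returned by save() should list a "worker" entry if the assumption holds.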
Versions / Dependencies
On my Mac M1: Python 3.9.12 ray: 1.12.0 tf: 2.7.0
I also tried this in a Colab which gave the same error message: Python 3.7.13 ray: 1.12.0 tf: 2.8.0
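In case others want to report their environment in the same form, here is a small sketch that collects the versions programmatically. The function name report_versions is made up for illustration. It assumes Python 3.8+ for importlib.metadata (so it would not run on the Python 3.7 Colab as-is), and it assumes the distributions are named "ray" and "tensorflow"; on Apple Silicon TensorFlow may be installed under a different distribution name, in which case it is reported as not installed.

```python
import platform
from importlib import metadata  # available on Python 3.8+


def report_versions(packages=("ray", "tensorflow")):
    """Return the Python version plus the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is missing or installed under another name.
            versions[name] = "not installed"
    return versions


print(report_versions())
```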
Reproduction script
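import os
from ray.rllib.agents.slateq import SlateQTrainer  # import path as of Ray 1.12
# slateq_trainer (already trained) and slateq_config come from earlier notebook cells.
checkpoint_file = slateq_trainer.save()
print(f"Trainer (at iteration {slateq_trainer.iteration} was saved in '{checkpoint_file}'!")
print("The checkpoint directory contains the following files:")
print(os.listdir(os.path.dirname(checkpoint_file)))
# Restore into a fresh Trainer; the restore() call is where the TypeError is raised.
new_slateq_trainer = SlateQTrainer(config=slateq_config)
print(f"Before restoring: Trainer is at iteration={new_slateq_trainer.iteration}")
new_slateq_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_slateq_trainer.iteration}")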
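When re-running the script it may help to catch the failure and keep the exception text next to the iteration counters, for example to compare SlateQ or DQN against PPO. This is a rough sketch only. try_restore and its make_trainer argument are hypothetical helpers, not RLlib API, and the code assumes nothing beyond the .iteration attribute and .restore() method used in the script.

```python
def try_restore(make_trainer, checkpoint_file):
    """Build a fresh trainer, attempt restore(), and report what happened.

    make_trainer: zero-argument callable returning an object with an
    `iteration` attribute and a `restore(path)` method (hypothetical
    argument, e.g. lambda: SlateQTrainer(config=slateq_config)).
    """
    trainer = make_trainer()
    report = {"before": trainer.iteration, "after": None, "error": None}
    try:
        trainer.restore(checkpoint_file)
    except Exception as exc:  # keep the error instead of aborting the cell
        report["error"] = f"{type(exc).__name__}: {exc}"
    else:
        report["after"] = trainer.iteration
    return report
```

With the names from the script, try_restore(lambda: SlateQTrainer(config=slateq_config), checkpoint_file) would return a dict whose "error" entry holds the TypeError message when the restore fails.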
Issue Severity
High: It blocks me from completing my task.