ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[RLlib][Core] Attempting to perform BLAS operation using StreamExecutor without BLAS support #34446

Open simonsays1980 opened 1 year ago

simonsays1980 commented 1 year ago

What happened + What you expected to happen

What happened

I start a GPU cluster on GCP via the Autoscaler (see the cluster YAML below) and then run a simple script using RLlib (and Tune). I get an error that basically tells me:

Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

See below for the complete error output. The usual reason for this error is that GPU memory growth is not enabled, but RLlib enables memory growth by default. Also, I use the default RLlib model, which should not be so large that it occupies the whole VRAM (though the trainer process does claim a lot of it). The script runs if I do not use fractional GPUs.

It would be great if the Ray development team could share some experience (or best practices) here regarding GPU fractioning and model VRAM sizes.
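
For reference, the manual workaround I am referring to looks roughly like this (a minimal sketch of explicitly enabling TF memory growth; RLlib should already do the equivalent by default for its TF processes):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (nearly) all VRAM up front.
# This must run before any GPUs are initialized in the process.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)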

What I expected to happen

That the example script runs through seamlessly with this small model and only two workers on an NVIDIA Tesla V100 with around 16 GB of VRAM.

Here is the complete error message:

Job submission server address: http://127.0.0.1:8265
Warning: Gym version v0.24.0 has a number of critical issues with `gym.make` such that the `reset` and `step` functions are called before returning the environment. It is recommend to downgrading to v0.23.1 or upgrading to v0.25.1
/home/ray/anaconda3/lib/python3.9/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
/home/ray/anaconda3/lib/python3.9/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
  from collections import Mapping, MutableMapping
/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  'nearest': pil_image.NEAREST,
/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  'bilinear': pil_image.BILINEAR,
/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  'bicubic': pil_image.BICUBIC,
/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
  'hamming': pil_image.HAMMING,
/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
  'box': pil_image.BOX,
/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  'lanczos': pil_image.LANCZOS,
/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if (distutils.version.LooseVersion(tf.__version__) <
2023-04-15 02:58:08,208 INFO worker.py:1230 -- Using address 10.138.0.31:6379 set in the environment variable RAY_ADDRESS
2023-04-15 02:58:08,208 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.138.0.31:6379...
2023-04-15 02:58:08,219 INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
/home/ray/anaconda3/lib/python3.9/site-packages/jupyter_client/connect.py:27: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
given by the platformdirs library.  To remove this warning and
see the appropriate new directories, set the environment variable
`JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
The use of platformdirs will be the default in `jupyter_core` v6
  from jupyter_core.paths import jupyter_data_dir
2023-04-15 02:58:08,747 INFO algorithm_config.py:2492 -- Executing eagerly (framework='tf2'), with eager_tracing=tf2. For production workloads, make sure to set eager_tracing=True  in order to match the speed of tf-static-graph (framework='tf'). For debugging purposes, `eager_tracing=False` is the best choice.
2023-04-15 02:58:08,748 INFO algorithm_config.py:2492 -- Executing eagerly (framework='tf2'), with eager_tracing=tf2. For production workloads, make sure to set eager_tracing=True  in order to match the speed of tf-static-graph (framework='tf'). For debugging purposes, `eager_tracing=False` is the best choice.
(pid=931) Warning: Gym version v0.24.0 has a number of critical issues with `gym.make` such that the `reset` and `step` functions are called before returning the environment. It is recommend to downgrading to v0.23.1 or upgrading to v0.25.1
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
(pid=931)   import imp
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
(pid=931)   from collections import Mapping, MutableMapping
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
(pid=931)   'nearest': pil_image.NEAREST,
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
(pid=931)   'bilinear': pil_image.BILINEAR,
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
(pid=931)   'bicubic': pil_image.BICUBIC,
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
(pid=931)   'hamming': pil_image.HAMMING,
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
(pid=931)   'box': pil_image.BOX,
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
(pid=931)   'lanczos': pil_image.LANCZOS,
(pid=931) /home/ray/anaconda3/lib/python3.9/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(pid=931)   if (distutils.version.LooseVersion(tf.__version__) <
(PPO pid=931) 2023-04-15 02:58:17,303   WARNING algorithm_config.py:488 -- Cannot create PPOConfig from given `config_dict`! Property __stdout_file__ not supported.
(PPO pid=931) 2023-04-15 02:58:17,303   INFO algorithm_config.py:2492 -- Executing eagerly (framework='tf2'), with eager_tracing=tf2. For production workloads, make sure to set eager_tracing=True  in order to match the speed of tf-static-graph (framework='tf'). For debugging purposes, `eager_tracing=False` is the best choice.
(pid=989) Warning: Gym version v0.24.0 has a number of critical issues with `gym.make` such that the `reset` and `step` functions are called before returning the environment. It is recommend to downgrading to v0.23.1 or upgrading to v0.25.1
(pid=988) Warning: Gym version v0.24.0 has a number of critical issues with `gym.make` such that the `reset` and `step` functions are called before returning the environment. It is recommend to downgrading to v0.23.1 or upgrading to v0.25.1
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
(pid=989)   import imp
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
(pid=988)   import imp
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
(pid=989)   from collections import Mapping, MutableMapping
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
(pid=988)   from collections import Mapping, MutableMapping
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
(pid=989)   'nearest': pil_image.NEAREST,
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
(pid=989)   'bilinear': pil_image.BILINEAR,
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
(pid=989)   'bicubic': pil_image.BICUBIC,
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
(pid=989)   'hamming': pil_image.HAMMING,
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
(pid=989)   'box': pil_image.BOX,
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
(pid=989)   'lanczos': pil_image.LANCZOS,
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:36: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
(pid=988)   'nearest': pil_image.NEAREST,
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:37: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
(pid=988)   'bilinear': pil_image.BILINEAR,
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
(pid=988)   'bicubic': pil_image.BICUBIC,
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:39: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
(pid=988)   'hamming': pil_image.HAMMING,
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:40: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
(pid=988)   'box': pil_image.BOX,
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/image_utils.py:41: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
(pid=988)   'lanczos': pil_image.LANCZOS,
(pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(pid=989)   if (distutils.version.LooseVersion(tf.__version__) <
(pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/tensorflow_probability/python/__init__.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(pid=988)   if (distutils.version.LooseVersion(tf.__version__) <
(RolloutWorker pid=989) /home/ray/anaconda3/lib/python3.9/site-packages/gym/core.py:200: DeprecationWarning: WARN: Function `env.seed(seed)` is marked as deprecated and will be removed in the future. Please use `env.reset(seed=seed)` instead.
(RolloutWorker pid=989)   deprecation(
(RolloutWorker pid=988) 2023-04-15 02:58:24,499 WARNING env.py:147 -- Your env doesn't have a .spec.max_episode_steps attribute. This is fine if you have set 'horizon' in your config dictionary, or `soft_horizon`. However, if you haven't, 'horizon' will default to infinity, and your environment will not be reset.
(RolloutWorker pid=988) /home/ray/anaconda3/lib/python3.9/site-packages/gym/core.py:200: DeprecationWarning: WARN: Function `env.seed(seed)` is marked as deprecated and will be removed in the future. Please use `env.reset(seed=seed)` instead.
(RolloutWorker pid=988)   deprecation(
(RolloutWorker pid=989) 2023-04-15 02:58:25,983 ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=989, ip=10.138.0.31, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f830266a670>)
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
(RolloutWorker pid=989)     self._build_policy_map(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
(RolloutWorker pid=989)     self.policy_map.create_policy(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
(RolloutWorker pid=989)     policy = create_policy_for_framework(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
(RolloutWorker pid=989)     return policy_class(observation_space, action_space, merged_config)
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
(RolloutWorker pid=989)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
(RolloutWorker pid=989)     self.maybe_initialize_optimizer_and_loss()
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
(RolloutWorker pid=989)     self._initialize_loss_from_dummy_batch(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
(RolloutWorker pid=989)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
(RolloutWorker pid=989)     return obj(self_, *args, **kwargs)
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
(RolloutWorker pid=989)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
(RolloutWorker pid=989)     ret = self._compute_actions_helper(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(RolloutWorker pid=989)     return func(self, *a, **k)
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
(RolloutWorker pid=989)     dist_inputs, state_out = self.model(
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
(RolloutWorker pid=989)     res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
(RolloutWorker pid=989)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
(RolloutWorker pid=989)     raise e.with_traceback(filtered_tb) from None
(RolloutWorker pid=989)   File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
(RolloutWorker pid=989)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(RolloutWorker pid=989) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).
(RolloutWorker pid=989) 
(RolloutWorker pid=989) Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(RolloutWorker pid=989) 
(RolloutWorker pid=989) Call arguments received by layer "fc_value_1" (type Dense):
(RolloutWorker pid=989)   • inputs=tf.Tensor(shape=(32, 57), dtype=float32)
(RolloutWorker pid=988) 2023-04-15 02:58:26,006 ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=988, ip=10.138.0.31, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fb45cf24700>)
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
(RolloutWorker pid=988)     self._build_policy_map(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
(RolloutWorker pid=988)     self.policy_map.create_policy(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
(RolloutWorker pid=988)     policy = create_policy_for_framework(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
(RolloutWorker pid=988)     return policy_class(observation_space, action_space, merged_config)
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
(RolloutWorker pid=988)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
(RolloutWorker pid=988)     self.maybe_initialize_optimizer_and_loss()
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
(RolloutWorker pid=988)     self._initialize_loss_from_dummy_batch(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
(RolloutWorker pid=988)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
(RolloutWorker pid=988)     return obj(self_, *args, **kwargs)
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
(RolloutWorker pid=988)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
(RolloutWorker pid=988)     ret = self._compute_actions_helper(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(RolloutWorker pid=988)     return func(self, *a, **k)
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
(RolloutWorker pid=988)     dist_inputs, state_out = self.model(
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
(RolloutWorker pid=988)     res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
(RolloutWorker pid=988)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
(RolloutWorker pid=988)     raise e.with_traceback(filtered_tb) from None
(RolloutWorker pid=988)   File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
(RolloutWorker pid=988)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(RolloutWorker pid=988) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).
(RolloutWorker pid=988) 
(RolloutWorker pid=988) Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(RolloutWorker pid=988) 
(RolloutWorker pid=988) Call arguments received by layer "fc_value_1" (type Dense):
(RolloutWorker pid=988)   • inputs=tf.Tensor(shape=(32, 57), dtype=float32)
(PPO pid=931) 2023-04-15 02:58:26,012   ERROR actor_manager.py:486 -- Ray error, taking actor 1 out of service. The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=988, ip=10.138.0.31, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fb45cf24700>)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
(PPO pid=931)     self._build_policy_map(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
(PPO pid=931)     self.policy_map.create_policy(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
(PPO pid=931)     policy = create_policy_for_framework(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
(PPO pid=931)     return policy_class(observation_space, action_space, merged_config)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
(PPO pid=931)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
(PPO pid=931)     self.maybe_initialize_optimizer_and_loss()
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
(PPO pid=931)     self._initialize_loss_from_dummy_batch(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
(PPO pid=931)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
(PPO pid=931)     return obj(self_, *args, **kwargs)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
(PPO pid=931)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
(PPO pid=931)     ret = self._compute_actions_helper(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(PPO pid=931)     return func(self, *a, **k)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
(PPO pid=931)     dist_inputs, state_out = self.model(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
(PPO pid=931)     res = self.forward(restored, state or [], seq_lens)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
(PPO pid=931)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
(PPO pid=931)     raise e.with_traceback(filtered_tb) from None
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
(PPO pid=931)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(PPO pid=931) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).
(PPO pid=931) 
(PPO pid=931) Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(PPO pid=931) 
(PPO pid=931) Call arguments received by layer "fc_value_1" (type Dense):
(PPO pid=931)   • inputs=tf.Tensor(shape=(32, 57), dtype=float32)
(PPO pid=931) 2023-04-15 02:58:26,014   ERROR actor_manager.py:486 -- Ray error, taking actor 2 out of service. The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=989, ip=10.138.0.31, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f830266a670>)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
(PPO pid=931)     self._build_policy_map(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
(PPO pid=931)     self.policy_map.create_policy(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
(PPO pid=931)     policy = create_policy_for_framework(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
(PPO pid=931)     return policy_class(observation_space, action_space, merged_config)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
(PPO pid=931)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
(PPO pid=931)     self.maybe_initialize_optimizer_and_loss()
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
(PPO pid=931)     self._initialize_loss_from_dummy_batch(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
(PPO pid=931)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
(PPO pid=931)     return obj(self_, *args, **kwargs)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
(PPO pid=931)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
(PPO pid=931)     ret = self._compute_actions_helper(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(PPO pid=931)     return func(self, *a, **k)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
(PPO pid=931)     dist_inputs, state_out = self.model(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
(PPO pid=931)     res = self.forward(restored, state or [], seq_lens)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
(PPO pid=931)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
(PPO pid=931)     raise e.with_traceback(filtered_tb) from None
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
(PPO pid=931)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(PPO pid=931) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).
(PPO pid=931) 
(PPO pid=931) Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(PPO pid=931) 
(PPO pid=931) Call arguments received by layer "fc_value_1" (type Dense):
(PPO pid=931)   • inputs=tf.Tensor(shape=(32, 57), dtype=float32)
== Status ==
Current time: 2023-04-15 02:58:26 (running for 00:00:17.29)
Memory usage on this node: 5.3/76.7 GiB 
Using FIFO scheduling algorithm.
Resources requested: 3.0/12 CPUs, 0.777/1 GPUs, 0.0/45.86 GiB heap, 0.0/22.93 GiB objects
Result logdir: /home/ray/ray_results/TestPPORandomEnvRegAction/PPO
Number of trials: 1/1 (1 RUNNING)

2023-04-15 02:58:26,032 ERROR trial_runner.py:1088 -- Trial PPO_random_env_0621a_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/execution/ray_trial_executor.py", line 1070, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2311, in get
    raise value
  File "python/ray/_raylet.pyx", line 1135, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1045, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 782, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 945, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 599, in ray._raylet.store_task_errors
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=931, ip=10.138.0.31, repr=PPO)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 441, in __init__
    super().__init__(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 566, in setup
    self.workers = WorkerSet(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 191, in __init__
    raise e.args[0].args[2]
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
    self._build_policy_map(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
    self.policy_map.create_policy(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
    policy = create_policy_for_framework(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
    return policy_class(observation_space, action_space, merged_config)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
    super(TracedEagerPolicy, self).__init__(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
    self.maybe_initialize_optimizer_and_loss()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
    self._initialize_loss_from_dummy_batch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
    actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
    return obj(self_, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
    return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
    ret = self._compute_actions_helper(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
    dist_inputs, state_out = self.model(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
    model_out, self._value_out = self.base_model(input_dict["obs_flat"])
  File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).

Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

Call arguments received by layer "fc_value_1" (type Dense):
  • inputs=tf.Tensor(shape=(32, 57), dtype=float32)

== Status ==
Current time: 2023-04-15 02:58:26 (running for 00:00:17.30)
Memory usage on this node: 5.3/76.7 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/45.86 GiB heap, 0.0/22.93 GiB objects
Result logdir: /home/ray/ray_results/TestPPORandomEnvRegAction/PPO
Number of trials: 1/1 (1 ERROR)
Number of errored trials: 1
+----------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 |   # failures | error file                                                                                                                            |
|----------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------|
| PPO_random_env_0621a_00000 |            1 | /home/ray/ray_results/TestPPORandomEnvRegAction/PPO/PPO_random_env_0621a_00000_0_training_iteration=526_2023-04-15_02-58-10/error.txt |
+----------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------+

== Status ==
Current time: 2023-04-15 02:58:26 (running for 00:00:17.30)
Memory usage on this node: 5.3/76.7 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/45.86 GiB heap, 0.0/22.93 GiB objects
Result logdir: /home/ray/ray_results/TestPPORandomEnvRegAction/PPO
Number of trials: 1/1 (1 ERROR)
Number of errored trials: 1
+----------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 |   # failures | error file                                                                                                                            |
|----------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------|
| PPO_random_env_0621a_00000 |            1 | /home/ray/ray_results/TestPPORandomEnvRegAction/PPO/PPO_random_env_0621a_00000_0_training_iteration=526_2023-04-15_02-58-10/error.txt |
+----------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------+

2023-04-15 02:58:26,046 ERROR ray_trial_executor.py:118 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/execution/ray_trial_executor.py", line 109, in _post_stop_cleanup
    ray.get(future, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2311, in get
    raise value
  File "python/ray/_raylet.pyx", line 1135, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 1045, in ray._raylet.execute_task_with_cancellation_handler
  File "python/ray/_raylet.pyx", line 782, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 945, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 599, in ray._raylet.store_task_errors
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=931, ip=10.138.0.31, repr=PPO)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 441, in __init__
    super().__init__(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 566, in setup
    self.workers = WorkerSet(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 191, in __init__
    raise e.args[0].args[2]
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
    self._build_policy_map(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
    self.policy_map.create_policy(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
    policy = create_policy_for_framework(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
    return policy_class(observation_space, action_space, merged_config)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
    super(TracedEagerPolicy, self).__init__(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
    self.maybe_initialize_optimizer_and_loss()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
    self._initialize_loss_from_dummy_batch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
    actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
    return obj(self_, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
    return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
    ret = self._compute_actions_helper(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
    dist_inputs, state_out = self.model(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
    res = self.forward(restored, state or [], seq_lens)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
    model_out, self._value_out = self.base_model(input_dict["obs_flat"])
  File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).

Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

Call arguments received by layer "fc_value_1" (type Dense):
  • inputs=tf.Tensor(shape=(32, 57), dtype=float32)

(PPO pid=931) 2023-04-15 02:58:26,020   ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=931, ip=10.138.0.31, repr=PPO)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 441, in __init__
(PPO pid=931)     super().__init__(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
(PPO pid=931)     self.setup(copy.deepcopy(self.config))
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 566, in setup
(PPO pid=931)     self.workers = WorkerSet(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 191, in __init__
(PPO pid=931)     raise e.args[0].args[2]
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 712, in __init__
(PPO pid=931)     self._build_policy_map(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1970, in _build_policy_map
(PPO pid=931)     self.policy_map.create_policy(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy_map.py", line 146, in create_policy
(PPO pid=931)     policy = create_policy_for_framework(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 123, in create_policy_for_framework
(PPO pid=931)     return policy_class(observation_space, action_space, merged_config)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 158, in __init__
(PPO pid=931)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 102, in __init__
(PPO pid=931)     self.maybe_initialize_optimizer_and_loss()
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 424, in maybe_initialize_optimizer_and_loss
(PPO pid=931)     self._initialize_loss_from_dummy_batch(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1261, in _initialize_loss_from_dummy_batch
(PPO pid=931)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 133, in _func
(PPO pid=931)     return obj(self_, *args, **kwargs)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 185, in compute_actions_from_input_dict
(PPO pid=931)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 461, in compute_actions_from_input_dict
(PPO pid=931)     ret = self._compute_actions_helper(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(PPO pid=931)     return func(self, *a, **k)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 818, in _compute_actions_helper
(PPO pid=931)     dist_inputs, state_out = self.model(
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/modelv2.py", line 259, in __call__
(PPO pid=931)     res = self.forward(restored, state or [], seq_lens)
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/models/tf/fcnet.py", line 148, in forward
(PPO pid=931)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
(PPO pid=931)     raise e.with_traceback(filtered_tb) from None
(PPO pid=931)   File "/home/ray/anaconda3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7164, in raise_from_not_ok_status
(PPO pid=931)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(PPO pid=931) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" (type Dense).
(PPO pid=931) 
(PPO pid=931) Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(PPO pid=931) 
(PPO pid=931) Call arguments received by layer "fc_value_1" (type Dense):
(PPO pid=931)   • inputs=tf.Tensor(shape=(32, 57), dtype=float32)
2023-04-15 02:58:26,148 ERROR tune.py:758 -- Trials did not complete: [PPO_random_env_0621a_00000]
2023-04-15 02:58:26,149 INFO tune.py:762 -- Total run time: 17.89 seconds (17.30 seconds for the tuning loop).

Versions / Dependencies

Ray 2.2.0 on Ubuntu 20.04

Reproduction script

The following YAML is used for the cluster on GCP:

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 0

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:2.2.0-py39-gpu"
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_nvidia_docker" # e.g. ray_docker

    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"

    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: <PROJECT_ID> # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_gpu:
        # The resources provided by this node type.
        resources: {"CPU": 12, "GPU": 1}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-custom-12-79872
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 2000
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu110
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-v100
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: TERMINATE
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@<PROJECT_ID>.iam.gserviceaccount.com
                scopes:
                - https://www.googleapis.com/auth/cloud-platform

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2, "GPU": 1}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu110
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-k80
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_gpu

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: 
  - python -m pip install pympler
  - python -m pip install gcsfs

    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --include-dashboard=true

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

I use the Ray Jobs API to submit the following job:

import gym
import numpy as np

import ray
from ray import air, tune
from ray.rllib.algorithms.ppo.ppo import PPOConfig
from ray.rllib.examples.env.random_env import RandomEnv
from ray.tune import register_env

observation_space = gym.spaces.Box(
    float("-inf"), float("inf"), (57,), np.float32)

action_space = gym.spaces.Box(-10, 10, shape=(20,), dtype=np.float32)

if __name__ == '__main__':
    ray.init(local_mode=False)

    run_config = (
        PPOConfig()
        .environment(
            env="random_env",
            env_config={
                "action_space": action_space,
                "observation_space": observation_space,            
            },    
        )
        .rollouts(
            num_rollout_workers=2,
            batch_mode="truncate_episodes",
            rollout_fragment_length=1000,
            soft_horizon=True,
            no_done_at_end=True, # GAE uses VF model
            observation_filter="MeanStdFilter",
        )
        .resources(
            num_cpus_per_worker=1,
            # Fractional GPUs: the trainer process gets 0.4 of a GPU,
            # each of the two rollout workers gets 0.3.
            num_gpus=0.4,
            num_gpus_per_worker=0.3,
        )
        .framework(
            framework="tf2",
            eager_tracing=True,
        )
        .training(
            train_batch_size=4000,
            # PPO specific.
            sgd_minibatch_size=4000,
            num_sgd_iter=1,
            kl_coeff=0.0,
            entropy_coeff=0.02,
            grad_clip=0.5,
            lr=5e-6,
            vf_clip_param=float("inf"),
            vf_loss_coeff=0.1,
            model={
                "vf_share_layers": False,
                "use_lstm": False,
                "max_seq_len": 20,
                "lstm_use_prev_action": False,
            },
        )
        .debugging(
            log_level="WARNING",
            # Log system resource usage.
            log_sys_usage=True,
            seed=42,
        )
        .evaluation(
            evaluation_num_workers=0,            
        )
    )

    def env_creator(config):
        env = RandomEnv(
            config=config,
        )
        return env    

    register_env("random_env", env_creator)

    stop = {
        "training_iteration": tune.grid_search([3]),
    }

    tuner = tune.Tuner(
        "PPO",
        param_space=run_config.to_dict(),
        run_config=air.RunConfig(
            stop=stop,
            verbose=1,
            checkpoint_config=air.CheckpointConfig(
                checkpoint_frequency=1,
                checkpoint_at_end=True,
            )
        )
    )
    tuner.fit()
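
For reference, a minimal sketch of how such a script can be submitted via the Jobs Python SDK (the script name "train_ppo.py" and the default dashboard address are illustrative placeholders, not my exact values):

from ray.job_submission import JobSubmissionClient

# Connect to the job submission server on the head node (default dashboard address).
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Command executed on the cluster.
    entrypoint="python train_ppo.py",
    # Ship the local directory containing the script to the cluster.
    runtime_env={"working_dir": "."},
)
print(job_id)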

Issue Severity

High: It blocks me from completing my task.

Alpha-Girl commented 1 year ago

I face it, too.

simonsays1980 commented 1 year ago

This error still persists in Ray 2.6.1, running on GCP.

num_gpus_per_worker=0: everything works; num_gpus_per_worker>0: raises the error.

:job_id:03000000
DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
:actor_name:RolloutWorker
2023-08-02 03:24:56,691 DEBUG rollout_worker.py:1761 -- Creating policy for default_policy
2023-08-02 03:24:56,692 DEBUG catalog.py:793 -- Created preprocessor <ray.rllib.models.preprocessors.NoPreprocessor object at 0x7ef964d04c40>: Box(-inf, inf, (22,), float32) -> (22,)
2023-08-02 03:24:58,530 INFO eager_tf_policy_v2.py:80 -- Creating TF-eager policy running on GPU.
2023-08-02 03:24:59,605 INFO policy.py:1294 -- Policy (worker=2) running on 0.2 GPUs.
2023-08-02 03:24:59,605 INFO eager_tf_policy_v2.py:99 -- Found 1 visible cuda devices.
2023-08-02 03:25:00,101 ERROR checker.py:262 -- Exception Exception encountered when calling layer 'dense' (type Dense).

{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

Call arguments received by layer 'dense' (type Dense):
  • inputs=tf.Tensor(shape=(32, 22), dtype=float32) raised on function call without checkin input specs. RLlib will now attempt to check the spec before calling the function again.
2023-08-02 03:25:00,102 ERROR checker.py:262 -- Exception Exception encountered when calling layer 'dense' (type Dense).

{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

Call arguments received by layer 'dense' (type Dense):
  • inputs=tf.Tensor(shape=(32, 22), dtype=float32) raised on function call without checkin input specs. RLlib will now attempt to check the spec before calling the function again.
2023-08-02 03:25:00,102 ERROR checker.py:262 -- Exception Exception encountered when calling layer 'dense' (type Dense).

{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

Call arguments received by layer 'dense' (type Dense):
  • inputs=tf.Tensor(shape=(32, 22), dtype=float32) raised on function call without checkin input specs. RLlib will now attempt to check the spec before calling the function again.
Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=7708, ip=10.138.0.37, actor_id=8a5109e144dfac7424f941f503000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7ef965746d90>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in __init__
    self._update_policy_map(policy_dict=self.policy_dict)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
    self._build_policy_map(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
    new_policy = create_policy_for_framework(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 139, in create_policy_for_framework
    return policy_class(observation_space, action_space, merged_config)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 168, in __init__
    super(TracedEagerPolicy, self).__init__(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 100, in __init__
    self.maybe_initialize_optimizer_and_loss()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 487, in maybe_initialize_optimizer_and_loss
    self._initialize_loss_from_dummy_batch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1418, in _initialize_loss_from_dummy_batch
    actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 143, in _func
    return obj(self_, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 219, in compute_actions_from_input_dict
    return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 536, in compute_actions_from_input_dict
    ret = self._compute_actions_helper_rl_module_explore(input_dict)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 901, in _compute_actions_helper_rl_module_explore
    fwd_out = self.model.forward_exploration(input_dict)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 296, in wrapper
    raise initial_exception
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 255, in wrapper
    return func(self, input_data, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 359, in wrapper
    output_data = func(self, input_data, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/rl_module/rl_module.py", line 571, in forward_exploration
    return self._forward_exploration(batch, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/tf/ppo_tf_rl_module.py", line 44, in _forward_exploration
    encoder_outs = self.encoder(batch)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 296, in wrapper
    raise initial_exception
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 255, in wrapper
    return func(self, input_data, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/base.py", line 79, in call
    return self._forward(input_dict, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/base.py", line 343, in _forward
    actor_out = self.actor_encoder(inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 296, in wrapper
    raise initial_exception
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 255, in wrapper
    return func(self, input_data, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/base.py", line 79, in call
    return self._forward(input_dict, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/encoder.py", line 160, in _forward
    return {ENCODER_OUT: self.net(inputs[SampleBatch.OBS])}
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/primitives.py", line 98, in call
    return self.network(inputs)
tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer 'dense' (type Dense).

{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]

Call arguments received by layer 'dense' (type Dense):
  • inputs=tf.Tensor(shape=(32, 22), dtype=float32)

The same error shows up in this discuss.io thread.

Could this be a compatibility problem between the TensorFlow version (2.11.0) and the CUDA version (11.8)? Following the guidelines here, it should be either TF v2.11.0 with CUDA v11.2 or TF v2.12.0 with CUDA v11.8. The same holds for the cuDNN versions.
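
As a quick sanity check (a minimal sketch; the build-info keys below exist in recent TF 2.x releases), one can print the CUDA/cuDNN versions the installed TF wheel was built against and compare them with what is on the node:

import tensorflow as tf

# CUDA/cuDNN versions this TF wheel was compiled against.
info = tf.sysconfig.get_build_info()
print(info.get("cuda_version"), info.get("cudnn_version"))
# GPUs TF actually sees at runtime.
print(tf.config.list_physical_devices("GPU"))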

GPU memory should not be the problem: each environment step returns a tensor of shape (1, 22) and num_envs_per_worker=4. The Tesla V100 has 16384 MiB of VRAM, and I use the standard (default) model for PPO.

simonsays1980 commented 1 year ago

The same happens with num_gpus=0 and a fractional value for num_gpus_per_learner_worker (a value of 1 works). This occurs on GCP with the cluster.yaml described above. The CUDA version is 11.8 and the TF version is 2.11.0, which is not a recommended combination. Installing TF v2.13.0 gives an error:

AttributeError: module 'tensorflow.python.framework.type_spec' has no attribute '_NAME_TO_TYPE_SPEC'

So I installed TF v2.12.0, which in turn has problems with protobuf:

Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

I installed protobuf==3.20.3, which however produces the same BLAS error as above. So I don't believe that CUDA-TF incompatibility is the cause.

What I observe with nvidia-smi is that the GPU memory fills up immediately when fractional GPUs are used for the learner worker. I always understood that this is avoided by gpu_options.allow_growth=True in the tf_session_args set in the default algorithm config. Could it be that this does not take effect on the GCP machines created by the cluster.yaml above (taken from the Ray Autoscaler examples)? A minimal sketch of setting memory growth explicitly follows below.
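
(Minimal sketch of what I mean, using the standard TF 2.x API directly instead of relying on RLlib's tf_session_args; it has to run before the first GPU op in each worker process:)

import os
import tensorflow as tf

# Ask the TF allocator to grow GPU memory on demand instead of
# reserving (almost) all VRAM at initialization time.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)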

@avnishn @sven1977 Do you have an idea what could cause this behavior?

lyzyn commented 11 months ago

I noticed that you have also encountered this issue. Have you resolved it?

(PPO pid=1954927) 2023-08-17 11:19:05,083 WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given config_dict! Property stdout_file not supported.

shengchao-y commented 11 months ago

@simonsays1980 Have you solved this problem? I am facing exactly the same issue.

Although I have num_gpus_per_worker=0.5, the first worker on each GPU always allocates almost the whole GPU memory (e.g. 11.6 GB). Consequently, the subsequent worker allocated to the same GPU encounters a memory availability issue; sometimes it successfully initializes with the remaining 0.6 GB, and other times it fails with the "Attempting to perform BLAS..." error.
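
A possible workaround sketch (assuming a hard per-worker memory cap is acceptable; the 4096 MiB value is illustrative, not tested):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap this worker process at ~4 GiB so several fractional workers
    # can coexist on one 16 GiB V100 (illustrative value).
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )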