simonsays1980 opened this issue 1 year ago
I face it, too.
This error still persists in 2.6.1, running on GCP:
num_gpus_per_worker=0: everything works
num_gpus_per_worker>0: raises the error
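For reference, a minimal sketch of where this setting lives in the config (not my exact script; the environment name is a placeholder for my custom env with a Box(-inf, inf, (22,), float32) observation space, and 0.2 matches the "running on 0.2 GPUs" line in the log below):

```python
# Minimal sketch, not the exact reproduction script: only the GPU fractioning
# part matters here. "CartPole-v1" is a placeholder for the custom environment.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="CartPole-v1")      # placeholder
    .rollouts(num_rollout_workers=2)
    .resources(
        num_gpus=0.2,                    # fractional GPU for the trainer/learner
        num_gpus_per_worker=0.2,         # > 0 reproduces the BLAS error below
    )
)
algo = config.build()
```

The full error output from the failing case: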
:job_id:03000000
DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
:actor_name:RolloutWorker
2023-08-02 03:24:56,691 DEBUG rollout_worker.py:1761 -- Creating policy for default_policy
2023-08-02 03:24:56,692 DEBUG catalog.py:793 -- Created preprocessor <ray.rllib.models.preprocessors.NoPreprocessor object at 0x7ef964d04c40>: Box(-inf, inf, (22,), float32) -> (22,)
2023-08-02 03:24:58,530 INFO eager_tf_policy_v2.py:80 -- Creating TF-eager policy running on GPU.
2023-08-02 03:24:59,605 INFO policy.py:1294 -- Policy (worker=2) running on 0.2 GPUs.
2023-08-02 03:24:59,605 INFO eager_tf_policy_v2.py:99 -- Found 1 visible cuda devices.
2023-08-02 03:25:00,101 ERROR checker.py:262 -- Exception Exception encountered when calling layer 'dense' (type Dense).
{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
Call arguments received by layer 'dense' (type Dense):
• inputs=tf.Tensor(shape=(32, 22), dtype=float32) raised on function call without checkin input specs. RLlib will now attempt to check the spec before calling the function again.
2023-08-02 03:25:00,102 ERROR checker.py:262 -- Exception Exception encountered when calling layer 'dense' (type Dense).
{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
Call arguments received by layer 'dense' (type Dense):
• inputs=tf.Tensor(shape=(32, 22), dtype=float32) raised on function call without checkin input specs. RLlib will now attempt to check the spec before calling the function again.
2023-08-02 03:25:00,102 ERROR checker.py:262 -- Exception Exception encountered when calling layer 'dense' (type Dense).
{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
Call arguments received by layer 'dense' (type Dense):
• inputs=tf.Tensor(shape=(32, 22), dtype=float32) raised on function call without checkin input specs. RLlib will now attempt to check the spec before calling the function again.
Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=7708, ip=10.138.0.37, actor_id=8a5109e144dfac7424f941f503000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7ef965746d90>)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in __init__
self._update_policy_map(policy_dict=self.policy_dict)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
self._build_policy_map(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
new_policy = create_policy_for_framework(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/policy.py", line 139, in create_policy_for_framework
return policy_class(observation_space, action_space, merged_config)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 168, in __init__
super(TracedEagerPolicy, self).__init__(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/ppo_tf_policy.py", line 100, in __init__
self.maybe_initialize_optimizer_and_loss()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 487, in maybe_initialize_optimizer_and_loss
self._initialize_loss_from_dummy_batch(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/policy.py", line 1418, in _initialize_loss_from_dummy_batch
actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 143, in _func
return obj(self_, *args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py", line 219, in compute_actions_from_input_dict
return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 536, in compute_actions_from_input_dict
ret = self._compute_actions_helper_rl_module_explore(input_dict)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 901, in _compute_actions_helper_rl_module_explore
fwd_out = self.model.forward_exploration(input_dict)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 296, in wrapper
raise initial_exception
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 255, in wrapper
return func(self, input_data, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 359, in wrapper
output_data = func(self, input_data, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/rl_module/rl_module.py", line 571, in forward_exploration
return self._forward_exploration(batch, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/algorithms/ppo/tf/ppo_tf_rl_module.py", line 44, in _forward_exploration
encoder_outs = self.encoder(batch)
File "/home/ray/anaconda3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 296, in wrapper
raise initial_exception
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 255, in wrapper
return func(self, input_data, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/base.py", line 79, in call
return self._forward(input_dict, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/base.py", line 343, in _forward
actor_out = self.actor_encoder(inputs, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 296, in wrapper
raise initial_exception
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/specs/checker.py", line 255, in wrapper
return func(self, input_data, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/base.py", line 79, in call
return self._forward(input_dict, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/encoder.py", line 160, in _forward
return {ENCODER_OUT: self.net(inputs[SampleBatch.OBS])}
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/rllib/core/models/tf/primitives.py", line 98, in call
return self.network(inputs)
tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer 'dense' (type Dense).
{{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
Call arguments received by layer 'dense' (type Dense):
• inputs=tf.Tensor(shape=(32, 22), dtype=float32)
The same error shows up in this discuss.io thread.
Could this be a problem with the TensorFlow version (2.11.0) and the CUDA version (11.8)? Following the guidelines here, it should be either TF v2.11.0 with CUDA v11.2 or TF v2.12.0 with CUDA v11.8; the same holds for the cuDNN versions.
GPU memory should not be the problem: the environment step returns a tensor of shape (1, 22), num_envs_per_worker=4, and the Tesla V100 has 16384 MiB of memory. I use the standard model for PPO.
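To double-check the build compatibility, one can inspect which CUDA/cuDNN versions the installed TF wheel was compiled against (a quick sanity-check sketch):

```python
# Sanity check: which CUDA/cuDNN versions was this TensorFlow wheel built
# against, and does it see the GPU at all?
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print(tf.__version__)
print(build.get("cuda_version"), build.get("cudnn_version"))
print(tf.config.list_physical_devices("GPU"))
```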
The same happens with num_gpus=0 and a fractional value for num_gpus_per_learner_worker (using 1 works). This happens on GCP with the cluster.yaml described above. The CUDA version is v11.8 and the TF version is 2.11.0, which is not a suggested combination. I installed TF v2.13.0, which gives an error:
AttributeError: module 'tensorflow.python.framework.type_spec' has no attribute '_NAME_TO_TYPE_SPEC'
So I installed TF v2.12.0. That has problems with protobuf:
Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
I installed protobuf==3.20.3. However, this gives the same error as above, so I don't believe that CUDA-TF compatibility is the reason.
What I observe with nvidia-smi is that the GPU memory fills up immediately when fractional GPUs are used for the learner worker. I always understood that this is avoided by gpu_options.allow_growth=True in the tf_session_args that are set in the default algorithm_config. Could it be that this is not taking effect on the GCP machine created by the cluster.yaml from above (the example from the ray autoscaler)?
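As a workaround attempt (just a sketch, under the assumption that the session-level allow_growth from tf_session_args only affects graph-mode sessions and not the eager/TF2 code path), memory growth could be forced explicitly before TF touches the GPU in each worker process:

```python
# Workaround sketch (assumption: tf_session_args do not apply to the eager/TF2
# path used by the new RLModule stack). Must run before TensorFlow initializes
# the GPUs in the worker process.
import os
import tensorflow as tf

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"   # env-var alternative
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```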
@avnishn @sven1977 Do you have an idea what could cause this behavior?
I noticed that you have also encountered this issue. Have you resolved it?
(PPO pid=1954927) 2023-08-17 11:19:05,083 WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given config_dict! Property stdout_file not supported.
@simonsays1980 Have you solved this problem? I am facing exactly the same issue.
Although I have num_gpus_per_worker=0.5, the first worker on each GPU always allocates almost the whole GPU memory (e.g. 11.6 GB). Consequently, the subsequent worker allocated to the same GPU encounters a memory availability issue; sometimes it successfully initializes with the remaining 0.6 GB, and other times it fails with the "Attempting to perform BLAS..." error.
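One mitigation that might be worth trying (a sketch, not verified on this setup) is to hard-cap the memory each worker process may claim on its visible GPU, so the first 0.5-GPU worker cannot grab the whole card:

```python
# Sketch of a per-process GPU memory cap; must run before TF initializes the
# GPU in that worker process. The 6 GB limit is an arbitrary example value.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=6 * 1024)],  # in MB
    )
```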
What happened + What you expected to happen
What happened
I start a GPU cluster (see the script below) via the Autoscaler on GCP and then run a simple script using RLlib (and Tune). I get an error that basically tells me that TensorFlow is "Attempting to perform BLAS operation using StreamExecutor without BLAS support"; see below for the complete error output. The usual reason for this is that GPU memory growth is not enabled, but memory growth is enabled by default in RLlib. Also, I use the default model in RLlib, which should not be so large that it occupies the whole VRAM (though the trainer process does claim a lot of it). The script runs if I do not fraction the GPU.
Maybe the Ray development team could share some experience regarding GPU fractioning and the VRAM footprint of models (or best practices).
What I expected to happen
That the example script runs through seamlessly with this small model and only two workers on an NVIDIA Tesla V100 with around 16 GB of VRAM.
Here is the complete error message:
Versions / Dependencies
Ray 2.2.0, Ubuntu 20.04
Reproduction script
The following YAML is used for the cluster on GCP:
I use the Ray Jobs API to submit the following job:
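A simplified sketch of that job (not the original script verbatim; the environment name is a placeholder for the custom env with a 22-dimensional observation space, and the fractional GPU settings are the ones that trigger the error):

```python
# Simplified sketch of the submitted job, not the original script verbatim.
# "CartPole-v1" stands in for the custom environment.
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="CartPole-v1")                        # placeholder env
    .rollouts(num_rollout_workers=2, num_envs_per_worker=4)
    .resources(num_gpus=0.2, num_gpus_per_worker=0.2)      # fractional GPUs
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 10}),
)
tuner.fit()
```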
Issue Severity
High: It blocks me from completing my task.