google-deepmind / dqn_zoo

DQN Zoo is a collection of reference implementations of reinforcement learning agents developed at DeepMind based on the Deep Q-Network (DQN) agent.
Apache License 2.0

Value 'sm_80' is not defined for option 'gpu-name' #14

Closed by yueyang130 2 years ago

yueyang130 commented 2 years ago

I'm running your code in the Docker image built by run.sh and the Dockerfile. My GPU is an NVIDIA A100, which has compute capability 8.0 (sm_80).

When I run the training code, I get the following error:

I0123 06:16:37.701375 139677848946496 run_atari.py:97] Rainbow on Atari on gpu.
2022-01-23 06:16:37.706684: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:97] Unknown compute capability (8, 0) .Defaulting to telling LLVM that we're compiling for sm_75
2022-01-23 06:16:37.736490: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:419] ptxas returned an error during compilation of ptx to sass: 'Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_80' is not defined for option 'gpu-name'
'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Fatal Python error: Aborted

Current thread 0x00007f094891c740 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/jax/interpreters/xla.py", line 268 in xla_primitive_callable
  File "/usr/local/lib/python3.6/dist-packages/jax/interpreters/xla.py", line 228 in apply_primitive
  File "/usr/local/lib/python3.6/dist-packages/jax/core.py", line 273 in bind
  File "/usr/local/lib/python3.6/dist-packages/jax/lax/lax.py", line 342 in shift_right_logical
  File "/usr/local/lib/python3.6/dist-packages/jax/random.py", line 87 in PRNGKey
  File "/global_fs/dqn_zoo/dqn_zoo/rainbow/run_atari.py", line 100 in main
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250 in _run_main
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299 in run
  File "/global_fs/dqn_zoo/dqn_zoo/rainbow/run_atari.py", line 280 in <module>
  File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
  File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
Aborted

I guess the problem is caused by an incompatibility between the A100 (sm_80) and CUDA 10.1. However, I am only familiar with PyTorch and completely new to JAX and TensorFlow. Can you tell me which package versions in the Dockerfile and docker_requirements.txt should be changed so that I can run your code on an A100?

Thanks!
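
The suspected mismatch between CUDA 10.1 and sm_80 in the question above can be sketched as a lookup of the earliest CUDA toolkit release whose ptxas accepts a given architecture. This is a minimal illustration, not part of dqn_zoo or JAX: the table is a small, non-exhaustive subset based on NVIDIA's CUDA release history, and the names `MIN_CUDA_FOR_ARCH`, `parse_version` and `toolkit_supports` are hypothetical.

```python
# Earliest CUDA toolkit release that introduced support for each architecture.
# Assumption: a hand-written, non-exhaustive subset for illustration only.
MIN_CUDA_FOR_ARCH = {
    "sm_70": (9, 0),   # Volta
    "sm_75": (10, 0),  # Turing
    "sm_80": (11, 0),  # Ampere (A100)
    "sm_86": (11, 1),  # Ampere (consumer parts)
}


def parse_version(text):
    """Turn a version string such as '11.1.1' into a (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def toolkit_supports(arch, cuda_version):
    """Return True if the given CUDA toolkit version is new enough for arch."""
    required = MIN_CUDA_FOR_ARCH.get(arch)
    if required is None:
        raise KeyError(f"architecture {arch!r} is not in the illustrative table")
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    print(toolkit_supports("sm_80", "10.1"))    # False: ptxas rejects sm_80
    print(toolkit_supports("sm_80", "11.1.1"))  # True
```

Under these assumptions, a CUDA 10.1 toolkit fails the check for sm_80, which matches the ptxas message "Value 'sm_80' is not defined for option 'gpu-name'" in the log.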

yueyang130 commented 2 years ago

I solved the problem by switching the Dockerfile to a base image with a newer CUDA version:


FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
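
As a quick sanity check on a base image line like the one above, the CUDA version can be read from the image tag and compared against the release an A100 needs. This is a rough sketch: it assumes tags of the form `nvidia/cuda:<major>.<minor>[.<patch>]-...` and assumes CUDA 11.0 as the minimum for sm_80. The helper name `cuda_version_from_dockerfile_line` is hypothetical.

```python
import re

# Matches e.g. "FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04".
# Assumption: the tag starts with a dotted CUDA version followed by "-".
_CUDA_TAG = re.compile(r"^FROM\s+nvidia/cuda:(\d+)\.(\d+)(?:\.(\d+))?-")

# Assumption: sm_80 (A100) needs CUDA 11.0 or newer.
MIN_CUDA_FOR_SM_80 = (11, 0)


def cuda_version_from_dockerfile_line(line):
    """Return the CUDA version in a FROM line as a tuple, or None if absent."""
    match = _CUDA_TAG.match(line.strip())
    if match is None:
        return None
    return tuple(int(group) for group in match.groups() if group is not None)


if __name__ == "__main__":
    line = "FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04"
    version = cuda_version_from_dockerfile_line(line)
    print(version, version >= MIN_CUDA_FOR_SM_80)  # (11, 1, 1) True
```

The check only looks at the base image tag. It says nothing about whether the Python packages installed in the image were built for the same CUDA version.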