I'm running your code inside the Docker container built by run.sh and the Dockerfile. The GPU I'm using is a Tesla A100, which has compute capability sm_80.
When I run the training code, I get the following error:
I0123 06:16:37.701375 139677848946496 run_atari.py:97] Rainbow on Atari on gpu.
2022-01-23 06:16:37.706684: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:97] Unknown compute capability (8, 0) .Defaulting to telling LLVM that we're compiling for sm_75
2022-01-23 06:16:37.736490: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:419] ptxas returned an error during compilation of ptx to sass: 'Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_80' is not defined for option 'gpu-name'
' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Fatal Python error: Aborted
Current thread 0x00007f094891c740 (most recent call first):
File "/usr/local/lib/python3.6/dist-packages/jax/interpreters/xla.py", line 268 in xla_primitive_callable
File "/usr/local/lib/python3.6/dist-packages/jax/interpreters/xla.py", line 228 in apply_primitive
File "/usr/local/lib/python3.6/dist-packages/jax/core.py", line 273 in bind
File "/usr/local/lib/python3.6/dist-packages/jax/lax/lax.py", line 342 in shift_right_logical
File "/usr/local/lib/python3.6/dist-packages/jax/random.py", line 87 in PRNGKey
File "/global_fs/dqn_zoo/dqn_zoo/rainbow/run_atari.py", line 100 in main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250 in _run_main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299 in run
File "/global_fs/dqn_zoo/dqn_zoo/rainbow/run_atari.py", line 280 in <module>
File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
Aborted
I guess the problem is caused by an incompatibility between the A100's sm_80 and CUDA 10.1. However, I'm only familiar with PyTorch and completely new to JAX and TensorFlow. Could you tell me which package versions in the Dockerfile and docker_requirements.txt should be changed so that I can run your code on an A100? Thanks!
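For reference, here is the kind of change I imagine would be needed — this is only a sketch based on my guess about the problem; the exact base-image tag and jaxlib wheel spec are assumptions on my part, not taken from your Dockerfile:

```shell
# Hypothetical changes (names/tags are my assumptions, not from the repo):
#
# 1. In the Dockerfile, switch the base image from a CUDA 10.1 image to a
#    CUDA 11.x one that supports sm_80, e.g.:
#    FROM nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
#
# 2. Install a jaxlib build that matches the new CUDA version, instead of
#    the CUDA 10.1 build pinned in docker_requirements.txt, e.g.:
pip install --upgrade "jax[cuda111]" \
    -f https://storage.googleapis.com/jax-releases/jax_releases.html
```

I'm not sure whether other pinned packages (e.g. the TensorFlow version) would also need to move in lockstep, which is why I'm asking.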