Open trevormcinroe opened 2 years ago
Hi, several observations from your log:
Some CUDA libraries were not loaded successfully:
2022-08-18 18:37:54.868858: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/cv2/../../lib64::/home/lukas/.mujoco/mujoco200/bin:/home/lukas/.mujoco/mujoco210/bin:/usr/lib/nvidia
2022-08-18 18:37:54.869412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-08-18 18:37:54.869517: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-08-18 18:37:54.869527: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I did not have this issue.
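To see at a glance which libraries TensorFlow failed to open, the `dso_loader` warnings in a log like the one above can be filtered. This is a minimal sketch that assumes the warning format shown in the log; the function name `missing_cuda_libraries` is hypothetical and not part of TensorFlow or the tia codebase.

```python
import re

# Matches the warning emitted by dso_loader.cc when a library cannot be opened.
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_cuda_libraries(log_text):
    """Return the library names the log reports as not loadable.

    Names are returned in order of first appearance, without duplicates.
    """
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen
```

For the excerpt quoted above, the only match is libcusolver.so.10. The other two dso_loader lines report successful loads and are ignored.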
"Prefill dataset with 0 steps." is strange. Had you already run the code before, so that 5000 steps of data were already collected in the target log directory?
Found 253201 disen_reward parameters.
There should be several other lines of output reporting the parameter counts of the other model components.
You are using a different version of tensorflow and tensorflow-probability. I also don't see tensorflow-gpu.
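One way to compare environments is to print the installed versions of the three packages mentioned here. The sketch below uses only the standard library and assumes Python 3.8 or newer, where `importlib.metadata` is available; on Python 3.7, which the environment path in the log suggests, `pip freeze` would give the same information. The function name `installed_versions` is hypothetical.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = ["tensorflow", "tensorflow-gpu", "tensorflow-probability"]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

A `None` entry for tensorflow-gpu would match the observation that the package does not appear in the list.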
Hi kyonofx,
Thank you for getting back to us so quickly.
One quick question. What version of CUDA are you running on your system?
The codebase's README suggests using tensorflow-gpu==2.3.1. According to this table, this version of tensorflow-gpu requires CUDA < 11. Unfortunately, we cannot downgrade the CUDA version on our GPUs. We might have versioning issues with tensorflow-gpu and tensorflow_probability.
I used CUDA 10.1. It is possible to have multiple CUDA versions on the same server machine: you can install CUDA 10.1 alongside the existing version as long as you have sudo permission. Maybe this would be the main cause?
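With several CUDA versions installed side by side, it can help to check which directories on LD_LIBRARY_PATH actually contain the library TensorFlow asks for. The sketch below is an illustration under assumptions: `find_library` is a hypothetical helper, it inspects only LD_LIBRARY_PATH (the dynamic loader also consults the system cache and default directories), and any CUDA 10.1 location such as /usr/local/cuda-10.1/lib64 is an assumed install path, not something confirmed in this thread.

```python
import os


def find_library(filename, search_path=None):
    """Return the directories on search_path that contain filename.

    search_path uses the same separator-delimited format as LD_LIBRARY_PATH
    and defaults to the current value of that variable.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries (produced by "::" in the path) are skipped here.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # libcusolver.so.10 is the library the log reports as missing.
    print(find_library("libcusolver.so.10"))
```

An empty result means that no directory on LD_LIBRARY_PATH provides the file, which is consistent with the dlerror message in the log.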
Hi,
The provided code errors out with the following:
python run.py --method tia --configs dmc --task dmc_cartpole_swingup_none --logdir ./
I believe the problem below occurs in the call to self.train(next(self._dataset)) during the initialization of the SeparationDreamer class. This is found on line 497 here: https://github.com/kyonofx/tia/blob/main/Dreamer/dreamers.py#L497
Below is the full error:
We attempted to run the above script on two different systems, but both gave the same error. For reference, one system is running CUDA 11.6 and the other CUDA 11.4.
Also, below is the package list we are using: