Error when running `python og_marl/tf2/systems/iql_cql.py`

ZijunSong commented 1 month ago

Hello, I am very interested in your outstanding work, but I encountered a minor issue while attempting to reproduce it. I followed the steps in your README to set up the environment as follows:

git clone https://github.com/instadeepai/og-marl.git 
pip install -r requirements.txt 
pip install -e . 
bash install_environments/smacv1.sh 
pip install -r install_environments/requirements/smacv1.txt

All configurations were completed successfully. However, when I tried to run

python og_marl/tf2/systems/iql_cql.py task.source=og_marl task.env=smac_v1 task.scenario=3m task.dataset=Good

I received the following error:

2024-10-29 13:56:20.393968: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-29 13:56:20.433449: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-29 13:56:20.433491: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-29 13:56:20.434731: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-29 13:56:20.441134: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-29 13:56:21.266027: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-29 13:56:22.862481: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-29 13:56:22.906527: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-29 13:56:22.906887: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
{'system_name': 'iql+cql', 'seed': 42, 'wandb_project': 'og-marl', 'training_steps': 100000.0, 'task': {'source': 'og_marl', 'env': 'smac_v1', 'scenario': '3m', 'dataset': 'Good'}, 'replay': {'sequence_length': 20, 'sample_period': 1}, 'system': {'learning_rate': 0.0003, 'linear_layer_dim': 64, 'recurrent_layer_dim': 64, 'discount': 0.99, 'target_update_period': 200, 'add_agent_id_to_obs': True, 'cql_weight': 3.0}}
/root/miniconda3/envs/ogmarl/lib/python3.10/site-packages/flashbax/buffers/trajectory_buffer.py:473: UserWarning: Setting max_size dynamically sets the `max_length_time_axis` to be `max_size`//`add_batch_size = 50000`.This allows one to control exactly how many timesteps are stored in the buffer.Note that this overrides the `max_length_time_axis` argument.
  warnings.warn(
/root/miniconda3/envs/ogmarl/lib/python3.10/site-packages/flashbax/buffers/trajectory_buffer.py:498: UserWarning: `sample_sequence_length` greater than `min_length_time_axis`, therefore overriding `min_length_time_axis`to be set to `sample_sequence_length`, as we need at least `sample_sequence_length` timesteps added to the buffer before we can sample.
  warnings.warn(
[2024-10-29 13:56:23,154][jax._src.xla_bridge][INFO] - Unable to initialize backend 'cuda': 
[2024-10-29 13:56:23,154][jax._src.xla_bridge][INFO] - Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2024-10-29 13:56:23,155][jax._src.xla_bridge][INFO] - Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
[2024-10-29 13:56:23,156][jax._src.xla_bridge][WARNING] - An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Dataset from https://huggingface.co/datasets/InstaDeepAI/og-marl/resolve/main/core/smac_v1/3m.zip could not be downloaded. Try entering a different URL, or removing the part which auto-downloads.
Error executing job with overrides: ['task.source=og_marl', 'task.env=smac_v1', 'task.scenario=3m', 'task.dataset=Good']
Traceback (most recent call last):
  File "/root/autodl-tmp/og-marl/og_marl/tf2/systems/iql_cql.py", line 273, in run_experiment
    buffer.populate_from_vault(cfg["task"]["source"], cfg["task"]["env"], cfg["task"]["scenario"], cfg["task"]["dataset"])
  File "/root/autodl-tmp/og-marl/og_marl/replay_buffers.py", line 99, in populate_from_vault
    self._buffer_state = Vault(
  File "/root/miniconda3/envs/ogmarl/lib/python3.10/site-packages/flashbax/vault/vault.py", line 169, in __init__
    raise ValueError(
ValueError: Vault does not exist and no experience_structure was provided.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I suspect the issue may stem from my use of CUDA 12.1. When I attempted to update certain packages, for example with

pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

I encountered environment conflicts. Therefore, I am reaching out to kindly ask for your guidance in resolving this issue. Thank you very much!

jcformanek commented 1 month ago

Hi there, thank you so much for reaching out. I am happy to help. I think the problem is actually that the dataset was downloaded to the wrong directory. This is the important part of the error:

ValueError: Vault does not exist and no experience_structure was provided.

You should have a directory called vaults/og_marl/smac_v1/3m.vlt/Good/.

Do you have such a directory? Take careful note of the .vlt after 3m. This directory should have been downloaded when you ran the script. But maybe the script has a bug; I will check.
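
A quick way to check for that directory is sketched below. It is only an illustration: the helper name `vault_dir` is hypothetical, the layout `vaults/<source>/<env>/<scenario>.vlt/<dataset>` is inferred from the single example path above, and the `vaults` folder is assumed to be relative to the directory you launch the training script from.

```python
from pathlib import Path


def vault_dir(source: str, env: str, scenario: str, dataset: str,
              base: str = "vaults") -> Path:
    """Build the expected dataset directory, e.g. vaults/og_marl/smac_v1/3m.vlt/Good.

    Hypothetical helper: the layout is inferred from the example path in this thread.
    """
    # Note the ".vlt" suffix on the scenario component.
    return Path(base) / source / env / f"{scenario}.vlt" / dataset


if __name__ == "__main__":
    # Same values as the task.* overrides passed on the command line.
    expected = vault_dir("og_marl", "smac_v1", "3m", "Good")
    if expected.is_dir():
        print(f"Found vault directory: {expected.resolve()}")
    else:
        print(f"Missing vault directory: {expected.resolve()}")
```

Run it from the same working directory as the training command, so the relative `vaults` path resolves the same way.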

jcformanek commented 1 month ago

As an aside, I recommend not installing the "cuda" version of JAX. We only use JAX for the replay buffer, so the CPU version is fine. Instead, install the "cuda" version of TensorFlow (which is what we have in the requirements file). Having the "cuda" version of both JAX and TensorFlow is possible but can easily result in dependency conflicts.
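
To see which versions of these packages are installed in the active environment, something like the following may help. It is a minimal sketch that uses only the standard library; the function name `installed_versions` is hypothetical, and `jax`, `jaxlib`, and `tensorflow` are the PyPI distribution names. It reports versions only and does not tell you whether a build is CUDA-enabled.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["jax", "jaxlib", "tensorflow"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing this output before and after an upgrade attempt shows which packages pip actually changed.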

ZijunSong commented 1 month ago

Thank you so much for your help! Your response resolved a major issue for me. When I ran examples/download_dataset.py, I was able to download the data, and the problem was completely resolved. Thanks again for your assistance!

jcformanek commented 1 month ago

That's great! Don't hesitate to ask any further questions! I am happy to help!

instadeepai / og-marl, issue #49