Open ygtxr1997 opened 3 weeks ago
Does this help? The GPUs are 4 A6000s. I think the limiting factor is often the data loading, since CALVIN stores every state as a separate file instead of trajectories. I took over their HULC code for that. They also offer SHM options, but those did not work for me.
Also please note that the model is tested with 1k rollouts after epochs 30, 35 and 40. Thus, there are some gaps in the later stages.
Let me know if you have more questions!
Thank you so much! The image clearly shows the GPU utilization during training, demonstrating that there truly is some bottleneck, potentially caused by the data loading process. I will check it. If I find any solution to the issue, I will update here.
Keep me updated. Recently, two other labs provided different implementations of CALVIN training that could help: https://github.com/bytedance/GR-MG/tree/main and https://github.com/nickgkan/3d_diffuser_actor
Maybe these are better in terms of speed. Sadly I had no time to try them out myself yet.
The information you provided helps a lot. I'll take these repos into account.
I think I found some issues related to the low GPU utilization problem.
As shown in the figure above, to load the 'actions' for steps 1 to 10, the current code has to load the full episode_xxx.npz files. That means for each step the machine needs to load 1126.4 MB of data. If your data is stored on a remote server, this requires ~10 Gbps of bandwidth. Sadly, our network bandwidth is only ~2 Gbps (200 MB/s). Therefore, the code runs very slowly for us, about 3 hours for the first epoch.
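For reference, the bandwidth arithmetic above can be checked in a few lines (the assumption of one optimization step per second is mine; adjust for your actual step time):

```python
# Rough bandwidth estimate for per-step full-npz loading, assuming
# roughly one optimization step per second (my assumption).
data_per_step_mb = 1126.4                      # full episode_xxx.npz data read per step
required_gbps = data_per_step_mb * 8 / 1000    # MB/s -> Gbit/s
available_gbps = 200 * 8 / 1000                # ~200 MB/s remote storage

print(f"required ~{required_gbps:.1f} Gbps, available ~{available_gbps:.1f} Gbps")
# With these numbers: required ~9.0 Gbps vs available ~1.6 Gbps,
# i.e. roughly the ~10 Gbps vs ~2 Gbps gap described above.
```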
To solve this, I pre-read the actions data from the episode_xxx.npz files and save it into a single .npy file (~30 MB). This boosts the data loading speed in my case, as shown below.
But if your data is stored on a local SSD, I think this issue would be easily alleviated.
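Until the actual code is shared, here is a minimal sketch of the idea as I understand it. The function name, file layout (`episode_*.npz` with an `actions` key), and the memory-mapped read are assumptions based on the CALVIN episode format; the real implementation may differ:

```python
import glob
import os

import numpy as np


def extract_actions(dataset_dir: str, out_path: str) -> None:
    """Read only the small 'actions' array from each episode_xxx.npz and
    stack everything into one .npy file, so training no longer has to
    touch the image-heavy episode files just to fetch actions.

    Assumes the CALVIN layout: one episode_*.npz per step, each holding
    an 'actions' array of shape (action_dim,).
    """
    files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.npz")))
    actions = [np.load(f)["actions"] for f in files]  # one array per step
    np.save(out_path, np.stack(actions))              # shape: (num_steps, action_dim)


# At training time, the consolidated file can be memory-mapped once:
#   all_actions = np.load("all_actions.npy", mmap_mode="r")
#   a_t = all_actions[t]   # cheap per-step lookup, no npz decompression
```

Memory-mapping (`mmap_mode="r"`) keeps the per-worker memory footprint small while still avoiding the per-step npz reads.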
Amazing work! Can you share the code on how to implement that? Lowering the bandwidth requirement sounds good for every cluster.
Yes, I would like to share my code for extracting the actions data into a single *.npy file and reading the action data from it. But it will take a few days to reorganize my code. Maybe I will make a pull request in a few days.
Sounds good, looking forward to it!
I’m experiencing the same issue. The first epoch takes a long time when I use four A100 GPUs to replicate the experiment. Looking forward to the solutions!
@mbreuss Would you mind sharing the detailed list of pip packages you are actually using (run `pip freeze > real_requirements.txt`)? That way I can make sure our Python environments are fully consistent.
Here's the output of `pip freeze` from today:
aiohttp==3.9.5
aiosignal==1.3.1
antlr4-python3-runtime==4.8
async-timeout==4.0.3
attrs==23.2.0
av==12.1.0
beautifulsoup4==4.12.3
cachetools==5.3.3
-e git+ssh://git@github.com/mees/calvin_env.git@797142c588c21e76717268b7b430958dbd13bf48#egg=calvin_env&subdirectory=../../calvin_env
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.29.5.1
colorlog==6.8.2
contourpy==1.1.1
cycler==0.12.1
decorator==4.4.2
docker-pycreds==0.4.0
einops==0.8.0
einops-exts==0.0.4
filelock==3.15.3
fonttools==4.53.0
freetype-py==2.4.0
frozenlist==1.4.1
fsspec==2024.6.0
ftfy==6.2.0
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.19.0
google-auth==2.30.0
google-cloud-core==2.4.1
google-cloud-storage==2.17.0
google-crc32c==1.5.0
google-resumable-media==2.7.1
googleapis-common-protos==1.63.1
gym==0.26.2
gym-notices==0.0.8
h5py==3.11.0
huggingface-hub==0.23.4
hurry.filesize==0.9
hydra-colorlog==1.2.0
hydra-core==1.1.1
idna==3.7
imageio==2.34.1
imageio-ffmpeg==0.5.1
importlib_metadata==7.1.0
importlib_resources==6.4.0
Jinja2==3.1.4
joblib==1.4.2
jsonlines==4.0.0
kiwisolver==1.4.5
lightning-utilities==0.11.2
lit==18.1.7
llvmlite==0.41.1
lxml==5.2.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.7.5
mdurl==0.1.2
moviepy==1.0.3
mpmath==1.3.0
multidict==6.0.5
networkx==2.2
nltk==3.8.1
numba==0.58.1
numpy==1.24.4
numpy-quaternion==2023.0.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
omegaconf==2.1.2
opencv-python==4.10.0.84
packaging==24.1
pandas==2.0.3
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
proglog==0.1.10
proto-plus==1.24.0
protobuf==4.25.3
psutil==6.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pybullet==3.2.6
pycollada==0.6
pyglet==2.0.15
Pygments==2.18.0
pyhash==0.9.3
PyOpenGL @ git+https://github.com/mmatl/pyopengl.git@76d1261adee2d3fd99b418e75b0416bb7d2865e6
pyparsing==3.1.2
pyrender==0.1.45
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytorch-lightning==1.9.5
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rsa==4.9
scikit-learn==1.3.2
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.2.0
sentry-sdk==2.6.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
soupsieve==2.5
sympy==1.12.1
-e git+https://github.com/lukashermann/tacto.git@dd53360d9a8c186f0d6439372ec0be0fa5e21731#egg=tacto&subdirectory=../../../../calvin_env/tacto
tenacity==8.4.1
termcolor==2.4.0
threadpoolctl==3.5.0
tokenizers==0.13.3
torch==2.0.1
torchdiffeq==0.2.4
torchmetrics==1.4.0.post0
torchsde==0.2.6
torchvision==0.15.2
tqdm==4.66.4
trampoline==0.1.2
transformers==4.25.1
trimesh==4.4.1
triton==2.0.0
typing_extensions==4.12.2
tzdata==2024.1
urdfpy==0.0.22
urllib3==2.2.2
vit-pytorch==1.4.4
voltron-robotics==1.1.0
wandb==0.17.2
wcwidth==0.2.13
yarl==1.9.4
zipp==3.19.2
Very nice work! I wonder why my average GPU (NVIDIA A100) utilization was pretty low (around 20%~40%). Could you share your GPU utilization logs during training, as the title mentions?