Open ygtxr1997 opened 3 weeks ago
Does this help? The GPUs are 4 A6000s. I think the limiting factor is often the data loading, since CALVIN stores every state as a separate file instead of trajectories. I took over their HULC code for that. They also offer SHM options, but those did not work for me.
Also please note that the model is tested with 1k rollouts after epochs 30, 35 and 40. Thus, there are some gaps in the later stages.
Let me know if you have more questions!
Thank you so much! The image clearly shows the GPU utilization during training, demonstrating that there truly is some bottleneck, potentially caused by the data loading process. I will check it. If I find any solution to the issue, I will update here.
Keep me updated. Recently, two other labs provided different implementations of CALVIN training that could help: https://github.com/bytedance/GR-MG/tree/main and https://github.com/nickgkan/3d_diffuser_actor
Maybe these are better in terms of speed. Sadly I had no time to try them out myself yet.
The information you provided helps a lot. I'll take these repos into account.
I think I found some issues related to the low GPU utilization problem.
As shown in the figure above, to load the 'actions' for steps 1 to 10, the current code has to load the full episode_xxx.npz files. That means for each step the machine needs to load 1126.4 MB of data. If your data is stored on a remote server, this requires ~10 Gbps of bandwidth. Sadly, our network bandwidth is only ~2 Gbps (200 MB/s). Therefore, the code runs very slowly for us, about 3 hours for the first epoch.
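For reference, the bandwidth arithmetic above can be checked in a few lines (the assumption of one optimization step per second is mine; adjust for your actual step time):

```python
# Rough bandwidth estimate for per-step full-npz loading, assuming
# roughly one optimization step per second (my assumption).
data_per_step_mb = 1126.4                      # full episode_xxx.npz data read per step
required_gbps = data_per_step_mb * 8 / 1000    # MB/s -> Gbit/s
available_gbps = 200 * 8 / 1000                # ~200 MB/s remote storage

print(f"required ~{required_gbps:.1f} Gbps, available ~{available_gbps:.1f} Gbps")
# With these numbers: required ~9.0 Gbps vs available ~1.6 Gbps,
# i.e. roughly the ~10 Gbps vs ~2 Gbps gap described above.
```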
To solve this, I pre-read the actions data from the episode_xxx.npz files and save it into a single .npy file (~30 MB). This boosts the data loading speed in my case, as shown below.
But if your data is stored on a local SSD, I think this issue would be easily alleviated.
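Until the actual code is shared, here is a minimal sketch of the idea as I understand it. The function name, file layout (`episode_*.npz` with an `actions` key), and the memory-mapped read are assumptions based on the CALVIN episode format; the real implementation may differ:

```python
import glob
import os

import numpy as np


def extract_actions(dataset_dir: str, out_path: str) -> None:
    """Read only the small 'actions' array from each episode_xxx.npz and
    stack everything into one .npy file, so training no longer has to
    touch the image-heavy episode files just to fetch actions.

    Assumes the CALVIN layout: one episode_*.npz per step, each holding
    an 'actions' array of shape (action_dim,).
    """
    files = sorted(glob.glob(os.path.join(dataset_dir, "episode_*.npz")))
    actions = [np.load(f)["actions"] for f in files]  # one array per step
    np.save(out_path, np.stack(actions))              # shape: (num_steps, action_dim)


# At training time, the consolidated file can be memory-mapped once:
#   all_actions = np.load("all_actions.npy", mmap_mode="r")
#   a_t = all_actions[t]   # cheap per-step lookup, no npz decompression
```

Memory-mapping (`mmap_mode="r"`) keeps the per-worker memory footprint small while still avoiding the per-step npz reads.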
Amazing work! Can you share the code on how to implement that? Lowering the bandwidth requirement sounds good for every cluster.
Yes, I would like to share my code for extracting the actions data into a single *.npy file and reading the action data from it. But it will take a few days to reorganize my code. Maybe I will make a pull request in a few days.
Sounds good, looking forward to it!
I’m experiencing the same issue. The first epoch takes a long time when I use four A100 GPUs to replicate the experiment. Looking forward to the solutions!
@mbreuss Would you mind sharing the detailed list of pip packages you are actually using (run `pip freeze > real_requirements.txt`)? That way I can make sure our Python environments are fully consistent.
Here's the output of `pip freeze` from today:
aiohttp==3.9.5
aiosignal==1.3.1
antlr4-python3-runtime==4.8
async-timeout==4.0.3
attrs==23.2.0
av==12.1.0
beautifulsoup4==4.12.3
cachetools==5.3.3
-e git+ssh://git@github.com/mees/calvin_env.git@797142c588c21e76717268b7b430958dbd13bf48#egg=calvin_env&subdirectory=../../calvin_env
certifi==2024.6.2
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.29.5.1
colorlog==6.8.2
contourpy==1.1.1
cycler==0.12.1
decorator==4.4.2
docker-pycreds==0.4.0
einops==0.8.0
einops-exts==0.0.4
filelock==3.15.3
fonttools==4.53.0
freetype-py==2.4.0
frozenlist==1.4.1
fsspec==2024.6.0
ftfy==6.2.0
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.19.0
google-auth==2.30.0
google-cloud-core==2.4.1
google-cloud-storage==2.17.0
google-crc32c==1.5.0
google-resumable-media==2.7.1
googleapis-common-protos==1.63.1
gym==0.26.2
gym-notices==0.0.8
h5py==3.11.0
huggingface-hub==0.23.4
hurry.filesize==0.9
hydra-colorlog==1.2.0
hydra-core==1.1.1
idna==3.7
imageio==2.34.1
imageio-ffmpeg==0.5.1
importlib_metadata==7.1.0
importlib_resources==6.4.0
Jinja2==3.1.4
joblib==1.4.2
jsonlines==4.0.0
kiwisolver==1.4.5
lightning-utilities==0.11.2
lit==18.1.7
llvmlite==0.41.1
lxml==5.2.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.7.5
mdurl==0.1.2
moviepy==1.0.3
mpmath==1.3.0
multidict==6.0.5
networkx==2.2
nltk==3.8.1
numba==0.58.1
numpy==1.24.4
numpy-quaternion==2023.0.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
omegaconf==2.1.2
opencv-python==4.10.0.84
packaging==24.1
pandas==2.0.3
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
proglog==0.1.10
proto-plus==1.24.0
protobuf==4.25.3
psutil==6.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pybullet==3.2.6
pycollada==0.6
pyglet==2.0.15
Pygments==2.18.0
pyhash==0.9.3
PyOpenGL @ git+https://github.com/mmatl/pyopengl.git@76d1261adee2d3fd99b418e75b0416bb7d2865e6
pyparsing==3.1.2
pyrender==0.1.45
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytorch-lightning==1.9.5
pytz==2024.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rsa==4.9
scikit-learn==1.3.2
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.2.0
sentry-sdk==2.6.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
soupsieve==2.5
sympy==1.12.1
-e git+https://github.com/lukashermann/tacto.git@dd53360d9a8c186f0d6439372ec0be0fa5e21731#egg=tacto&subdirectory=../../../../calvin_env/tacto
tenacity==8.4.1
termcolor==2.4.0
threadpoolctl==3.5.0
tokenizers==0.13.3
torch==2.0.1
torchdiffeq==0.2.4
torchmetrics==1.4.0.post0
torchsde==0.2.6
torchvision==0.15.2
tqdm==4.66.4
trampoline==0.1.2
transformers==4.25.1
trimesh==4.4.1
triton==2.0.0
typing_extensions==4.12.2
tzdata==2024.1
urdfpy==0.0.22
urllib3==2.2.2
vit-pytorch==1.4.4
voltron-robotics==1.1.0
wandb==0.17.2
wcwidth==0.2.13
yarl==1.9.4
zipp==3.19.2
Very nice work! I wonder why my average GPU (NVIDIA A100) utilization was pretty low (around 20%~40%). Could you share your GPU utilization logs during training, as the title mentions?