lukashermann / hulc

Hierarchical Universal Language Conditioned Policies
http://hulc.cs.uni-freiburg.de
MIT License

ALSA lib error #18

Open · Cherryjingyao opened this issue 8 months ago

Cherryjingyao commented 8 months ago

While running the training code, I get the following ALSA error:

[screenshot]

```
ALSA lib conf.c:5180:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5703:(snd_config_expand) Evaluate error: No such file or directory
```

Why does this use ALSA, and how can I fix it?

lukashermann commented 8 months ago

Can you give a bit more context? At which point does this error occur? What are you running?

Cherryjingyao commented 8 months ago

I'm running the training script with the debug dataset:

```bash
python hulc/training.py +trainer.gpus=-1 +datamodule.root_data_dir=/data/calvin/debug_dataset datamodule/datasets=vision_lang_shm
```

After validation, the error comes out:

[screenshot]

P.S. When I use the ABC_D dataset, it crashes while loading the data.

lukashermann commented 8 months ago

It could be a problem with the GPU renderer when you run the rollouts. Can you try turning off rollouts and see if it still crashes? You have to add `~callbacks/rollout` and `~callbacks/rollout_lh` to the command line arguments. Which GPU do you have in your machine?

For the bigger dataset, you might not have enough shared memory, so try using the normal dataloader by setting `datamodule/datasets=vision_lang`.
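
For example, combining both suggestions with the debug-dataset command from above would look something like this (keep your own data path):

```bash
# Same training command as before, with the rollout callbacks removed and the
# regular (non-shared-memory) dataloader selected:
python hulc/training.py +trainer.gpus=-1 +datamodule.root_data_dir=/data/calvin/debug_dataset \
    datamodule/datasets=vision_lang ~callbacks/rollout ~callbacks/rollout_lh
```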

Cherryjingyao commented 8 months ago

When adding `~callbacks/rollout` and `~callbacks/rollout_lh`, it shows:

[screenshot]

I have 4 A100 GPUs with 40 GB each and 5 EGL devices; only the one with ID=4 can be used, so I set EGL_DEVICE_ID=4 to run the code, otherwise it crashes.
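
Concretely, I launch it roughly like this (same debug-dataset command as before, just with the environment variable set first):

```bash
# Point the EGL renderer at the only usable device on this machine (ID 4),
# then start training exactly as above:
export EGL_DEVICE_ID=4
python hulc/training.py +trainer.gpus=-1 +datamodule.root_data_dir=/data/calvin/debug_dataset \
    datamodule/datasets=vision_lang_shm
```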

lukashermann commented 8 months ago

Ah sorry, my bad, then try using only `~callbacks/rollout_lh`.

Is your Nvidia driver correctly installed? The log you previously sent mentions Mesa, which shouldn't be used if you have an Nvidia GPU with the correct driver.
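
As a quick sanity check (assuming `nvidia-smi` is on your PATH and, for the second command, that `glxinfo` from the `mesa-utils` package is installed and a display is available):

```bash
# Should list your A100s and the installed driver version if the NVIDIA driver is set up correctly
nvidia-smi
# Should report an NVIDIA OpenGL renderer rather than Mesa/llvmpipe
glxinfo | grep "OpenGL renderer"
```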

Cherryjingyao commented 8 months ago

How can I avoid using Mesa? Here is my GPU:

[screenshot]

I tried using the normal dataloader by setting `datamodule/datasets=vision_lang`. It can load the data, but after validation of epoch 0 it crashes again:

```
[Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800149 milliseconds before timing out. [rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800149 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe1fdae0d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fe1fec886e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fe1fec8bc3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fe1fec8c839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd3e95 (0x7fe2489a9e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #5: + 0x8609 (0x7fe24a003609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #6: clone + 0x43 (0x7fe249dc2133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800149 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe1fdae0d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fe1fec886e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fe1fec8bc3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fe1fec8c839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd3e95 (0x7fe2489a9e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #5: + 0x8609 (0x7fe24a003609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #6: clone + 0x43 (0x7fe249dc2133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe1fdae0d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: + 0xdf6b11 (0x7fe1fe9e2b11 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd3e95 (0x7fe2489a9e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #3: + 0x8609 (0x7fe24a003609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #4: clone + 0x43 (0x7fe249dc2133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out. [rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd106797d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd10793f6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd107942c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd107943839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd3e95 (0x7fd151660e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #5: + 0x8609 (0x7fd152cba609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #6: clone + 0x43 (0x7fd152a79133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800818 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd106797d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd10793f6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd107942c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd107943839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd3e95 (0x7fd151660e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #5: + 0x8609 (0x7fd152cba609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #6: clone + 0x43 (0x7fd152a79133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd106797d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: + 0xdf6b11 (0x7fd107699b11 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd3e95 (0x7fd151660e95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #3: + 0x8609 (0x7fd152cba609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #4: clone + 0x43 (0x7fd152a79133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out. [rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down. [rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aa3076d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2aa421e6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2aa4221c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2aa4222839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd3e95 (0x7f2aedf3fe95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #5: + 0x8609 (0x7f2aef599609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #6: clone + 0x43 (0x7f2aef358133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=63281, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800919 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aa3076d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2aa421e6e6 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2aa4221c3d in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2aa4222839 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xd3e95 (0x7f2aedf3fe95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #5: + 0x8609 (0x7f2aef599609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #6: clone + 0x43 (0x7f2aef358133 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2aa3076d87 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libc10.so) frame #1: + 0xdf6b11 (0x7f2aa3f78b11 in /data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xd3e95 (0x7f2aedf3fe95 in /data/mamba/envs/robodiff/bin/../lib/libstdc++.so.6) frame #3: + 0x8609 (0x7f2aef599609 in /lib/x86_64-linux-gnu/libpthread.so.0) frame #4: clone + 0x43 (0x7f2aef358133 in /lib/x86_64-linux-gnu/libc.so.6)

Error executing job with overrides: ['+trainer.gpus=-1', 'datamodule.root_data_dir=/data/calvin/task_ABC_D', 'datamodule/datasets=vision_lang', '+datamodule.num_workers=1'] Traceback (most recent call last): File "/pfs-data/code/hulc/hulc/training.py", line 76, in train trainer.fit(model, datamodule=datamodule, ckpt_path=chk) # type: ignore File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit call._call_and_handle_interrupt( File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch mp.start_processes( File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes while not context.join(): File "/data/mamba/envs/robodiff/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 140, in join raise ProcessExitedException( torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGABRT
```

I wonder if this is related to the use of Mesa.

lukashermann commented 8 months ago

I just realized there is a mistake in the README file: when pytorch-lightning upgraded to a newer version, they renamed the `trainer.gpus` argument to `trainer.devices`. This is already reflected in the code, but not in the documentation. Can you try running it on a single GPU without rollouts and without the shared memory dataloader?

```bash
python hulc/training.py trainer.devices=1 datamodule.root_data_dir=/data/calvin/debug_dataset datamodule/datasets=vision_lang ~callbacks/rollout_lh
```

If this works, then try the multiprocessing version by setting `trainer.devices=-1`.
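
For clarity, the multi-GPU variant would then be the same command with only the device count changed:

```bash
# Identical overrides to the single-GPU run above, but letting Lightning use all available GPUs
python hulc/training.py trainer.devices=-1 datamodule.root_data_dir=/data/calvin/debug_dataset \
    datamodule/datasets=vision_lang ~callbacks/rollout_lh
```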

Cherryjingyao commented 7 months ago

I ran the code as you advised, but I found that the training speed is very slow: with 4 GPUs and batch size 32, one epoch takes almost 10 hours.

[screenshot]

One more strange thing is that with the debug dataset it trains normally for many epochs, but it crashes when using the ABC_D dataset; there is no error, it just stops at:

[screenshot]

Cherryjingyao commented 7 months ago

I also ran the evaluation code with the pretrained models: `python hulc/evaluation/evaluate_policy_ori.py --dataset_path /data/calvin/task_ABC_D --train_folder ./checkpoints/HULC_ABC_D`. How do I get the video as output? I can't find any related parameters.

lukashermann commented 7 months ago

I suggest you increase the batch size; we used 8 NVIDIA RTX 2080 Ti GPUs with only 12 GB of memory per GPU, so if you have A100s, you can easily increase it. Since you use 4 GPUs in your setup, you could start with batch size 64 if you want the same effective batch size as in our experiments, but feel free to experiment with increasing it more. Also, you can try to use more dataloading workers by setting `datamodule.datasets.vision_dataset.num_workers=4` and `datamodule.datasets.lang_dataset.num_workers=4`. Using the shared memory dataloader further speeds up training, but you need enough shared memory on your machine.
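
As a rough example, the combined overrides could look like the command below. The `num_workers` paths are the ones mentioned above; the `batch_size` paths are an assumption that mirrors them, so please verify them against the datamodule dataset configs in your checkout:

```bash
# Hypothetical example: more dataloading workers plus a larger per-GPU batch size.
# Double-check the batch_size override paths against your config before using them.
python hulc/training.py trainer.devices=-1 datamodule.root_data_dir=/data/calvin/task_ABC_D \
    datamodule/datasets=vision_lang \
    datamodule.datasets.vision_dataset.num_workers=4 datamodule.datasets.lang_dataset.num_workers=4 \
    datamodule.datasets.vision_dataset.batch_size=64 datamodule.datasets.lang_dataset.batch_size=64
```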

We trained our models for 30 epochs on 8 GPUs, which took around 1 week (depending on the dataset).

> I also ran the evaluation code with the pretrained models: `python hulc/evaluation/evaluate_policy_ori.py --dataset_path /data/calvin/task_ABC_D --train_folder ./checkpoints/HULC_ABC_D`. How do I get the video as output? I can't find any related parameters.

The code currently doesn't implement writing the video to a file; you can visualize it by passing `--debug`. However, it should be a straightforward modification to save the video output.
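
For reference, the evaluation call with the visualization enabled would then be your command from above plus the flag:

```bash
# Same evaluation command as before, with the live visualization turned on via --debug
python hulc/evaluation/evaluate_policy_ori.py --dataset_path /data/calvin/task_ABC_D \
    --train_folder ./checkpoints/HULC_ABC_D --debug
```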

Cherryjingyao commented 7 months ago

Thanks for your suggestion. I can run the code normally with `num_workers=4` and `batch_size=64` (although, limited by memory, the speed is still slow). After running one epoch I found no output. Where is the trained model saved, what is the save interval, and which parameters control this? (I'm not familiar with the use of hydra.) Again, thanks for your answers.

lukashermann commented 7 months ago

By default, it saves the model every epoch. If you didn't set the `log_dir` command line argument, it creates a `runs` folder in the hulc directory, where all the runs are saved.
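
For example (treat the exact output path below as a placeholder):

```bash
# Checkpoints end up under the automatically created runs folder inside the hulc directory ...
ls runs/
# ... unless you redirect them by passing log_dir when launching training:
python hulc/training.py trainer.devices=-1 datamodule.root_data_dir=/data/calvin/task_ABC_D \
    datamodule/datasets=vision_lang log_dir=/path/to/my/logs
```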

Cherryjingyao commented 7 months ago

I got it, thanks for answering!

lukashermann commented 7 months ago

In order to make the rollouts work, did you have a look at this issue?