tll1945-eng opened 9 months ago
My 4090 GPU hits the same error: the machine crashes after training for about 2000 epochs, and I have to reset the computer to get it to restart.
I encountered this problem too.
Exception ignored in:
My GPU information is as follows:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:B1:00.0 Off |                  N/A |
| 37%   37C    P8    24W / 350W |      0MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Solved. Just request more RAM, like 48GB.
This is the same as issue 36; see that thread.
I rented a V100 on Alibaba Cloud. After installing diffusion_policy there, I followed the instructions given in the diffusion_policy repo and, under the "Running for a single seed" mode, ran:

python train.py --config-dir=. --config-name=image_pusht_diffusion_policycnn.yaml training.seed=42 training.device=cuda:0 hydra.run.dir='data/outputs/${now:%Y.%m.%d}/${now:%H.%M.%S}${name}_${task_name}'

The program only completes one epoch of training normally. Once that first epoch finishes and the machine calls the reset_async function in gym's async_vector_env.py, it crashes. Is there a bug in some statement inside the run(self, policy: BaseImagePolicy) function in pusht_image_runner.py that triggers this exception, so the source code needs to be modified before it can run normally? Or is a single V100 simply not enough to run diffusion_policy, causing the exception described above?
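For what it's worth: reset_async in gym's AsyncVectorEnv works by sending a "reset" command over a pipe to per-env worker processes, so a crash in any worker process tends to surface at that call even when the real bug is elsewhere (in the env itself, a renderer, or shared memory). Below is a minimal stdlib-only sketch of that command/pipe pattern, just to show where a dead worker would make the parent fail; the names `_worker` and `demo` are illustrative and are not gym's actual internals:

```python
import multiprocessing as mp


def _worker(remote):
    # Stand-in for gym's per-env worker loop: block on the pipe and
    # answer commands until told to close.
    while True:
        cmd, _ = remote.recv()
        if cmd == "reset":
            remote.send({"obs": 0})  # pretend reset observation
        elif cmd == "close":
            remote.close()
            break


def demo():
    # Parent side: mimics reset_async (send the command) followed by
    # reset_wait (receive the result). If the worker process died, the
    # recv() here is where the parent would hang or raise.
    parent, child = mp.Pipe()
    proc = mp.Process(target=_worker, args=(child,))
    proc.start()
    parent.send(("reset", None))
    obs = parent.recv()
    parent.send(("close", None))
    proc.join()
    return obs


if __name__ == "__main__":
    print(demo())
```

One cheap way to narrow this down is to run the rollout with a synchronous vector env (or a single env) instead of AsyncVectorEnv: if the crash disappears, the problem is in the worker-process machinery or its environment (shared memory, rendering in subprocesses), not in the runner logic itself.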