hyao1 / GLAD

The official code of "GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection"
35 stars · 2 forks

GPU consumption #2

Open Gao3chen opened 2 months ago

Gao3chen commented 2 months ago

Hello, author! When I run train.sh, I run out of GPU memory even after reducing the batch size to 2. I am using a single 24 GB 3090; which GPU did you use?

hyao1 commented 2 months ago

We trained the UNet with batch size 32 on an A6000 (48 GB). If your GPU memory is limited, you can reduce the batch size. You can also download our model and run testing directly.
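If reducing the batch size hurts optimization, gradient accumulation can emulate the original effective batch size on a smaller GPU. This is a minimal, hypothetical sketch (the model, data, and optimizer are placeholders, not the repo's actual training loop): 16 micro-batches of 2 approximate one update at batch size 32.

```python
import torch
from torch import nn

# Placeholder model/optimizer standing in for the repo's UNet training setup.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 16                      # 16 micro-batches of 2 -> effective batch 32
micro_bs = 2
data = torch.randn(accum_steps * micro_bs, 8)
target = torch.randn(accum_steps * micro_bs, 1)

optimizer.zero_grad()
for step in range(accum_steps):
    xb = data[step * micro_bs:(step + 1) * micro_bs]
    yb = target[step * micro_bs:(step + 1) * micro_bs]
    loss = nn.functional.mse_loss(model(xb), yb)
    # Scale the loss so accumulated gradients average over the full batch.
    (loss / accum_steps).backward()
optimizer.step()                      # one update with the accumulated gradient
```

With Hugging Face `accelerate` (which the traceback shows this repo uses), the same effect is available via `Accelerator(gradient_accumulation_steps=16)` and the `accelerator.accumulate(model)` context manager.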

Gao3chen commented 2 months ago

Thanks!

hyao1 commented 2 months ago

It is not normal that training fails with batch size 2 on a 3090. Our model trains on a 2080 Ti (11 GB) with batch size 2. Let me review the code; please wait a while.

Gao3chen commented 2 months ago

When I run train.sh, the error occurs:

Traceback (most recent call last):
  File "/opt/data/private/2024.5.29/GLAD/main.py", line 612, in <module>
    main(args, class_name)
  File "/opt/data/private/2024.5.29/GLAD/main.py", line 505, in main
    loss, global_step = train_one_epoch(accelerator,
  File "/opt/data/private/2024.5.29/GLAD/main.py", line 222, in train_one_epoch
    accelerator.backward(loss)
  File "/opt/conda/envs/sd_X/lib/python3.9/site-packages/accelerate/accelerator.py", line 1745, in backward
    loss.backward(**kwargs)
  File "/opt/conda/envs/sd_X/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/sd_X/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

hyao1 commented 2 months ago

> Hello, author! When I run train.sh, I run out of GPU memory even after reducing the batch size to 2. I am using a single 24 GB 3090; which GPU did you use?

Please refer to the attached screenshot. 11 GB is enough for batch size 2. Maybe there is a problem with your GPU or environment.

hyao1 commented 2 months ago

> When I run train.sh, the error occurs: Traceback (most recent call last): … RuntimeError: CUDA error: invalid argument

This seems to be a CUDA problem, and I do not know how to solve it. Maybe you can search for it on Baidu or Google.
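For errors like `CUDA error: invalid argument`, the reported stack frame is often not where the failure actually happened, because CUDA kernels launch asynchronously. A common first diagnostic step (a general PyTorch debugging technique, not specific to this repo) is to force synchronous launches with `CUDA_LAUNCH_BLOCKING=1` and to confirm that the installed PyTorch build matches the driver/toolkit:

```python
import os
# Must be set before the first CUDA call (ideally before importing torch),
# so that kernel launches are synchronous and tracebacks point at the real failure.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print("torch version:      ", torch.__version__)
print("built against CUDA: ", torch.version.cuda)
print("CUDA available:     ", torch.cuda.is_available())

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    # Tiny forward/backward on the GPU; with blocking launches, any
    # kernel-level error would surface here with an accurate stack trace.
    x = torch.randn(4, 4, device="cuda", requires_grad=True)
    (x @ x).sum().backward()
    torch.cuda.synchronize()
    print("minimal forward/backward on GPU succeeded")
```

If this minimal script also fails, the problem lies in the environment (driver, CUDA runtime, or PyTorch build mismatch) rather than in the GLAD code; reinstalling a PyTorch wheel built for the installed driver is the usual fix.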