Gao3chen opened 4 months ago
We train the UNet with batch size 32 on an A6000 (48 GB). If your GPU memory is limited, you can reduce the batch size. You can also download our released model and test with it.
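If lowering the per-step batch size alone is undesirable, gradient accumulation can preserve the effective batch of 32 used by the authors. A minimal sketch of the arithmetic (the variable names are illustrative, not taken from the GLAD code):

```python
# Keep the effective batch size at 32 (the authors' A6000 setting) while
# the per-step batch is whatever fits in memory, by accumulating gradients
# over several forward/backward passes before each optimizer step.
per_step_batch = 2            # fits on an 11 GB 2080 Ti, per this thread
target_effective_batch = 32   # the authors' setting on a 48 GB A6000

accum_steps = target_effective_batch // per_step_batch
print(accum_steps)                   # 16 accumulation steps
print(per_step_batch * accum_steps)  # effective batch of 32
```

Hugging Face Accelerate (which the traceback shows this repo uses) supports this pattern natively via its gradient accumulation settings, so only the optimizer-step frequency changes, not the training recipe.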
Thanks!
It is not normal that training fails with bs 2 on a 3090. Our model can train on a 2080 Ti (11 GB) with bs 2. Let me review the code; please wait a while.
Hi author! When I run train.sh, I still run out of GPU memory even after reducing the batch size to 2. I am using a single 24 GB 3090; what GPU did you use? Please refer to the screenshot.

11 GB is enough for bs 2. Maybe there is a problem with your GPU or environment.
When I run train.sh, the error occurs:

    Traceback (most recent call last):
      File "/opt/data/private/2024.5.29/GLAD/main.py", line 612, in <module>
        main(args, class_name)
      File "/opt/data/private/2024.5.29/GLAD/main.py", line 505, in main
        loss, global_step = train_one_epoch(accelerator,
      File "/opt/data/private/2024.5.29/GLAD/main.py", line 222, in train_one_epoch
        accelerator.backward(loss)
      File "/opt/conda/envs/sd_X/lib/python3.9/site-packages/accelerate/accelerator.py", line 1745, in backward
        loss.backward(**kwargs)
      File "/opt/conda/envs/sd_X/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
        torch.autograd.backward(
      File "/opt/conda/envs/sd_X/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: CUDA error: invalid argument
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
This seems to be a CUDA problem. I do not know how to solve it; maybe you can search for it on Baidu or Google.
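A general first step for localizing errors like `RuntimeError: CUDA error: invalid argument` is to force synchronous kernel launches with PyTorch's `CUDA_LAUNCH_BLOCKING` environment variable, so the Python traceback points at the op that actually failed rather than a later one (CUDA launches are asynchronous by default). This is standard PyTorch debugging advice, not something specific to GLAD:

```shell
# CUDA kernel launches are asynchronous, so the reported stack frame is
# often not the real culprit. Forcing synchronous launches makes the
# traceback point at the failing kernel.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
# Then re-run the training script with the variable set, e.g.: bash train.sh
```

Device-side assertions (`TORCH_USE_CUDA_DSA`, as the error message suggests) require a PyTorch build compiled with that flag, so the env var route above is usually the quicker diagnostic.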