Closed zcdliuwei closed 1 year ago
Is xformers installed and enabled?
No, I set up the environment exactly according to your requirements file, except for xformers: pip install xformers always fails for me. Could you tell me how to install xformers? My server environment is shown in the figure above.
Thank you very much
You need to install xformers to avoid OOM. For installation, please follow the official instruction.
I executed pip install xformers==0.16rc425 to install xformers, with torch==1.13.1 and torchvision==0.14.1, and then encountered the following error during training:
It looks like you're using distributed training, which is not supported now. To avoid using multiple GPUs, you may specify one GPU device by export CUDA_VISIBLE_DEVICES=GPU_ID.
I specified a visible device as below: CUDA_VISIBLE_DEVICES=0 accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml" but I still encountered the same error.
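A minimal sketch of restricting training to one GPU. Note that an existing accelerate config file can still request multiple processes even when only one device is visible, so forcing a single process at the accelerate level (via --num_processes, a real flag of accelerate launch) may also be needed; the config path below just follows the thread's example:

```shell
# Make only GPU 0 visible; must be set in the same shell before launching.
export CUDA_VISIBLE_DEVICES=0

# Equivalent inline form, as used in the thread:
#   CUDA_VISIBLE_DEVICES=0 accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
#
# If the error persists, additionally force a single process:
#   accelerate launch --num_processes 1 train_tuneavideo.py --config="configs/man-surfing.yaml"

echo "$CUDA_VISIBLE_DEVICES"
```

If accelerate was previously configured for multi-GPU (via accelerate config), re-running that configuration step and selecting a single GPU is another way to clear the distributed-training setting.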
Can you try using the same environment as written in the requirements.txt? In particular, torch==1.12.1.
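A sketch of pinning the environment as suggested. The torch==1.12.1 pin comes from the thread; the matching torchvision==0.13.1 release is an assumption here, and the rest of the pins are taken from the repository's requirements.txt:

```shell
# Pin torch first so later installs don't pull a newer, incompatible build.
# torchvision==0.13.1 is the companion release for torch 1.12.1 (an assumption,
# verify against the repo's requirements.txt).
pip install torch==1.12.1 torchvision==0.13.1
pip install -r requirements.txt
```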
I have reinstalled the environment and can now run normally, but the results are not as good as those in your README file. Can you give me any suggestions for improvement?
The following is the GIF I generated
It looks like you're using distributed training, which is not supported now. To avoid using multiple GPUs, you may specify one GPU device by export CUDA_VISIBLE_DEVICES=GPU_ID.
Why doesn't it support multi-GPU training? Would it be difficult to modify the code for multi-GPU training?
I tried multi-gpu training, but results are bad
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 23.70 GiB total capacity; 8.31 GiB already allocated; 254.06 MiB free; 8.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
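The error message itself points at the allocator knob it mentions. A minimal sketch of setting it; the 128 MiB value is a tuning guess, not an official recommendation:

```shell
# Cap the size of blocks the CUDA caching allocator will split, which can
# reduce fragmentation when reserved memory far exceeds allocated memory.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

echo "$PYTORCH_CUDA_ALLOC_CONF"
```

This must be set in the environment before the training process starts; it has no effect on an already-running process. If OOM persists, installing xformers (as advised earlier in the thread) is the other memory-saving lever.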
My server environment is: