sail-sg / MDT

Masked Diffusion Transformer is the SOTA for image synthesis. (ICCV 2023)
Apache License 2.0

In MDTv2 code, `model = th.compile(model)` in train_utils.py #23

Closed EvilicLufas closed 5 months ago

EvilicLufas commented 6 months ago

`torch.compile` is available only in PyTorch >= 2.0, while the Adan repo is compiled against PyTorch 1.13 and the setup instructions for this repo also specify PyTorch 1.13. I am confused how these can be combined; how can this be fixed?

```
Traceback (most recent call last):
  File "scripts/image_train.py", line 119, in <module>
    main()
  File "scripts/image_train.py", line 64, in main
    TrainLoop(
AttributeError: module 'torch' has no attribute 'compile'
```
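For context, the crash happens because `torch.compile` simply does not exist before PyTorch 2.0. A minimal guard (my sketch, not the repo's official fix; `nn.Linear` stands in for the MDT model) lets the same script run on 1.13 by falling back to the uncompiled model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the MDT model; any nn.Module works here.
model = nn.Linear(4, 2)

# torch.compile was added in PyTorch 2.0, so guard the call; on 1.13 the
# script keeps the eager (uncompiled) model instead of crashing with
# AttributeError.
if hasattr(torch, "compile"):
    model = torch.compile(model)
```

Note this only avoids the AttributeError; MDTv2 itself still expects PyTorch >= 2.0, as discussed below.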

gasvn commented 6 months ago

Thanks for the feedback; Adan should support PyTorch > 2.0. Can you try PyTorch > 2.0? Please let me know if you find any further issues.

EvilicLufas commented 6 months ago

Thanks for your quick response! Actually I was using PyTorch > 2.0 previously, and Adan kept failing when training MDTv2 (no matter how many times I reinstalled Adan with export FORCE_CUDA=1). I then looked back at the README, followed its PyTorch version, and the first issue appeared.

```
[rank5]: ImportError: /public/ProgramFiles/anaconda3/envs/diffMAEv2/lib/python3.8/site-packages/fused_adan.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl18compute_contiguousEv
[rank5]: This could be caused by not having compiled the CUDA extension during package installation. Please try to re-install the package with the environment flag FORCE_CUDA=1 set.
344650ddd597:1066:1170 [0] NCCL INFO [Service thread] Connection closed by localRank 0
344650ddd597:1067:1166 [1] NCCL INFO [Service thread] Connection closed by localRank 0
344650ddd597:1067:1166 [1] NCCL INFO [Service thread] Connection closed by localRank 1
344650ddd597:1068:1168 [2] NCCL INFO [Service thread] Connection closed by localRank 1
344650ddd597:1066:1170 [0] NCCL INFO [Service thread] Connection closed by localRank 1
344650ddd597:1072:1164 [6] NCCL INFO [Service thread] Connection closed by localRank 7
344650ddd597:1073:1167 [7] NCCL INFO [Service thread] Connection closed by localRank 7
344650ddd597:1070:1169 [4] NCCL INFO [Service thread] Connection closed by localRank 4
344650ddd597:1071:1165 [5] NCCL INFO [Service thread] Connection closed by localRank 4
344650ddd597:1071:1165 [5] NCCL INFO [Service thread] Connection closed by localRank 6
344650ddd597:1072:1164 [6] NCCL INFO [Service thread] Connection closed by localRank 6
344650ddd597:1073:1167 [7] NCCL INFO [Service thread] Connection closed by localRank 6
344650ddd597:1068:1168 [2] NCCL INFO [Service thread] Connection closed by localRank 3
344650ddd597:1069:1163 [3] NCCL INFO [Service thread] Connection closed by localRank 3
344650ddd597:1070:1169 [4] NCCL INFO [Service thread] Connection closed by localRank 5
344650ddd597:1072:1164 [6] NCCL INFO [Service thread] Connection closed by localRank 5
344650ddd597:1071:1165 [5] NCCL INFO [Service thread] Connection closed by localRank 5
344650ddd597:1067:1166 [1] NCCL INFO [Service thread] Connection closed by localRank 2
344650ddd597:1068:1168 [2] NCCL INFO [Service thread] Connection closed by localRank 2
344650ddd597:1069:1163 [3] NCCL INFO [Service thread] Connection closed by localRank 2
344650ddd597:1073:1181 [0] NCCL INFO comm 0x55a43ae84c60 rank 7 nranks 8 cudaDev 7 busId a4000 - Abort COMPLETE
344650ddd597:1066:1183 [0] NCCL INFO comm 0x55f3526a0140 rank 0 nranks 8 cudaDev 0 busId 4f000 - Abort COMPLETE
344650ddd597:1072:1180 [0] NCCL INFO comm 0x55c400772320 rank 6 nranks 8 cudaDev 6 busId a0000 - Abort COMPLETE
344650ddd597:1071:1186 [0] NCCL INFO comm 0x562bc9d77530 rank 5 nranks 8 cudaDev 5 busId 9d000 - Abort COMPLETE
344650ddd597:1070:1192 [0] NCCL INFO comm 0x562d834a4da0 rank 4 nranks 8 cudaDev 4 busId 9c000 - Abort COMPLETE
344650ddd597:1067:1185 [0] NCCL INFO comm 0x55a5f7d0a910 rank 1 nranks 8 cudaDev 1 busId 50000 - Abort COMPLETE
344650ddd597:1068:1179 [0] NCCL INFO comm 0x55c4f7123920 rank 2 nranks 8 cudaDev 2 busId 53000 - Abort COMPLETE
344650ddd597:1069:1184 [0] NCCL INFO comm 0x55c101020900 rank 3 nranks 8 cudaDev 3 busId 57000 - Abort COMPLETE
```

EvilicLufas commented 6 months ago

> Thanks for the feedback; Adan should support PyTorch > 2.0. Can you try PyTorch > 2.0? Please let me know if you find any further issues.

While using PyTorch > 2.0 / 8x A100 / CUDA 12.1, Adan keeps failing no matter how many times I reinstall it with export FORCE_CUDA=1; compilation goes well, but the error appears during the training phase. (MDTv1 runs well, and congrats on your excellent work!) The detailed info is below:

```
[rank5]: ImportError: /public/ProgramFiles/anaconda3/envs/diffMAEv2/lib/python3.8/site-packages/fused_adan.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl18compute_contiguousEv
[rank5]: This could be caused by not having compiled the CUDA extension during package installation. Please try to re-install the package with the environment flag FORCE_CUDA=1 set.
```

XingyuXie commented 6 months ago

@EvilicLufas Sorry for the trouble; we have updated the setup.py file for Adan. You may try it now.

EvilicLufas commented 5 months ago

> @EvilicLufas Sorry for the trouble; we have updated the setup.py file for Adan. You may try it now.

Thanks for your kind support! With the updated Adan repo, the error is:

```
building 'fused_adan' extension
/public/ProgramFiles/anaconda3/envs/diffMAEv2/bin/nvcc
  -I/public/ProgramFiles/anaconda3/envs/diffMAEv2/lib/python3.8/site-packages/torch/include
  -I/public/ProgramFiles/anaconda3/envs/diffMAEv2/lib/python3.8/site-packages/torch/include/torch/csrc/api/include
  -I/public/ProgramFiles/anaconda3/envs/diffMAEv2/lib/python3.8/site-packages/torch/include/TH
  -I/public/ProgramFiles/anaconda3/envs/diffMAEv2/lib/python3.8/site-packages/torch/include/THC
  -I/public/ProgramFiles/anaconda3/envs/diffMAEv2/include
  -I/public/ProgramFiles/anaconda3/envs/diffMAEv2/include/python3.8
  -c ./fused_adan/fused_adan_kernel.cu
  -o build/temp.linux-x86_64-cpython-38/./fused_adan/fused_adan_kernel.o
  -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__
  -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
  --expt-relaxed-constexpr --compiler-options '-fPIC'
  -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\"
  -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\"
  -DTORCH_EXTENSION_NAME=fused_adan -D_GLIBCXX_USE_CXX11_ABI=0
  -gencode=arch=compute_35,code=compute_35 -gencode=arch=compute_35,code=sm_35
  -gencode=arch=compute_37,code=sm_37 -gencode=arch=compute_50,code=sm_50
  -gencode=arch=compute_52,code=compute_52 -gencode=arch=compute_52,code=sm_52
  -gencode=arch=compute_53,code=sm_53 -gencode=arch=compute_60,code=sm_60
  -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61
  -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
  -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75
  -std=c++17
nvcc fatal   : Unsupported gpu architecture 'compute_35'
error: command '/public/ProgramFiles/anaconda3/envs/diffMAEv2/bin/nvcc' failed with exit code 1
```

Then I modified the setup.py in Adan with:

```python
if "--unfused" in sys.argv:
    print("Building unfused version of adan")
    sys.argv.remove("--unfused")
elif build_cuda_ext:
    cuda_extension = CUDAExtension(
        'fused_adan',
        sources=[
            'fused_adan/pybind_adan.cpp',
            './fused_adan/fused_adan_kernel.cu',
            './fused_adan/multi_tensor_adan_kernel.cu'
        ],
        extra_compile_args={
            'cxx': [],
            'nvcc': [
                '-gencode=arch=compute_50,code=sm_50',
            ]
        }
    )
```

The compile phase then goes well, but the same error as at the beginning appears:

```
fused_adan.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl18compute_contiguousEv
```

(The README for fused Adan says it passed compilation on PyTorch 1.13.1, while this MDTv2 uses functions from PyTorch >= 2.0; can Adan really support that? Thanks again!)
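As a side note on those architecture flags: an A100 is compute capability 8.0 (sm_80), so a small helper like the following (my hypothetical sketch, not Adan's code) can generate a `-gencode` list matched to the actual hardware instead of hard-coding an older architecture such as compute_50:

```python
def gencode_flags(compute_capabilities):
    """Build nvcc -gencode flags for the given compute capabilities.

    Hypothetical helper; e.g. ["80"] targets sm_80 (A100). The resulting
    strings have the same shape as the flags nvcc rejects/accepts above.
    """
    return [
        f"-gencode=arch=compute_{cc},code=sm_{cc}"
        for cc in compute_capabilities
    ]

# A100-only build:
print(gencode_flags(["80"]))
# ['-gencode=arch=compute_80,code=sm_80']
```

Passing a too-old architecture (like compute_35 with a recent nvcc) makes the compiler fail outright, while compiling only for an architecture the GPU doesn't match can still produce a loadable but mismatched extension.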

gasvn commented 5 months ago

It seems you are using a device that is no longer supported by CUDA. What is the exact GPU type you are using? https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-35/247815

EvilicLufas commented 5 months ago

> It seems you are using a device that is no longer supported by CUDA. What is the exact GPU type you are using? https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-35/247815

There is little chance that A100 or A800 GPUs are unsupported by CUDA, and I believe the link you sent (which I had found before) says it is a CUDA version issue at compile time (the user there solved it by removing 3.5 from the CMakeLists). As you can see, I modified the setup.py in Adan to pass only `'-gencode=arch=compute_50,code=sm_50'` to nvcc; the compile phase then goes well, and the same error as at the beginning appears:

```
fused_adan.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl18compute_contiguousEv
```

EvilicLufas commented 5 months ago

```
[main !3 ?21] >> nvidia-smi
Tue Mar 12 10:55:10 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:4F:00.0 Off |                    0 |
| N/A   32C    P0             34W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:50:00.0 Off |                    0 |
| N/A   34C    P0             38W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  | 00000000:53:00.0 Off |                    0 |
| N/A   34C    P0             38W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          On  | 00000000:57:00.0 Off |                    0 |
| N/A   34C    P0             37W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  | 00000000:9C:00.0 Off |                    0 |
| N/A   32C    P0             33W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  | 00000000:9D:00.0 Off |                    0 |
| N/A   34C    P0             36W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  | 00000000:A0:00.0 Off |                    0 |
| N/A   33C    P0             35W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  | 00000000:A4:00.0 Off |                    0 |
| N/A   32C    P0             35W / 250W  |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

gasvn commented 5 months ago

@XingyuXie is still trying to track down the problem in your case. Just in case, have you checked that the CUDA nvcc version on your local server aligns with the PyTorch and cudatoolkit versions in your conda env?
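One way to check that alignment (a sketch; the version strings below are illustrative examples, not values from this thread) is to compare the `release X.Y` line from `nvcc --version` against `torch.version.cuda`:

```python
import re

def cuda_versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """Compare the CUDA release reported by `nvcc --version` with
    torch.version.cuda (e.g. "11.7"). Hypothetical helper, not part of
    Adan or MDT."""
    m = re.search(r"release (\d+\.\d+)", nvcc_output)
    return bool(m) and m.group(1) == torch_cuda

# Example nvcc output line (illustrative):
sample = "Cuda compilation tools, release 11.7, V11.7.99"
print(cuda_versions_match(sample, "11.7"))  # True
print(cuda_versions_match(sample, "12.1"))  # False
```

In practice you would feed it `subprocess.run(["nvcc", "--version"], ...)` output on the server and `torch.version.cuda` from inside the conda env; a mismatch here is a common cause of undefined-symbol errors in compiled extensions.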

EvilicLufas commented 5 months ago

> @XingyuXie is still trying to track down the problem in your case. Just in case, have you checked that the CUDA nvcc version on your local server aligns with the PyTorch and cudatoolkit versions in your conda env?

For now the version alignments in my env are fine. Could I get more detailed info about your env? (Since I'm using exactly the GPU server from the paper, maybe I should create the env again.) The readme.md in MDTv2 still specifies PyTorch 1.13 (which conflicts with th.compile in PyTorch >= 2.0), so maybe the setup instructions are not updated? By the way, MDTv1 (in another env) runs smoothly and I have done some work on it, so I hope MDTv2 can run smoothly as well.

gasvn commented 5 months ago

Thanks for the reminder. MDTv2 requires torch > 2.0 to support torch.compile; I will update the readme. Unfortunately, I no longer have access to the server used for training MDTv2. The only changes I made for MDTv2 relative to MDTv1 were to 1) install torch 2.0 and 2) install Adan.

EvilicLufas commented 5 months ago

> Thanks for the reminder. MDTv2 requires torch > 2.0 to support torch.compile; I will update the readme. Unfortunately, I no longer have access to the server used for training MDTv2. The only changes I made for MDTv2 relative to MDTv1 were to 1) install torch 2.0 and 2) install Adan.

Yeah, at that time I copied the MDTv1 env and did exactly the same things, and it went wrong; maybe I'll try again now. Meanwhile, could you check whether the Adan repo (especially fused_adan) can be improved? Its README says it passes on PyTorch 1.13.1 & CUDA 11.6+. I hope there is some good news, and thanks for the great work.

XingyuXie commented 5 months ago

@EvilicLufas Sorry for the problems with Adan.

(screenshot)

EvilicLufas commented 5 months ago

> @EvilicLufas Sorry for the problems with Adan.
>
> - I have tried to install Adan in a totally new environment, and everything is okay.
>
>   (screenshot)
>
> - You can skip the step of installing Adan: just put adan.py in the folder and import Adan from that file. You can then use Adan directly, except for its fused version. Anyway, most users won't use the fused version, so don't worry about fused Adan.

Thank you! Combined with your suggestions and using `conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia`, the env problem has been solved. I guess the previous error came from the PyTorch version being a bit too high. Anyway, thanks for your effort. Great work!