mosaicml / diffusion

Apache License 2.0

Error doing composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml #29

Open wangmiaowei opened 1 year ago

wangmiaowei commented 1 year ago

composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-512.yaml
[2023-06-13 20:29:52,077][composer.utils.reproducibility][INFO] - Setting seed to 17
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.__init__() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:composer.cli.launcher:Rank 3 crashed with exit code 1. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 3 (PID 40553) exited with code 1
----------Begin global rank 3 STDOUT----------
[2023-06-13 20:29:52,032][composer.utils.reproducibility][INFO] - Setting seed to 17

----------End global rank 3 STDOUT----------
----------Begin global rank 3 STDERR----------
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
TypeError("UNet2DConditionModel.__init__() got an unexpected keyword argument 'dual_cross_attention'")
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 40550) exited with code -15
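
The `TypeError` above says that the `UNet2DConditionModel` class being called does not accept a `dual_cross_attention` keyword. One way to check that against the local install is to inspect the constructor signature. This is only a sketch: `accepts_kwarg`, `OldUNet` and `NewUNet` are hypothetical names made up for illustration, and the check on the real class assumes `diffusers` is importable and that its constructor signature is visible to `inspect`.

```python
import inspect


def accepts_kwarg(target, name):
    """Return True if `target` can be called with the keyword argument `name`."""
    params = inspect.signature(target).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # An explicit parameter with that name, usable as a keyword.
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A catch-all **kwargs also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for a class without and with the keyword.
class OldUNet:
    def __init__(self, sample_size=None, in_channels=4):
        pass


class NewUNet:
    def __init__(self, sample_size=None, in_channels=4, dual_cross_attention=False):
        pass


if __name__ == "__main__":
    print("OldUNet:", accepts_kwarg(OldUNet, "dual_cross_attention"))
    print("NewUNet:", accepts_kwarg(NewUNet, "dual_cross_attention"))
    try:
        # Assumption: diffusers is installed in the environment under test.
        import diffusers
        from diffusers import UNet2DConditionModel

        print("diffusers", diffusers.__version__, "loaded from", diffusers.__file__)
        print(
            "UNet2DConditionModel accepts dual_cross_attention:",
            accepts_kwarg(UNet2DConditionModel, "dual_cross_attention"),
        )
    except ImportError:
        print("diffusers is not importable in this environment")
```

If the last line prints `False`, the class Python imports does not take that argument. The printed `__file__` path shows which copy of `diffusers` is actually loaded.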

Landanjs commented 1 year ago

Can you provide the output of pip list from the machine you are trying to run this on?
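
If the full `pip list` is too long to paste, a shorter report may be enough. The sketch below prints the installed versions of a handful of packages using only the standard library. The package names in `PACKAGES` are a guess at the ones most relevant to this error, and `installed_versions` is a hypothetical helper, not part of any library.

```python
from importlib import metadata

# Assumption: these are the distributions most likely to matter here.
PACKAGES = [
    "diffusers",
    "transformers",
    "torch",
    "torchvision",
    "xformers",
    "mosaicml",
    "hydra-core",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:15s} {version or 'not installed'}")
```

Run it with the same Python interpreter that `composer` uses, otherwise the versions reported may come from a different environment.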

wangmiaowei commented 1 year ago

WARNING: Ignoring invalid distribution -orch (/root/envs/py39_dl/lib/python3.9/site-packages)
Package Version Editable project location


absl-py 1.4.0 accelerate 0.19.0 addict 2.4.0 aiofiles 23.1.0 aiohttp 3.8.4 aiosignal 1.3.1 albumentations 1.3.0 altair 4.2.2 antlr4-python3-runtime 4.9.3 anyio 3.7.0 appdirs 1.4.4 argcomplete 3.1.1 arrow 1.2.3 asttokens 2.2.1 async-timeout 4.0.2 attrs 23.1.0 azure-core 1.27.1 azure-storage-blob 12.16.0 backcall 0.2.0 backoff 2.2.1 bcrypt 4.0.1 beautifulsoup4 4.12.2 bitsandbytes 0.39.0 blinker 1.6.2 boto3 1.26.156 botocore 1.29.156 braceexpand 0.1.7 Brotli 1.0.9 brotlipy 0.7.0 cachetools 5.3.0 certifi 2023.5.7 cffi 1.15.1 chardet 5.1.0 charset-normalizer 2.0.4 cholespy 0.1.6 circuitbreaker 1.4.0 click 8.1.3 clip 1.0 cmake 3.26.3 colorlog 6.7.0 comm 0.1.3 ConfigArgParse 1.5.3 contourpy 1.0.7 coolname 2.2.0 cos-python-sdk-v5 1.9.24 crcmod 1.7 cryptography 39.0.1 cubvh 0.1.0 cycler 0.11.0 Cython 0.29.34 dash 2.10.0 dash-core-components 2.0.0 dash-html-components 2.0.0 dash-table 5.0.0 datasets 2.12.0 debugpy 1.6.7 decorator 5.1.1 deepspeed 0.9.2 diffusers 0.17.0.dev0 diffusion 0.0.1 /root/programs_wmw/sd_train/diffusion-main dill 0.3.6 docker 6.1.3 docker-pycreds 0.4.0 docopt 0.6.2 dominate 2.7.0 easydict 1.10 einops 0.6.1 entrypoints 0.4 exceptiongroup 1.1.1 executing 1.2.0 fastapi 0.95.2 fastjsonschema 2.17.1 ffmpy 0.3.0 filelock 3.12.0 fire 0.5.0 Flask 1.1.2 fonttools 4.39.4 frozenlist 1.3.3 fsspec 2023.5.0 ftfy 6.1.1 future 0.18.3 gdown 4.7.1 gitdb 4.0.10 GitPython 3.1.31 glfw 2.5.9 google-auth 2.18.0 google-auth-oauthlib 1.0.0 gql 3.4.1 gradio 3.32.0 gradio_client 0.2.5 graphql-core 3.2.3 grpcio 1.54.2 h11 0.14.0 hjson 3.1.0 HTML4Vision 0.4.3 httpcore 0.17.2 httpx 0.24.1 huggingface-hub 0.14.1 hydra-colorlog 1.2.0 hydra-core 1.3.2 idna 3.4 igl 2.2.1 imageio 2.28.1 imageio-ffmpeg 0.4.8 importlib-metadata 6.6.0 importlib-resources 5.12.0 ipykernel 6.23.1 ipython 8.13.2 ipywidgets 8.0.6 isodate 0.6.1 itsdangerous 2.0.1 jedi 0.18.2 Jinja2 3.0.3 jmespath 1.0.1 joblib 1.2.0 jsonschema 4.17.3 jupyter_client 8.2.0 jupyter_core 5.3.0 jupyterlab-widgets 3.0.7 kiwisolver 
1.4.4 kornia 0.6.12 lazy_loader 0.2 lightning-utilities 0.8.0 linkify-it-py 2.0.2 lit 16.0.3 lpips 0.1.4 Markdown 3.4.3 markdown-it-py 2.2.0 MarkupSafe 2.1.2 matplotlib 3.7.1 matplotlib-inline 0.1.6 mdit-py-plugins 0.3.3 mdurl 0.1.2 mosaicml 0.15.0 /root/programs_wmw/pkgs/composer-dev mosaicml-cli 0.4.10 mosaicml-streaming 0.5.1 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 mypy-extensions 1.0.0 nbformat 5.7.0 nest-asyncio 1.5.6 networkx 3.1 ninja 1.11.1 numpy 1.22.3 nvdiffrast 0.3.0 /root/envs/py39_dl/lib/python3.9/site-packages/nvdiffrast-0.3.0-py3.9.egg nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-ml-py3 7.352.0 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 oauthlib 3.2.2 oci 2.104.2 omegaconf 2.3.0 open3d 0.17.0 opencv-python 4.7.0.72 opencv-python-headless 4.7.0.72 orjson 3.8.14 packaging 22.0 pandas 2.0.1 paramiko 3.2.0 parso 0.8.3 pathtools 0.1.2 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.4.0 pip 23.1.2 pipreqs 0.4.13 platformdirs 3.5.1 plotly 5.14.1 prometheus-client 0.8.0 prompt-toolkit 3.0.38 protobuf 3.20.3 psutil 5.9.5 ptyprocess 0.7.0 pudb 2022.1.3 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 12.0.0 pyasn1 0.5.0 pyasn1-modules 0.3.0 pycparser 2.21 pycryptodome 3.17 pydantic 1.10.8 pydeck 0.8.1b0 pyDeprecate 0.3.1 pydub 0.25.1 PyGLM 2.7.0 Pygments 2.15.1 pyk4a 1.5.0 pymeshlab 2022.2.post3 Pympler 1.0.1 PyNaCl 1.5.0 PyOpenGL 3.1.6 pyOpenSSL 23.0.0 pyparsing 3.0.9 pyquaternion 0.9.9 pyre-extensions 0.0.23 pyrsistent 0.19.3 PySocks 1.7.1 python-dateutil 2.8.2 python-multipart 0.0.6 python-snappy 0.6.1 pytorch-lightning 1.4.2 pytorch-ranger 0.1.1 pytz 2023.3 PyWavelets 1.4.1 PyYAML 6.0 pyzmq 25.1.0 qudida 0.0.4 questionary 1.10.0 redis 4.5.5 regex 2023.5.5 requests 2.29.0 requests-oauthlib 1.3.1 resize-right 0.0.2 responses 
0.18.0 rich 13.3.5 rsa 4.9 ruamel.yaml 0.17.32 ruamel.yaml.clib 0.2.7 s3transfer 0.6.1 safetensors 0.3.1 scikit-image 0.20.0 scikit-learn 1.2.2 scipy 1.8.1 semantic-version 2.10.0 sentry-sdk 1.25.1 setproctitle 1.3.2 setuptools 66.0.0 six 1.16.0 smmap 5.0.0 smplx 0.1.28 sniffio 1.3.0 soupsieve 2.4.1 stack-data 0.6.2 starlette 0.27.0 streamlit 1.22.0 sympy 1.12 tabulate 0.9.0 taming-transformers 0.0.1 tenacity 8.2.2 tensorboard 2.13.0 tensorboard-data-server 0.7.0 tensorboardX 2.6 termcolor 2.3.0 test-tube 0.7.5 threadpoolctl 3.1.0 tifffile 2023.4.12 tokenizers 0.13.3 toml 0.10.2 toolz 0.12.0 torch 1.13.1 torch-ema 0.3 torch-fidelity 0.3.0 torch-optimizer 0.3.0 torch-scatter 2.1.1+pt113cu117 torch-sparse 0.6.17+pt113cu117 torchaudio 0.13.1 torchdata 0.6.1 torchmetrics 0.11.4 torchtext 0.14.1 torchvision 0.14.1 tornado 6.3.2 tqdm 4.65.0 traitlets 5.9.0 transformers 4.29.1 trimesh 3.21.6 triton 2.0.0 typing_extensions 4.5.0 typing-inspect 0.8.0 tzdata 2023.3 tzlocal 5.0.1 uc-micro-py 1.0.2 urllib3 1.26.15 urwid 2.1.2 urwid-readline 0.13 uvicorn 0.22.0 validators 0.20.0 wandb 0.15.4 watchdog 3.0.0 wcwidth 0.2.6 webdataset 0.2.48 websocket-client 1.6.0 websockets 10.4 Werkzeug 1.0.1 wheel 0.38.4 widgetsnbextension 4.0.7 xatlas 0.0.7 xformers 0.0.16 xmltodict 0.13.0 xxhash 3.2.0 yarg 0.1.9 yarl 1.9.2 zipp 3.15.0 zstd 1.5.5.1

wangmiaowei commented 1 year ago

By the way:
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
[2023-06-21 16:23:25,908][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,988][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
    Python 3.9.16 (you have 3.9.16)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
ERROR:composer.cli.launcher:Rank 6 crashed with exit code 1. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 1594036) exited with code 143
Global rank 1 (PID 1594037) exited with code 143
----------Begin global rank 1 STDOUT----------
[2023-06-21 16:23:25,799][composer.utils.reproducibility][INFO] - Setting seed to 17
[2023-06-21 16:23:25,863][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1)
    Python 3.9.16 (you have 3.9.16)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 1 STDOUT---------- ----------Begin global rank 1 STDERR---------- /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}")

----------End global rank 1 STDERR---------- Global rank 2 (PID 1594038) exited with code 143 ----------Begin global rank 2 STDOUT---------- [2023-06-21 16:23:26,238][composer.utils.reproducibility][INFO] - Setting seed to 17 [2023-06-21 16:23:26,295][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1) Python 3.9.16 (you have 3.9.16) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 2 STDOUT---------- ----------Begin global rank 2 STDERR---------- /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}")

----------End global rank 2 STDERR---------- Global rank 3 (PID 1594039) exited with code 143 ----------Begin global rank 3 STDOUT---------- [2023-06-21 16:23:26,143][composer.utils.reproducibility][INFO] - Setting seed to 17 [2023-06-21 16:23:26,194][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1) Python 3.9.16 (you have 3.9.16) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 3 STDOUT---------- ----------Begin global rank 3 STDERR---------- /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}")

----------End global rank 3 STDERR---------- Global rank 4 (PID 1594040) exited with code 143 ----------Begin global rank 4 STDOUT---------- [2023-06-21 16:23:26,017][composer.utils.reproducibility][INFO] - Setting seed to 17 [2023-06-21 16:23:26,068][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1) Python 3.9.16 (you have 3.9.16) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 4 STDOUT---------- ----------Begin global rank 4 STDERR---------- /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}")

----------End global rank 4 STDERR---------- Global rank 5 (PID 1594041) exited with code 143 ----------Begin global rank 5 STDOUT---------- [2023-06-21 16:23:25,865][composer.utils.reproducibility][INFO] - Setting seed to 17 [2023-06-21 16:23:25,922][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1) Python 3.9.16 (you have 3.9.16) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 5 STDOUT---------- ----------Begin global rank 5 STDERR---------- /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}")

----------End global rank 5 STDERR---------- Global rank 6 (PID 1594042) exited with code 1 ----------Begin global rank 6 STDOUT---------- [2023-06-21 16:23:25,761][composer.utils.reproducibility][INFO] - Setting seed to 17 [2023-06-21 16:23:25,863][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1) Python 3.9.16 (you have 3.9.16) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 6 STDOUT----------
----------Begin global rank 6 STDERR----------
/root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Error executing job with overrides: []
Error in call to target 'diffusion.models.models.stable_diffusion_2':
RuntimeError('CUDA error: invalid device ordinal\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
full_key: model

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

----------End global rank 6 STDERR---------- Global rank 7 (PID 1594043) exited with code 143 ----------Begin global rank 7 STDOUT---------- [2023-06-21 16:23:25,912][composer.utils.reproducibility][INFO] - Setting seed to 17 [2023-06-21 16:23:25,994][xformers][WARNING] - WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 1.13.1+cu117 with CUDA 1107 (you have 1.13.1) Python 3.9.16 (you have 3.9.16) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details

----------End global rank 7 STDOUT---------- ----------Begin global rank 7 STDERR---------- /root/envs/py39_dl/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory warn(f"Failed to load image Python extension: {e}")

----------End global rank 7 STDERR---------- ERROR:composer.cli.launcher:Global rank 0 (PID 1594036) exited with code 143
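
The `CUDA error: invalid device ordinal` on rank 6 is commonly raised when a process selects a GPU index that is not visible to it. Nothing in the log confirms that this is the cause here, so the following is only a diagnostic sketch. It compares the number of visible GPUs with the eight ranks (0 to 7) that appear in the log. `ranks_without_device` is a hypothetical helper, and the `torch` import is assumed to work in the training environment.

```python
import os


def ranks_without_device(num_ranks, num_visible_gpus):
    """List the local ranks that would have no GPU index to bind to."""
    return [rank for rank in range(num_ranks) if rank >= num_visible_gpus]


if __name__ == "__main__":
    try:
        # Assumption: torch is installed in the environment under test.
        import torch

        gpus = torch.cuda.device_count()
    except ImportError:
        gpus = 0
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
    print("visible GPUs:", gpus)
    # The log above shows global ranks 0 through 7, so 8 processes were launched.
    print("ranks with no GPU:", ranks_without_device(8, gpus))
```

An empty list on the last line would point away from a rank/GPU count mismatch and toward some other setup problem.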

Landanjs commented 1 year ago

Hello, apologies for the delay. This looks like a setup issue, but I can't pinpoint exactly what is going wrong.

A few questions: