Training not working because of RuntimeError

AI-Guru commented 2 years ago

Describe the bug

Hi,

unfortunately, I could not run the training due to an error in the scheduler.

Below you will find the error log.

Best, Tristan

Reproduction

accelerate launch train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --resolution=64 \
  --output_dir="ddpm-ema-flowers-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no

Logs

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.11) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Using custom data configuration huggan--flowers-102-categories-2ab3d0588f2a8da7
Reusing dataset parquet (/home/hordak/.cache/huggingface/datasets/huggan___parquet/huggan--flowers-102-categories-2ab3d0588f2a8da7/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Epoch 0:   0%|          | 0/512 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_unconditional.py", line 242, in <module>
    main(args)
  File "train_unconditional.py", line 134, in main
    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
  File "/home/hordak/.local/lib/python3.8/site-packages/diffusers/schedulers/scheduling_ddpm.py", line 183, in add_noise
    sqrt_alpha_prod = self.alphas_cumprod[timesteps] ** 0.5
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Epoch 0:   0%|          | 0/512 [00:00<?, ?it/s]
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.11) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "/home/hordak/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/hordak/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/hordak/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/hordak/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_unconditional.py', '--dataset_name=huggan/flowers-102-categories', '--resolution=64', '--output_dir=ddpm-ema-flowers-64', '--train_batch_size=16', '--num_epochs=100', '--gradient_accumulation_steps=1', '--learning_rate=1e-4', '--lr_warmup_steps=500', '--mixed_precision=no']' returned non-zero exit status 1.


System Info

diffusers 0.2.2
Python 3.8.10
CUDA Version: 11.7

pcuenca commented 2 years ago

Hi Tristan!

Unfortunately, I'm unable to reproduce this problem. Do you remember whether you installed diffusers using pip or downloaded a specific version from GitHub? If you don't remember, the output of pip freeze would be helpful too. Also, could you try launching training without accelerate, just to see whether it's a factor? Use the same command you pasted above, replacing accelerate launch with python:

python train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --resolution=64 \
  --output_dir="ddpm-ema-flowers-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no

AI-Guru commented 2 years ago

Thanks for your time!

Looking at the error message, I get the impression that it could be a CUDA vs. CPU issue.
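
To illustrate that reading of the error message: it suggests that the index tensor (timesteps) is on a different device than the tensor being indexed (alphas_cumprod, on the CPU). Below is a minimal sketch of the general pattern of moving the indices to the indexed tensor's device before the lookup. The helper name index_on_table_device and the stand-in tensors are hypothetical, chosen only for illustration; this is not the code from scheduling_ddpm.py and not necessarily how the library handles it.

```python
import torch


def index_on_table_device(table: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """Index `table` after moving `indices` to the device `table` lives on.

    Hypothetical helper for illustration; not part of diffusers.
    """
    return table[indices.to(table.device)]


# Stand-in for the scheduler's lookup table (assumed values, kept on the CPU).
alphas_cumprod = torch.linspace(0.99, 0.01, steps=1000)

# Stand-in for the sampled timesteps; placed on the GPU when one is available,
# which is the situation the error message describes.
device = "cuda" if torch.cuda.is_available() else "cpu"
timesteps = torch.randint(0, 1000, (16,), device=device)

# The lookup works regardless of where `timesteps` was created.
sqrt_alpha_prod = index_on_table_device(alphas_cumprod, timesteps) ** 0.5
print(sqrt_alpha_prod.shape, sqrt_alpha_prod.device)
```

The result stays on the lookup table's device, so in a real training loop it would still have to be moved to the device of the images before being combined with them.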

I ran the script with python. The error is the same.

I uninstalled diffusers and reinstalled it like this:

pip install diffusers[training]

This did not help.

Here is the pip freeze:

absl-py==1.2.0
accelerate==0.12.0
aiohttp==3.8.1
aiosignal==1.2.0
appdirs==1.4.4
apt-clone==0.2.1
apturl==0.5.2
asttokens==2.0.8
astunparse==1.6.3
async-timeout==4.0.2
attrs==22.1.0
audioread==3.0.0
Automat==20.2.0
Babel==2.10.3
backcall==0.2.0
blinker==1.4
bokeh==2.4.3
Brlapi==0.7.0
cachetools==5.2.0
certifi==2019.11.28
cffi==1.15.1
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
colorama==0.4.3
command-not-found==0.3
constantly==15.1.0
cryptography==2.8
cssselect==1.1.0
cupshelpers==1.0
cycler==0.11.0
datasets==2.4.0
dbus-python==1.2.16
decorator==5.1.1
deepspeed==0.7.0
defer==1.0.6
diffusers==0.2.2
dill==0.3.5.1
distro==1.4.0
distro-info===0.23ubuntu1
dnspython==2.2.1
email-validator==1.2.1
entrypoints==0.3
etils==0.7.1
executing==0.10.0
filelock==3.8.0
Flask==2.2.2
Flask-BabelEx==0.9.4
Flask-Login==0.6.2
Flask-Mail==0.9.1
Flask-Principal==0.4.0
Flask-Security==3.0.0
Flask-WTF==1.0.1
flatbuffers==1.12
fonttools==4.36.0
frozenlist==1.3.1
fsspec==2022.7.1
gast==0.4.0
google-auth==2.10.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.4
grpcio==1.47.0
h5py==3.7.0
hjson==3.1.0
httplib2==0.14.0
huggingface-hub==0.8.1
hyperlink==21.0.0
idna==2.8
importlib-metadata==4.12.0
importlib-resources==5.9.0
incremental==21.3.0
intervaltree==3.1.0
ipython==8.4.0
itemadapter==0.7.0
itemloaders==1.0.4
itsdangerous==2.1.2
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.1.0
keras==2.9.0
Keras-Preprocessing==1.1.2
keyring==18.0.1
kiwisolver==1.4.4
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
libclang==14.0.6
librosa==0.9.2
llvmlite==0.39.0
louis==3.12.0
lxml==4.9.1
macaroonbakery==1.3.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
mido==1.2.10
modelcards==0.1.6
multidict==6.0.2
multiprocess==0.70.13
netifaces==0.10.4
ninja==1.10.2.3
note-seq==0.0.3
numba==0.56.0
numpy==1.20.3
oauthlib==3.1.0
olefile==0.46
opt-einsum==3.3.0
packaging==21.3
PAM==0.4.2
pandas==1.4.3
parsel==1.6.0
parso==0.8.3
passlib==1.7.4
pexpect==4.6.0
pickleshare==0.7.5
Pillow==9.2.0
pooch==1.6.0
pretty-midi==0.2.9
promise==2.3
prompt-toolkit==3.0.30
Protego==0.2.1
protobuf==3.19.4
psutil==5.9.1
pure-eval==0.2.2
py-cpuinfo==8.0.0
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycairo==1.16.2
pycparser==2.21
pycups==1.9.73
pydantic==1.9.2
PyDispatcher==2.0.5
pydub==0.25.1
pyFluidSynth==1.3.1
Pygments==2.13.0
PyGObject==3.36.0
PyICU==2.4.2
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==22.0.0
pyparsing==3.0.9
pyRFC3339==1.1
python-apt==2.0.0+ubuntu0.20.4.7
python-dateutil==2.8.2
python-debian===0.1.36ubuntu1
pytz==2022.2.1
pyxdg==0.26
PyYAML==5.3.1
queuelib==1.6.2
regex==2022.8.17
reportlab==3.5.34
requests==2.22.0
requests-file==1.5.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
resampy==0.4.0
responses==0.18.0
rsa==4.9
scikit-learn==1.1.2
scipy==1.9.0
Scrapy==2.6.2
screen-resolution-extra==0.0.0
SecretStorage==2.3.1
service-identity==21.1.0
simplejson==3.16.0
six==1.14.0
sortedcontainers==2.4.0
SoundFile==0.10.3.post1
speaklater==1.3
ssh-import-id==5.10
stack-data==0.4.0
svgwrite==1.4.3
systemd-python==234
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.1
tensorflow-datasets==4.6.0
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.26.0
tensorflow-metadata==1.9.0
termcolor==1.1.0
threadpoolctl==3.1.0
tldextract==3.3.1
tokenizers==0.12.1
toml==0.10.2
torch==1.13.0.dev20220819+cu116
torchaudio==0.12.1
torchvision==0.13.1
tornado==6.2
tqdm==4.64.0
traitlets==5.3.0
transformers==4.21.1
Twisted==22.4.0
typing-extensions==4.3.0
ubuntu-advantage-tools==27.9
ubuntu-drivers-common==0.0.0
ufw==0.36
unattended-upgrades==0.1
urllib3==1.26.11
w3lib==2.0.1
wadllib==1.3.3
wcwidth==0.2.5
Werkzeug==2.2.2
wrapt==1.14.1
WTForms==3.0.1
xkit==0.0.0
xxhash==3.0.0
yarl==1.8.1
zipp==3.8.1
zope.interface==5.4.0

pcuenca commented 2 years ago

Hi Tristan,

It looks like you are using a nightly build of PyTorch instead of a stable release. I installed one on my system and got the same error as you. However, uninstalling it and reinstalling the stable version worked fine for me. Unless you need some new features of the nightly version, I recommend you do the same.

You can get the install command for your system from https://pytorch.org/get-started/locally/. For reference, this is the one I used in my virtual environment:

pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
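
For anyone checking their own environment, a nightly build can usually be spotted from the version string, as with torch==1.13.0.dev20220819+cu116 in the pip freeze above. The following is a rough sketch using only the standard library; the function name is_prerelease_build is hypothetical, and the pattern only covers the common dev, alpha, beta, and release-candidate suffixes.

```python
import re


def is_prerelease_build(version: str) -> bool:
    """Return True if a version string looks like a dev/nightly or pre-release build.

    Hypothetical helper for illustration only.
    """
    # Drop the local build tag, e.g. "+cu116", before inspecting the suffix.
    public = version.split("+", 1)[0]
    return re.search(r"(\.dev\d*|a\d+|b\d+|rc\d+)$", public) is not None


# Version strings taken from the pip freeze in this thread.
print(is_prerelease_build("1.13.0.dev20220819+cu116"))  # True
print(is_prerelease_build("0.13.1"))                    # False
```

In a live environment the same check could be applied to torch.__version__.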

AI-Guru commented 2 years ago

Thanks a lot! That really did the trick! You are the best!
