invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

Textual Inversion Training on M1 (works!) #517

Closed tmm1 closed 1 year ago

tmm1 commented 1 year ago

WIP HERE: https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1


I started experimenting with running main.py on M1 and wanted to document some immediate issues.

Looks like we need a newer pytorch-lightning for MPS. Currently using 1.6.5 but latest is 1.7.5

However bumping it causes this error:

AttributeError: module 'pytorch_lightning.loggers' has no attribute 'TestTubeLogger'. Did you mean: 'NeptuneLogger'?

which is because TestTubeLogger was deprecated: https://github.com/Lightning-AI/lightning/issues/13958#issuecomment-1200780456
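
(A minimal workaround sketch, assuming the logger is only needed for scalar metrics; the logdir and trainer_kwargs names are the ones main.py already uses.)

```
from pytorch_lightning.loggers import CSVLogger

# Sketch: replace the removed TestTubeLogger with CSVLogger.
# "testtube" is kept as the sub-directory name so existing log paths still line up.
trainer_kwargs["logger"] = CSVLogger(save_dir=logdir, name="testtube")
```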

lstein commented 1 year ago

I started working with the training functionality last night as well and ran into problems on CUDA. The textual inversion modifications to ddpm.py seem to have adversely affected vanilla training and we'll have to do a careful comparison with the original CompViz implementation in order to isolate the conflicts.

@tmm1, have you tried main.py on M1 using any of the other (multitudinous) forks? If so, any success?

tmm1 commented 1 year ago

If there's a fork that advertises M1 training support I would be happy to try it. I have not seen one, but I have not looked much either. My understanding was that most of the M1 work was happening here.

lstein commented 1 year ago

@Any-Winter-4079, when you've finished the latest round of code tweaking, could you have a look at training? It seems to be messed up on M1.

tmm1 commented 1 year ago

I made some progress today, and was able to get through all the setup and start training, with work being dispatched to the mps backend: https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1

Currently stuck here:

...
Global seed set to 23
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Summoning checkpoint.
Traceback (most recent call last):
  File "/Users/tmm1/code/stable-diffusion/./main.py", line 946, in <module>
    trainer.fit(model, data)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 161, in setup
    self._share_information_to_prevent_deadlock()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 396, in _share_information_to_prevent_deadlock
    self._share_pids()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 414, in _share_pids
    pids = self.all_gather(torch.tensor(os.getpid(), device=self.root_device))
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/parallel.py", line 113, in all_gather
    return all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 219, in all_gather_ddp_if_available
    return AllGatherGrad.apply(tensor, group)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/utilities/distributed.py", line 187, in forward
    torch.distributed.all_gather(gathered_tensor, tensor, group=group)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2070, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps

I'm looking to see if there's a way to turn off the distributed_backend=gloo
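
(For what it's worth, a minimal sketch of what "turning it off" could look like, assuming pytorch-lightning >= 1.7: build the Trainer for a single MPS device with no distributed strategy, so the gloo process group never gets initialized. In main.py this would mean dropping the ddp strategy from the trainer config rather than constructing a Trainer directly.)

```
from pytorch_lightning import Trainer

# Sketch: single MPS device, no DDP strategy, so no gloo process group is created.
trainer = Trainer(accelerator="mps", devices=1)
```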

tmm1 commented 1 year ago

Made some more progress by changing strategy from ddp to dp (https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html)

However, it seems the ImageLogger that was using testtube does something important, and switching to dp + CSVLogger is not achieving the same result.

TypeError: LatentDiffusion.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'

EDIT: Found solution in https://github.com/Lightning-AI/lightning/issues/10315
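
(I can't speak for the linked thread, but the shape of the fix is presumably just to stop requiring the extra argument in the ddpm.py hook, since Lightning 1.7 no longer passes it; a sketch, not necessarily the exact patch:)

```
# ldm/models/diffusion/ddpm.py -- make dataloader_idx optional so the hook
# works whether or not Lightning passes it (it no longer does in 1.7.x).
def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None):
    ...
```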

tmm1 commented 1 year ago

It is training!

Epoch 0: 11%|█▎ | 44/404 [03:59<32:37, 5.44s/it, loss=0.0784, v_num=0, train/loss_simple_step=0.00508, train/loss_vlb_step=2.81e-5, train/loss_step=0.00508, global_step=43.00]

I recall others saying training was broken on CUDA too in this fork, though, so I'm not sure if this is actually working or just appearing to. But a lot of the blockers are solved and we can get into the guts of the implementation now.

EDIT: I am not seeing any of the warnings mentioned on the CUDA thread (related to batch_size)

tmm1 commented 1 year ago

Died at the end.

Everything turned to nan at some point, I don't know if that's a bad sign.

I will try to make outputs optional and exit training early to see if it works.

Epoch 0: 100%|████████████████████| 404/404 [27:34<00:00,  4.10s/it, loss=nan, v_num=0, train/loss_simple_step=nan.0, train/loss_vlb_step=nan.0, train/loss_step=nan.0, global_step=399.0]

Summoning checkpoint.
Traceback (most recent call last):
  File "/Users/tmm1/code/stable-diffusion/./main.py", line 946, in <module>
    trainer.fit(model, data)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
TypeError: CUDACallback.on_train_epoch_end() missing 1 required positional argument: 'outputs'

tmm1 commented 1 year ago

Hmm got past last error but a new one now:

Epoch 0: 100%|███████████| 404/404 [16:24<00:00,  2.44s/it, loss=0.0835, v_num=0, train/loss_simple_step=0.00323, train/loss_vlb_step=1.88e-5, train/loss_step=0.00323, global_step=399.0/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:2075: LightningDeprecationWarning: `Trainer.root_gpu` is deprecated in v1.6 and will be removed in v1.8. Please use `Trainer.strategy.root_device.index` instead.
  rank_zero_deprecation(
Summoning checkpoint.
Traceback (most recent call last):
  File "/Users/tmm1/code/stable-diffusion/./main.py", line 946, in <module>
    trainer.fit(model, data)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/Users/tmm1/code/stable-diffusion/main.py", line 558, in on_train_epoch_end
    torch.cuda.synchronize(trainer.root_gpu)
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 494, in synchronize
    _lazy_init()
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 211, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
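
(The callback that dies here is CUDACallback in main.py, which unconditionally calls into torch.cuda. A device-agnostic guard, sketched under the assumption that trainer.strategy.root_device is the replacement the deprecation warning above suggests, could look roughly like this:)

```
# main.py, CUDACallback (sketch) -- only touch torch.cuda when CUDA actually exists.
def on_train_epoch_end(self, trainer, pl_module, outputs=None):
    if torch.cuda.is_available():
        torch.cuda.synchronize(trainer.strategy.root_device.index)
        max_memory = torch.cuda.max_memory_allocated(trainer.strategy.root_device.index) / 2**20
    else:
        max_memory = 0  # no equivalent peak-memory query for MPS here
    ...
```
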
tmm1 commented 1 year ago

Okay, now it's able to move onto epoch 1!

Average Epoch time: 472.13 seconds
Average Peak memory 0.00MiB
Epoch 1:  16%|█▌        | 63/404 [01:23<07:33,  1.33s/it, loss=0.0875, v_num=0, train/loss_simple_step=0.204, train/loss_vlb_step=0.000971, train/loss_step=0.204, global_step=462.0, train/loss_simple_epoch=0.108, train/loss_vlb_epoch=0.00118, train/loss_epoch=0.108]

tmm1 commented 1 year ago

I have a checkpoints/embeddings_gs-1600.pt now but when I try using it, the output images are black :(

tmm1 commented 1 year ago

I started fresh and by epoch 2 everything turns to nan. I think that is causing the black images?

Epoch 2: 26%|▎| 104/404 [02:11<06:20, 1.27s/it, loss=nan, v_num=0, train/loss_simple_step=nan.0, train/loss_vlb_step=nan.0, train/loss_step=nan.

cc @magnusviri @birch-san

Birch-san commented 1 year ago

when I encountered black images with k-diffusion sampler, it was due to this problem (with ±Inf):
https://github.com/pytorch/pytorch/issues/84364

fix was just to detach and clone the tensor:
https://github.com/crowsonkb/k-diffusion/commit/3e976ef2508a7173240a897d2a4f5113124f5029

if you're having NaN (rather than ±Inf), maybe that's unrelated.

I recommend narrowing down which line first introduces NaN. You can use this check to do so:

mycooltensor.isnan().any()
# returns a boolean

tmm1 commented 1 year ago

Thanks @Birch-san! I see this warning at the start of training which may be related.

/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/module.py:555: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). value = torch.tensor(value, device=self.device)

Any-Winter-4079 commented 1 year ago

This looks interesting. I will have a look.

tmm1 commented 1 year ago

Thanks @Any-Winter-4079! You could use the ugly-sonic training samples along with instructions in TEXTUAL_INVERSION.md

I am going to try detect_anomaly=True as recommended on https://github.com/Lightning-AI/lightning/discussions/12137#discussioncomment-2270867
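
(Equivalently, anomaly detection can be switched on at the PyTorch level rather than through the Trainer flag; a minimal sketch:)

```
import torch

# Same effect as detect_anomaly=True on the Trainer: autograd records the forward
# op behind each gradient and reports the first one that produces NaN in backward.
torch.autograd.set_detect_anomaly(True)
```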

tmm1 commented 1 year ago

Caught something:

``` Sanity Checking: 0it [00:00, ?it/s]/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:225: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance. rank_zero_warn( Sanity Checking DataLoader 0: 0%| | 0/2 [00:00 trainer.fit(model, data) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit self._call_and_handle_interrupt( File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt return trainer_fn(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl results = self._run(model, ckpt_path=self.ckpt_path) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run results = self._run_stage() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage return self._run_train() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train self.fit_loop.run() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance self._outputs = self.epoch_loop.run(self._data_fetcher) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance batch_output = self.batch_loop.run(kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance outputs = self.optimizer_loop.run(optimizers, kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position]) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step self.trainer._call_lightning_module_hook( File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook output = fn(*args, **kwargs) File 
"/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1672, in optimizer_step optimizer.step(closure=optimizer_closure) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step return optimizer.step(closure=closure, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 113, in wrapper return func(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/optim/adamw.py", line 119, in step loss = closure() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure closure_result = closure() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__ self._result = self.closure(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 132, in closure step_output = self._step_fn() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 407, in _training_step training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values()) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook output = fn(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/dp.py", line 134, in training_step return self.model(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward return self.module(*inputs, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/overrides/data_parallel.py", line 65, in forward output = super().forward(*inputs, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 79, in forward output = self.module.training_step(*inputs, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/models/diffusion/ddpm.py", line 498, in training_step loss, loss_dict = self.shared_step(batch) File "/Users/tmm1/code/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1253, in shared_step loss = self(x, c) File 
"/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1270, in forward return self.p_losses(x, c, t, *args, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1475, in p_losses model_output = self.apply_model(x_noisy, t, cond) File "/Users/tmm1/code/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1440, in apply_model x_recon = self.model(x_noisy, t, **cond) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/models/diffusion/ddpm.py", line 2148, in forward out = self.diffusion_model(x, t, context=cc) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 811, in forward h = module(h, emb, context) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 88, in forward x = layer(x, context) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 346, in forward x = block(x, context=context) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 296, in forward return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint) File "/Users/tmm1/code/stable-diffusion/ldm/modules/diffusionmodules/util.py", line 157, in checkpoint return func(*inputs) File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 301, in _forward x = self.attn2(self.norm2(x), context=context) + x File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 274, in forward r1 = self.einsum_op(q, k, v, r1) File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 189, in einsum_op_v1 r1 = einsum('b i j, b j d -> b i d', s2, v) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/functional.py", line 360, in einsum return _VF.einsum(equation, operands) # type: ignore[attr-defined] (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1659484612588/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.) Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass Summoning checkpoint. 
Traceback (most recent call last): File "/Users/tmm1/code/stable-diffusion/./main.py", line 947, in trainer.fit(model, data) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit self._call_and_handle_interrupt( File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt return trainer_fn(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl results = self._run(model, ckpt_path=self.ckpt_path) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run results = self._run_stage() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage return self._run_train() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train self.fit_loop.run() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance self._outputs = self.epoch_loop.run(self._data_fetcher) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance batch_output = self.batch_loop.run(kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance outputs = self.optimizer_loop.run(optimizers, kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run self.advance(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position]) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step self.trainer._call_lightning_module_hook( File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook output = fn(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1672, in optimizer_step optimizer.step(closure=optimizer_closure) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs) File 
"/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step return optimizer.step(closure=closure, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 113, in wrapper return func(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/optim/adamw.py", line 119, in step loss = closure() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure closure_result = closure() File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__ self._result = self.closure(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure self._backward_fn(step_output.closure_loss) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook output = fn(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1417, in backward loss.backward(*args, **kwargs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs) File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: Function 'PermuteBackward0' returned nan values in its 0th output. ```

tmm1 commented 1 year ago

This is all very new to me, but if I'm interpreting the output correctly it seems to suggest the gradients from einsum are nan.

Important bits:

UserWarning: Error detected in PermuteBackward0. Traceback of forward call that caused the error:
  File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 301, in _forward
    x = self.attn2(self.norm2(x), context=context) + x
  File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 274, in forward
    r1 = self.einsum_op(q, k, v, r1)
  File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 189, in einsum_op_v1
    r1 = einsum('b i j, b j d -> b i d', s2, v)
...
  File "/opt/homebrew/anaconda3/envs/ldm/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1417, in backward
    loss.backward(*args, **kwargs)
RuntimeError: Function 'PermuteBackward0' returned nan values in its 0th output.

This includes the changes merged into development this morning.

tmm1 commented 1 year ago

I started working with the training functionality last night as well and ran into problems on CUDA. The textual inversion modifications to ddpm.py seem to have adversely affected vanilla training and we'll have to do a careful comparison with the original CompViz implementation in order to isolate the conflicts.

@lstein What sort of problems did you run into on CUDA? I wonder if you can try this and see if any anomalies are detected?

diff --git a/main.py b/main.py
index c45194d..57c8832 100644
--- a/main.py
+++ b/main.py
@@ -864,6 +864,7 @@ if __name__ == '__main__':
         ]
         trainer_kwargs['max_steps'] = trainer_opt.max_steps

+        trainer_opt.detect_anomaly = True
         trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
         trainer.logdir = logdir  ###
tmm1 commented 1 year ago

I switched CrossAttention#forward back to the original implementation, and the same anomaly is detected. So at least it does not seem to be related to the performance tweaks there.

I don't have a CUDA setup to test with, so maybe nan at this step is expected. This happens right away for me, whereas loss was fine for a few hundred steps before so there must be a different anomaly later on.

lstein commented 1 year ago

I'm actually getting a core dump at a step that says "validating". I'm trying to get IT to install gdb on the cluster so that I can do a stack trace, not that it will be very helpful.

Lincoln


Birch-san commented 1 year ago

Maybe this is another opportunity to try replacing einsum with matmul?

https://github.com/Birch-san/stable-diffusion/commit/d2d533dbc3fe2e430a3ab9feaf23aa28a8b8178f

It's like 30% slower, but it might do something different regarding NaN?

Context: https://github.com/huggingface/diffusers/issues/452#issuecomment-1243044775
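
(For context, the substitution in that commit is essentially the following, assuming the usual CompVis CrossAttention shapes where q, k, v are (batch*heads, tokens, dim); a sketch, not the verbatim diff:)

```
# inside CrossAttention.forward
# einsum('b i d, b j d -> b i j', q, k) * self.scale  becomes:
sim = torch.matmul(q, k.transpose(1, 2)) * self.scale
attn = sim.softmax(dim=-1)
# einsum('b i j, b j d -> b i d', attn, v)  becomes:
out = torch.matmul(attn, v)
```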

tmm1 commented 1 year ago

Good idea!

But it just failed in the same way, so either nan is normal or something else bigger is the problem.

  File "/Users/tmm1/code/stable-diffusion/ldm/modules/attention.py", line 255, in forward
    sim = torch.matmul(q, k.transpose(1, 2)) * self.scale
RuntimeError: Function 'TransposeBackward0' returned nan values in its 0th output.

Any-Winter-4079 commented 1 year ago

If it's about replacing einsum, the other day I tried https://github.com/dgasmith/opt_einsum (just to see if it was faster). Not sure if this can be another alternative. This goes way over my head, but I'll try to mess around to see if I can get it to work, even if by trial and error.

Any-Winter-4079 commented 1 year ago

@tmm1 Have you encountered this error? RuntimeError: Placeholder storage has not been allocated on MPS device!

tmm1 commented 1 year ago

Hm no I didn't see that one.

Any-Winter-4079 commented 1 year ago

@tmm1 Have you encountered this error? RuntimeError: Placeholder storage has not been allocated on MPS device!

Well, for anyone that encounters the issue, it's fixed with pip install pytorch-lightning==1.7.5 (which you mentioned in the first comment). I naively tried to get by without updating my environment, but nope, it's needed.

Birch-san commented 1 year ago

If it's about replacing einsum, the other day I tried https://github.com/dgasmith/opt_einsum (just to see if it was faster).

wow, that's cool. yeah, it's just a drop-in replacement:
https://github.com/CompVis/stable-diffusion/commit/b7357a75a8f83ed3543b70e725be168720c8cf39
I just did pip install opt_einsum and made that source change.
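
(The swap itself is tiny; assuming opt_einsum is installed, it is essentially:)

```
# from torch import einsum
from opt_einsum import contract as einsum  # same (equation, *operands) call signature
```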

unfortunately I'm finding opt_einsum to be about 30x slower on MPS.

8 steps inference:
opt_einsum took 158.9 secs at ~20s/it,
regular einsum took 10.4 secs at ~1.25s/it.

Any-Winter-4079 commented 1 year ago

Yes, it's very slow. There must be some underlying problem, because it's not normal (especially given they report being very fast).

Screenshot 2022-09-14 at 02 32 17

Any-Winter-4079 commented 1 year ago

Messing around, I see that v seems to have some nan values. I tried converting them to numbers (https://pytorch.org/docs/stable/generated/torch.nan_to_num.html#torch.nan_to_num), but I can't seem to always correct it. I'm using torch.nan_to_num before the r1 einsum (r1 = einsum('b i j, b j d -> b i d', s2, v)):

print(torch.any(v.isnan()))
v = torch.nan_to_num(v)
print(torch.any(v.isnan()))
print('---')

Sometimes it seems to change nan to num

Screenshot 2022-09-14 at 12 04 02

But other times it seems to stay at nan (?)

Screenshot 2022-09-14 at 12 04 26

Update: tried this other way, which seems to work better to remove nan (although it always converts to 0):

v = torch.where(torch.isnan(v), torch.zeros_like(v), v)
s2 = torch.where(torch.isnan(s2), torch.zeros_like(s2), s2)

but still get RuntimeError: Function 'PermuteBackward0' returned nan values in its 0th output


Note: not that changing nan to num is going to help much. There is probably an underlying issue that makes the values grow too much.

Any-Winter-4079 commented 1 year ago

Okay, so the einsum comes from python3.9/site-packages/torch/functional.py. I tried adding a print right before return _VF.einsum(equation, operands) to check the operands for nan (see the snippet below), and the last print before it crashes is C tensor(False, device='mps:0') tensor(False, device='mps:0'). So I understand the operands are not nan, and they may turn into nan after the operation (?)
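
(The instrumentation, as described above:)

```
# Added in torch/functional.py, just above the final return of einsum():
print('C', torch.any(operands[0].isnan()), torch.any(operands[1].isnan()))
return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```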

Birch-san commented 1 year ago

hmm I thought the problem occurs during the backward pass, but the einsum is part of the forward pass? am I misunderstanding, or is there work we need to do to get eyes on the operations that are run in the backward pass?

also, NaN during training is a common problem in F16 precision, but I thought F32 was pretty safe from it. are we definitely using F32? I guess we probably are, since I think that's the only float type MPS supports?

is it possible that NaN is caused by large gradients, so could occur when your learning rate is too high?

Birch-san commented 1 year ago

sometimes different optimizers and learning rates are recommended for fine-tuning..

Any-Winter-4079 commented 1 year ago

Well, the error message is very long, but at some point it says:

  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/ldm/modules/attention.py", line 308, in _forward
    x = self.attn2(self.norm2(x), context=context) + x
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/ldm/modules/attention.py", line 281, in forward
    r1 = self.einsum_op(q, k, v, r1)
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/ldm/modules/attention.py", line 210, in einsum_op_mps_v1
    r1 = self.einsum_op_compvis(q, k, v, r1)
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/ldm/modules/attention.py", line 202, in einsum_op_compvis
    r1 = einsum('b i j, b j d -> b i d', s2, v)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/functional.py", line 368, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
 (Triggered internally at  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1659484611838/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

That is the end of the first part of the error output. The console then stops printing errors for about 30 seconds, and then prints another long output, which this time ends with:

File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1417, in backward
    loss.backward(*args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'PermuteBackward0' returned nan values in its 0th output.

In the first part of the output, einsum is referenced, and in the second part of the output, nan is referenced. I think that is why @tmm1 thought it might have to do with that, so I was trying to explore whether nan values were passing through einsum.

What I've done is go to .../lib/python3.9/site-packages/torch/functional.py, and in particular to the referenced line return _VF.einsum(equation, operands), and I added the following above it:

print(torch.any(operands[0].isnan()), torch.any(operands[1].isnan()))
print(torch.any(_VF.einsum(equation, operands).isnan()))
print('___')

to see if any of the operands contained nan, as well as to see if the operation was returning some nan. The second operand sometimes contains nan, but the result always seems to not have nan:

tensor(False, device='mps:0') tensor(True, device='mps:0')
tensor(False, device='mps:0')
___

Before, I also tried adding a line to remove the nan from the operands and even from the result of the operation (although I've since seen that the output always seems to not contain nan).

In any case, the error persists, so your suggestion to look into the backward pass seems a very good idea. However, I don't know much about what goes on in that backward pass.

Birch-san commented 1 year ago

maybe worth putting some NaN checks on the tensors in the calls involved in the backwards pass? like around model.backward and around loss.backward.
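
(A sketch of what such checks could look like, assuming we patch around the model.backward call in precision_plugin.py; closure_loss, model, optimizer and optimizer_idx are the names already used there, and check_grad is a hypothetical helper. Illustrative only: in real code you would register these hooks once, not on every step.)

```
assert torch.isfinite(closure_loss).all(), "loss already NaN/Inf before backward"

def check_grad(grad):
    # Fires during the backward pass, once per parameter whose gradient is computed.
    if not torch.isfinite(grad).all():
        print("non-finite gradient detected, shape:", tuple(grad.shape))
    return grad

for p in model.parameters():
    if p.requires_grad:
        p.register_hook(check_grad)

model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
```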

Any-Winter-4079 commented 1 year ago

maybe worth putting some NaN checks on the tensors in the calls involved in the backwards pass? like around model.backward and around loss.backward.

model.backward seems to be in .../lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py and loss.backward in .../lib/python3.9/site-packages/pytorch_lightning/core/module.py so I'll have a look at that.

Any-Winter-4079 commented 1 year ago

Okay, so model.backward is failing. If I add a try/except there:

        # do backward pass
        if model is not None and isinstance(model, pl.LightningModule):
            print('closure_loss', closure_loss)
            print('optimizer', optimizer)
            print('optimizer_idx', optimizer_idx)
            try:
                model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
            except:
                print('model.backward error')
        else:
            self._run_backward(closure_loss, *args, **kwargs)

It seems to continue in the epoch without the nan error.

Epoch 0:   1%|▌                                         | 8/606 [00:30<38:09,  3.83s/it, loss=0.127, v_num=0, train/loss_simple_step=0.0284, train/loss_vlb_step=0.000119, train/loss_step=0.0284, global_step=7.000]

Epoch 0:   1%|▋                                           | 9/606 [00:32<36:06,  3.63s/it, loss=0.132, v_num=0, train/loss_simple_step=0.173, train/loss_vlb_step=0.000674, train/loss_step=0.173, global_step=8.000]

And it prints

closure_loss tensor(0.0284, device='mps:0', grad_fn=<DivBackward0>)
optimizer AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.005
    maximize: False
    weight_decay: 0.01
)
optimizer_idx 0
model.backward error

so it definitely is going through the except.

The question is how to debug model.backward (e.g., how do I look for nan there?).


Update: the model seems to come from model: "pl.LightningModule", which comes from import pytorch_lightning as pl. So I guess that may be a point to start: https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html

Birch-san commented 1 year ago

I don't think LightningModule will be relevant. it's just a generic base class on which to build a neural network module.

I think backward() will be implemented by the lightning framework, but its success will entirely depend on what parameters it receives from us. I think it uses the loss that we return from forward(), to backpropagate gradients.

I think you need to look at LatentDiffusion#forward, and/or the base class from which it inherits, DDPM#forward(), and/or the wrapper DiffusionWrapper#forward.

each forward() function delegates the heavy lifting to a p_losses(), which will invoke self.model() or self.apply_model(). that's the forward bit, which I suspect is fine (because inference is fine).

after that, p_losses invokes self.get_loss(). I suspect that somewhere inside get_loss(), we will get our first NaN.

there are two loss_types which self.get_loss() can use: l1 and l2.

(screenshot: the get_loss() implementation in ldm/models/diffusion/ddpm.py, showing the l1 and l2 branches)
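
(For reference, the get_loss() in the CompVis ddpm.py is roughly the following; paraphrased from memory, so check your local copy:)

```
def get_loss(self, pred, target, mean=True):
    if self.loss_type == 'l1':
        loss = (target - pred).abs()
        if mean:
            loss = loss.mean()
    elif self.loss_type == 'l2':
        if mean:
            loss = torch.nn.functional.mse_loss(target, pred)
        else:
            loss = torch.nn.functional.mse_loss(target, pred, reduction='none')
    else:
        raise NotImplementedError("unknown loss type")
    return loss
```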

I suspect our NaN will come from inside there. and if that's what happens, then I think the ways to fix it are:

Any-Winter-4079 commented 1 year ago

By default, that function seems to use l2 && mean=False. Adding print(torch.any(loss.isnan())) right before return loss shows tensor(False, device='mps:0') up to the moment it crashes, so the nan doesn't seem to come from there.

I've also added a try-except, but the except never seems to get called.

Edit: these results were with trainer_opt.detect_anomaly = True. After setting it to False, it doesn't immediately crash, and after running for a while there are indeed some nan in get_loss.

Any-Winter-4079 commented 1 year ago

It's training (setting trainer_opt.detect_anomaly = False), but I guess there'll be a problem. Here's hoping the anomaly was the nan's in v (in einsum) that were removed after _VF.einsum (but ha, fat chance!). I'll update if I make some progress.

Any-Winter-4079 commented 1 year ago

Okay, so here are some new discoveries. After some time, loss turns to nan, together with train/loss_simple_step, train/loss_vlb_step and train/loss_step.

Screenshot 2022-09-15 at 01 20 23

This is the moment where it turns to nan, captured with

print('min', torch.min(loss))
print('max', torch.max(loss))
print('nan in loss', torch.any(loss.isnan()))

in def get_loss(self, pred, target, mean=True) in ddpm.py

Screenshot 2022-09-15 at 01 32 27

tmm1 commented 1 year ago

@Any-Winter-4079 Nice. I only added detect_anomaly to try to find the issue, but I think it is pointing at some red herring.

So your idea of trying to catch the nan loss is much better!

I was seeing the same problem: everything changes to nan, and then by the time a model is created it will only output black images. The same can be seen in the training logs; all the test images are black.

Any-Winter-4079 commented 1 year ago

The problem seems to come from pred, which goes to -inf/inf and then introduces nan in the loss via def get_loss(self, pred, target, mean=True)

Screenshot 2022-09-15 at 12 10 36

Also, a bit of a weird behaviour, reporting that min is inf and max is -inf :)


And get_loss gets called from def p_losses(self, x_start, cond, t, noise=None), so pred must come from model_output in that function.

Any-Winter-4079 commented 1 year ago

model_output itself comes from model_output = self.apply_model(x_noisy, t, cond), and looking at the params (x_noisy, t, cond):

Screenshot 2022-09-15 at 12 37 00

cond seems to be what goes to inf/-inf, and that is propagated. cond -> model_output -> loss

And where does cond come from? cond is passed to def p_losses(self, x_start, cond, t, noise=None) which is called here:

Screenshot 2022-09-15 at 12 49 35

EliasOenal commented 1 year ago

I was seeing the same problem, everything changes to nan and then by the time a model is created it will only output black images. Same can be seen in the training logs all the test images are black.

This could be completely unrelated, but I noticed that on my M1 Max Mac any regular inference step seems to have a chance to sporadically result in a black image. I haven't found the time to debug this, but I've seen outputs generate fine for the first ten steps and then suddenly turn black in the succeeding steps. If it is similar non-deterministic behavior in this case as well, then maybe just checking the result and calling the function again could be viable as a temporary workaround until the underlying issue is identified.

Any-Winter-4079 commented 1 year ago

I've experienced the same too. Not sure about the cause, but I think @Birch-san had debugged a problem with black images. I just haven't had time to look at it yet.

For all I know (which is not much), it could be related. I guess the non-determinism comes from randomly chosen values and some bug in the code. If we don't know how to fix it, trying that workaround may be a good idea.

Birch-san commented 1 year ago

the black images problem that I solved was in k-diffusion, where there was a 100% chance to return a black image (±inf):
https://github.com/lstein/stable-diffusion/issues/517#issuecomment-1245074288

I am aware that the DDIM and PLMS samplers seem to have a random chance to go black, and this chance is worsened by running it for more steps. I never investigated or fixed that, because the k-diffusion samplers are better anyway.
I didn't check whether k-diffusion samplers experience the same problem at higher step counts (because they perform well at low step counts anyway).

if it only happens sporadically… you could detect it and somehow throw away that training step / refrain from computing the loss or backpropagating the gradients into the model weights.
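
(One sketch of the "throw away that training step" idea, using Lightning's convention that returning None from training_step skips the optimization for that batch; hypothetical, not what the repo does today:)

```
def training_step(self, batch, batch_idx):
    loss, loss_dict = self.shared_step(batch)
    if not torch.isfinite(loss):
        return None  # discard this step instead of backpropagating NaN/Inf
    self.log_dict(loss_dict, prog_bar=True, logger=True)
    return loss
```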

but it sounds like it could be beneficial to approach this by trying to catch the sporadic problem during inference instead of training.

EliasOenal commented 1 year ago

PyTorch has several tickets on similar issues, apparently this is known and WIP: #84138 #81185

Any-Winter-4079 commented 1 year ago

Well, I've been able to go further than ever. It eventually started running DDIM sampling.

Screenshot 2022-09-15 at 14 03 59

I've also been monitoring the c variable at entry and exit in def forward(self, x, c, *args, **kwargs). At least in this run, the value of c at the exit has been the same, I would say, for the 499 iterations it's done. That surprised me, since c at the entrance changes every time: ['a rendering of a '], ['a good photo of the '], ['a bright photo of the *'], etc. But I don't know how this part works, so it might be the correct/expected behavior.

Anyway, I've got an error (a new one):

Traceback (most recent call last):
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/./main.py", line 949, in <module>
    trainer.fit(model, data)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 219, in advance
    self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/main.py", line 535, in on_train_batch_end
    self.log_img(pl_module, batch, batch_idx, split='train')
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/main.py", line 512, in log_img
    logger_log_images(pl_module, images, pl_module.global_step, split)
  File "/Users/eduardoarinopelegrin/opt/anaconda3/envs/do_not_touch-osx-arm64-stable-diffusion/lib/python3.9/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/eduardoarinopelegrin/Downloads/stable-diffusion/main.py", line 446, in _testtube
    pl_module.logger.experiment.add_image(
AttributeError: 'ExperimentWriter' object has no attribute 'add_image'

EliasOenal commented 1 year ago

If you have a public repository for your WIP version, I'd be happy to give it a try when I find the time. But this just looks like an issue with the pytorch lightning library. Maybe an incompatible version.

Any-Winter-4079 commented 1 year ago

@EliasOenal My version is not published yet (lots of changes/prints from other tests :), but I'm pretty sure it's the Dev branch from this repo with the same updates as https://github.com/lstein/stable-diffusion/issues/517#issue-1369862339 but with these changes: trainer_config['strategy'] = 'ddp' and trainer_opt.detect_anomaly = False

What I have added new is

def forward(self, x, c, *args, **kwargs):
        # print('c in min', torch.min(c))
        # print('c in max', torch.max(c))
        inf = True
        it = 0
        c_orig = c
        while inf:
            print(c)
            it += 1
            print('it', it)
            t = torch.randint(
                0, self.num_timesteps, (x.shape[0],), device=self.device
            ).long()
            if self.model.conditioning_key is not None:
                assert c_orig is not None
                if self.cond_stage_trainable:
                    c = self.get_learned_conditioning(c_orig)
                if self.shorten_cond_schedule:  # TODO: drop this option
                    tc = self.cond_ids[t].to(self.device)
                    c = self.q_sample(
                        x_start=c, t=tc, noise=torch.randn_like(c.float())
                    )
            inf = torch.isinf(c).any().item()
            print('c out min', torch.min(c))
            print('c out max', torch.max(c))
        print('$$$$$')
        return self.p_losses(x, c, t, *args, **kwargs)

in ddpm.py

I'm not sure if I'll be able to catch the inf/-inf with torch.isnan(c).any().item(), but if not, the idea is the same (just changing that line). Update: changed to inf = torch.isinf(c).any().item(). Anyway, I will do a PR if I get everything to work, but with the changes above you should be set to be in the same state as me!


Edit: Upon seeing this error

pl_module.logger.experiment.add_image(
AttributeError: 'ExperimentWriter' object has no attribute 'add_image'

I also commented that out from main.py (not sure if it's important - will have to look into what it does):

    @rank_zero_only
    def _testtube(self, pl_module, images, batch_idx, split):
        pass
        # for k in images:
        #     grid = torchvision.utils.make_grid(images[k])
        #     grid = (grid + 1.0) / 2.0  # -1,1 -> 0,1; c,h,w

        #     tag = f'{split}/{k}'
        #     pl_module.logger.experiment.add_image(
        #         tag, grid, global_step=pl_module.global_step
        #     )

And now I just got to 100% and found TypeError: on_train_epoch_end() missing 1 required positional argument: 'outputs', for which I've changed def on_train_epoch_end(self, trainer, pl_module, outputs): to def on_train_epoch_end(self, trainer, pl_module, outputs=None):