invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

Textual Inversion Training on M1 (works!) #517

Closed · tmm1 closed this 2 years ago

tmm1 commented 2 years ago

WIP HERE: https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1


I started experimenting with running main.py on M1 and wanted to document some immediate issues.

Looks like we need a newer pytorch-lightning for MPS. Currently using 1.6.5 but latest is 1.7.5

However bumping it causes this error:

AttributeError: module 'pytorch_lightning.loggers' has no attribute 'TestTubeLogger'. Did you mean: 'NeptuneLogger'?

which is because TestTubeLogger was deprecated: https://github.com/Lightning-AI/lightning/issues/13958#issuecomment-1200780456
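One possible workaround (a sketch, not a tested patch; `logdir` stands for whatever log directory main.py already uses) is to point the logger config at a logger that still ships with pytorch-lightning 1.7.x, e.g. TensorBoardLogger or CSVLogger:

from pytorch_lightning.loggers import TensorBoardLogger, CSVLogger

# Sketch: replace the removed TestTubeLogger with a logger that still exists
# in pytorch-lightning 1.7.x. `logdir` is assumed to be the run's log directory.
logger = TensorBoardLogger(save_dir=logdir, name="tensorboard")
# or, for plain CSV metrics:
# logger = CSVLogger(save_dir=logdir, name="csv")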

Birch-san commented 2 years ago

prefer torch.isinf(c).any().item() if you're looking for ±Inf.

Any-Winter-4079 commented 2 years ago

@Birch-san Thanks! I've updated my changes above too with the latest error (at 100% of Epoch 0). By the way, if anyone is going to try this out, keep only the prints you really need, as they slow each iteration down.


Update: After the DDIM sampling, I get this warning

UserWarning: `ModelCheckpoint(monitor='val/loss_simple_ema')` could not find the monitored key in the returned metrics: ['train/loss_simple', 'train/loss_simple_step', 'train/loss_vlb', 'train/loss_vlb_step', 'train/loss', 'train/loss_step', 'global_step', 'epoch', 'step']. HINT: Did you call `log('val/loss_simple_ema', value)` in the `LightningModule`?
  warning_cache.warn(m)
Epoch 0, global step 500: 'val/loss_simple_ema' was not in top 1

But at least it's finished Epoch 0

Average Epoch time: 1434.72 seconds
Average Peak memory 0.00MiB

As a curiosity / something strange: since I added the prints, I'm not getting nan nearly as often (yesterday I was getting them pretty early in Epoch 0). Already in Epoch 1 now. In case it does happen, I added the while-inf loop.


Update 2: Finally encountered inf/-inf in Epoch 1. Neither torch.isnan(c).any().item() nor torch.isinf(c).any().item() seems to catch the inf/-inf.

Screenshot 2022-09-15 at 19 12 15

Will try with

inf = torch.isnan(c).any().item() or torch.isinf(c).any().item() or \
      torch.isinf(torch.max(c)).item() or torch.isinf(torch.min(c)).item() or \
      torch.isnan(torch.max(c)).item() or torch.isnan(torch.min(c)).item()

Success! inf/-inf were caught.
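For reuse, the same combined check can be wrapped in a small helper (just a sketch; has_nan_or_inf is a name I'm using here, not something from the repo):

import torch

def has_nan_or_inf(t: torch.Tensor) -> bool:
    # combines the element-wise checks with the min/max fallback that
    # actually catches ±inf on MPS
    return (
        torch.isnan(t).any().item() or torch.isinf(t).any().item()
        or torch.isinf(torch.max(t)).item() or torch.isinf(torch.min(t)).item()
        or torch.isnan(torch.max(t)).item() or torch.isnan(torch.min(t)).item()
    )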

Birch-san commented 2 years ago

I'm surprised the min/max checks are necessary.

oh.

torch.isinf(torch.tensor([1])/0).any().item()
# True

torch.isinf(torch.tensor([1], device='mps')/0).any().item()
# False

…what.

Issue created: https://github.com/pytorch/pytorch/issues/85106

Any-Winter-4079 commented 2 years ago

Okay, so back to this. In def forward(self, x, c, *args, **kwargs), c enters as e.g. ['a cropped photo of a *'] and exits as a tensor. On some rare occasions, the values in c go to nan:

c cond_stage_trainable tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='mps:0',
       grad_fn=<NativeLayerNormBackward0>)

That print comes from

if self.cond_stage_trainable:
    c = self.get_learned_conditioning(c_orig)
    print('c cond_stage_trainable', c)

I've tried adding a while-inf loop and simply repeating the call, but it seems like nan values are returned every time.
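Roughly what the retry looked like (a sketch, in the same spot as the print above, not the exact code): re-run the conditioning encoder while the result still contains nan. In practice the same nan came back every time.

c = self.get_learned_conditioning(c_orig)
while torch.isnan(c).any().item() or torch.isnan(torch.min(c)).item() or torch.isnan(torch.max(c)).item():
    # retry: re-encode the same prompt and hope for a finite result
    c = self.get_learned_conditioning(c_orig)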

Here you can see it at 8824 (I fell asleep for a while) and it still has nan.

Screenshot 2022-09-16 at 01 43 12

So, next I'll try exploring this function

def get_learned_conditioning(self, c):
    if self.cond_stage_forward is None:
        if hasattr(self.cond_stage_model, 'encode') and callable(
            self.cond_stage_model.encode
        ):
            c = self.cond_stage_model.encode(
                c, embedding_manager=self.embedding_manager
            )
            if isinstance(c, DiagonalGaussianDistribution):
                c = c.mode()
        else:
            c = self.cond_stage_model(c)
    else:
        assert hasattr(self.cond_stage_model, self.cond_stage_forward)
        c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)

    return c

to see why it keeps returning nan in c, even though its input c seems to be the same value, e.g. ['a cropped photo of the *'], which worked in previous iterations. I guess other variables inside the function may have changed.
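One way to narrow this down (a hypothetical debugging helper, not something in the repo) would be to hook every submodule of cond_stage_model and report the first one whose output stops being finite; the .cpu() is there because isnan/isinf on MPS have proven unreliable above:

import torch

def install_nan_hooks(module, prefix='cond_stage_model'):
    # register a forward hook on every submodule; print the first one whose
    # output contains nan/inf (checked on CPU to dodge the MPS isinf bug)
    handles = []
    for name, sub in module.named_modules():
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output.detach().cpu()).all():
                print(f'non-finite output in {prefix}.{name} ({type(mod).__name__})')
        handles.append(sub.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done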

Birch-san commented 2 years ago

Do we reckon that an important part of the repro is to run it for many steps (about an epoch... and an epoch is 606 steps?)

like, if you saved a checkpoint near 1 epoch: would you expect to encounter the problem shortly after loading?

I'm trying to work out whether this is "a 1% chance and we just need to roll the dice enough times" or "running for longer is an important precondition".

if it is a "running for longer" problem, then I wonder whether it's some kind of accumulated floating-point inaccuracy, such as is described here:
https://github.com/pytorch/pytorch/issues/84936

Any-Winter-4079 commented 2 years ago

I had 6 images (they suggest 3-5; I didn't realise there was an extra one), so it was 606 steps. Now I've cut it to 3 images, so I'm completing an epoch in 303 steps.

I'm encountering the error either way, so I'm not sure at this point whether it's nondeterministic behavior (e.g. a small chance of nan per iteration, and it's just a matter of probability) or the nan comes from other causes (e.g. errors in inputs, gradients, losses, etc.).

like, if you saved a checkpoint near 1 epoch: would you expect to encounter the problem shortly after loading?

That might be something to try!

Any-Winter-4079 commented 2 years ago

Here, for example, I'm encountering nan at 34% of Epoch 0:

Epoch 0: 34%|██████████████████████▉ | 104/303 [03:08<06:00, 1.81s/it, loss=nan, v_num=0, train/loss_simple_step=nan.0, train/loss_vlb_step=nan.0, train/loss_step=nan.0, global_step=103.0]

This time I removed the prints. I have the feeling (but it might be just pure bias) that when I add prints, it lasts longer (completes the first epoch more often than not, vs. vice versa). Could it be that a print causes a value to go to the CPU, or something else that makes things 'better'? Or slows down the GPU and makes it less error-prone? Idk.

Birch-san commented 2 years ago

Guessing here, but sounds completely plausible to me that printing (i.e. transferring a copy of the tensor to CPU) could have side-effects, yes.

Although.. all you're transferring to CPU is a single Boolean. But maybe computing isinf() over the tensor has some kind of effect. I think every operation adds a node to the computational graph, which is utilised by the backward pass for gradient backpropagation. But I don't know enough about how it works to know whether this would be consequential.

As for whether it has a beneficial slowing effect.. I think I saw MPS issues about concurrency, or how a computation can be influenced by what's cached beforehand. so maybe that matters.

Any-Winter-4079 commented 2 years ago

To report my last attempt before I go to bed (3:15 here), it seems to be learning something at least?

I'm not very familiar with how it works internally, but I would call this progress. In logs/burger2022-09-16T02-31-52_my_burger/images/train I see samples_gs-000500_e-000001_b-000199.png

Screenshot 2022-09-16 at 03 16 13

samples_scaled_gs-000500_e-000001_b-000199.png

Screenshot 2022-09-16 at 03 16 24

None of those are in my training set, which is the following:

Screenshot 2022-09-16 at 03 19 09 Screenshot 2022-09-16 at 03 19 17 Screenshot 2022-09-16 at 03 19 26

I also have the checkpoint but I haven't tried @Birch-san 's suggestion of loading from there to continue training. metrics.csv is:

Screenshot 2022-09-16 at 03 22 24

All of this was up to step 752 (which, now with 3 images, is Epoch 2 at 50%, so it completed Epochs 0 and 1 and half of Epoch 2). I did have a bunch of prints, which may or may not have made a difference.

If someone is feeling adventurous, I'd encourage you to try it. The more people test this, the more bugs/weird behavior we can probably find (like the inf/-inf and torch.isinf issue)!

Any-Winter-4079 commented 2 years ago

Currently trying to load from the .ckpt to resume training as suggested, but the dict seems to be empty ({}). I am getting this warning at the end of Epoch 0:

UserWarning: `ModelCheckpoint(monitor='val/loss_simple_ema')` could not find the monitored key in the returned metrics: ['train/loss_simple', 'train/loss_simple_step', 'train/loss_vlb', 'train/loss_vlb_step', 'train/loss', 'train/loss_step', 'global_step', 'epoch', 'step']. HINT: Did you call `log('val/loss_simple_ema', value)` in the `LightningModule`?
  warning_cache.warn(m)
Epoch 0, global step 500: 'val/loss_simple_ema' was not in top 1

so I asked here in case someone with CUDA (outside M1) experiences the same

Any-Winter-4079 commented 2 years ago

Also, there was a previous embeddings.pt (not the latest one, which outputs random images in black and white), which, given a photo of * in high quality, detailed picture, 8k, artstation, vibrant colors -s 20, outputs burgers (image 001003 3356948889). But something like a photo of * in the hands of a Tom Cruise, high detail, 8k -s 20 fails (image 000999 948063444). It only trained for about 1-2 epochs, so training for longer may produce more coherence. In any case, I'm posting an update because we are not far off, and even though it's a bit buggy, seeing at least some results is encouraging.

EliasOenal commented 2 years ago

The version of pytorch used seems to make a big difference. On the nightlies I get NaN on step two reliably. @Any-Winter-4079, I've further noticed that the speeds I'm seeing are much lower than what your screenshots indicate. I am getting ~12s/it on an M1 Max; have you tweaked anything to achieve ~2s/it? (I see high GPU load, so MPS is active.)

Any-Winter-4079 commented 2 years ago

@EliasOenal I'm using 1.12.1, as the nightly is slower, as you say. Also, there is the latest speed update, not yet merged (https://github.com/lstein/stable-diffusion/pull/582), which by the way, @Birch-san, you may want to add to your repo.


I also have num_workers=10 and

self.num_workers = 10

(my number of CPU cores). But I'm not sure if that makes a difference.
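If you'd rather not hardcode the 10, the worker count could be derived from the machine (just a sketch; whether more DataLoader workers actually helps on MPS is untested):

import os

# use the machine's CPU count instead of a hardcoded 10
num_workers = os.cpu_count() or 1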

Any-Winter-4079 commented 2 years ago

@EliasOenal You mean on the second epoch or on the second step of the first epoch? If you are getting NaN on the second epoch (Epoch 1), that's where I'm seeing it the most too (sometimes at Epoch 2, rarely at Epoch 0).

Any-Winter-4079 commented 2 years ago

By the way, it seems pred is what goes to NaN. Also, it is not normalised? At least not to the -1 to 1 range:

tensor([[[[-0.9070,  1.1422,  0.2337,  ...,  0.6926,  0.5076,  0.2312],
          [ 1.2821, -0.7480, -1.6530,  ...,  0.3883,  0.5181, -1.8114],
          [ 0.9122,  0.4990, -0.2845,  ...,  1.0343, -0.1174, -0.0139],
          ...,
          [-2.7908,  0.1058, -0.1103,  ...,  0.3809,  1.9895,  0.5667],
          [ 0.4922, -0.2873,  1.9048,  ..., -0.9527, -0.0040,  1.4782],
          [ 0.0279, -1.5981,  1.5414,  ...,  0.9938,  0.3461,  0.2506]],

         [[ 0.7946, -0.4964, -1.5475,  ...,  1.1144,  0.4206,  1.5213],
          [-1.2461, -0.5346, -0.6677,  ..., -0.4309,  0.8820, -0.7152],
          [-0.4116,  0.5359,  0.7523,  ..., -0.9546,  0.0519, -0.9330],
          ...,
          [-0.6324,  0.2504, -0.1679,  ...,  0.9425, -1.3993, -0.9232],
          [ 1.9189,  0.0851, -0.1664,  ...,  1.2863,  0.7146,  0.5905],
          [-1.0338,  0.8190,  1.4619,  ...,  0.0362, -0.0131, -1.1003]],

         [[-0.1265, -1.0799,  0.3885,  ...,  0.6771, -1.6883, -0.7425],
          [-1.0996,  0.4505, -0.3360,  ..., -0.8754, -0.3665,  0.9793],
          [-0.0369, -0.4248,  0.6339,  ..., -1.1220, -0.0533,  0.1543],
          ...,
          [-0.4716,  0.2988,  0.8327,  ...,  0.0877, -0.2676, -1.5864],
          [-0.9548,  0.2204, -2.1214,  ..., -0.8743, -1.5195, -0.8521],
          [-0.7534,  0.6483,  0.2687,  ..., -0.5459,  0.1746, -1.0746]],

         [[-0.2135,  0.8470, -1.5916,  ..., -1.4197, -1.7272,  0.4620],
          [-1.3449, -0.4242, -0.2954,  ..., -0.1218,  0.7973, -0.1709],
          [ 1.9218,  0.6341, -0.3088,  ...,  0.0626, -0.0719,  2.3299],
          ...,
          [-0.1524,  0.2463, -0.4012,  ..., -0.0048, -0.3533, -0.5027],
          [-1.2391,  0.3282, -0.9266,  ...,  1.2407, -0.5316, -0.7290],
          [ 1.3435,  0.4594,  0.5614,  ...,  0.5130,  0.4320, -1.5459]]]],
       device='mps:0', grad_fn=<ConvolutionBackward0>)

If that is the case, then pred can make loss NaN via loss = (target - pred).abs(), and from there it spreads to other variables such as c (which enters def get_learned_conditioning(self, c) as ['a photo of a dirty *'] and exits as a tensor). The operation where it must change to NaN is

c = self.cond_stage_model.encode(
    c, embedding_manager=self.embedding_manager
)

I'm not sure how that works internally, but it must access loss or pred somehow (for example, there is a line self.embedding_manager.embedding_to_coarse_loss().mean(); I haven't tested whether it gets called, but it may relate the embedding_manager to the loss).

Anyway, I'll be looking to correct NaN in pred
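As a quick standalone sanity check of that ripple effect: a single NaN in the prediction is enough to poison the reduced L1 loss.

import torch

target = torch.zeros(4)
pred = torch.tensor([0.1, float('nan'), 0.3, 0.4])
loss = (target - pred).abs()
print(loss)         # tensor([0.1000,    nan, 0.3000, 0.4000])
print(loss.mean())  # tensor(nan)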

Birch-san commented 2 years ago

btw, if you're running on a nightly build: beware that there's a bug with einsum() which will make cross-attention return the wrong result the first time it's invoked.
https://github.com/pytorch/pytorch/issues/85224

Birch-san commented 2 years ago

Also, there is the latest update for speed, not yet merged (#582, which by the way @Birch-san you may want to add to your repo).

thanks very much for this tip! sounds like it's faster (uses cache better), so I'll definitely take a look.
but I'm currently keeping my branch close to original CompVis implementation, because it makes it easier to investigate problems like https://github.com/pytorch/pytorch/issues/85224.

Any-Winter-4079 commented 2 years ago

It looks like model_out contains infinity. And then it is fed to get_loss as the prediction, to be compared against the target. Hence, loss in get_loss also becomes infinity and so on. It's a ripple effect.

Screenshot 2022-09-18 at 17 00 43

In case it was a noise problem (who knows), I tried setting up a while loop: while the prediction contains infinity, repeat it. Yet the noise seems to always be the same (hence, the same prediction).

The noise is set here: noise = default(noise, lambda: torch.randn_like(x_start)), and the default function comes from ldm/util

Screenshot 2022-09-18 at 17 16 00

Now, since val is None (I checked), I would expect the lambda to be called and return new random values. But the comparison prints == tensor(True, device='mps:0'), so there may be a problem there.
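For reference, default in the CompVis-style ldm/util.py is roughly the following (quoted from memory, so treat it as a sketch); note it only invokes the lambda when val is None:

from inspect import isfunction

def exists(x):
    return x is not None

def default(val, d):
    # return val if it is set; otherwise fall back to d (calling it if callable)
    if exists(val):
        return val
    return d() if isfunction(d) else d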


Very preliminary results, but I still managed to fix model_output containing inf by re-generating the noise!

Screenshot 2022-09-18 at 17 47 19

Code:

while inf:
    # regenerate the noise (only when none was passed in) and retry the forward pass
    if noise is None:
        noise2 = torch.randn_like(x_start)
    else:
        noise2 = noise
    x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise2)
    model_output = self.apply_model(x_noisy, t, cond)
    # keep looping while the prediction still contains ±inf
    inf = torch.isinf(torch.min(model_output)).item() or torch.isinf(torch.max(model_output)).item()

We can clean the code up, but I'm more keen on seeing whether the model is going to learn and not hit a nan loss.

Birch-san commented 2 years ago

good sleuthing.

we know randn is quirky on MPS. randn_like will generate the random numbers using the same device as the input tensor, so yeah we're exposed to that quirkiness here.

I wonder what would happen if — instead of that while loop — you replaced torch.randn_like(x_start) with torch.randn_like(x_start, device='cpu')

could use torch.randn_like(x_start, device='cpu' if x_start.device.type == 'mps' else x_start.device) to be considerate to CUDA users.
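Roughly what that would look like as a helper (untested; just a sketch of the generate-noise-on-CPU idea, with the result moved back to the original device so downstream code is unaffected):

import torch

def randn_like_cpu(x: torch.Tensor) -> torch.Tensor:
    # generate the noise on CPU so the MPS RNG kernel is never used,
    # then move it back to the tensor's original device
    if x.device.type == 'mps':
        return torch.randn(x.shape, dtype=x.dtype, device='cpu').to(x.device)
    return torch.randn_like(x)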

Birch-san commented 2 years ago

nevermind, sounds like you're saying the inf lies within the model_output, not the random noise?

I tried some while loops to generate random numbers on MPS, and didn't get inf out of the random function. so probably no need to try my random-on-CPU idea.

tmm1 commented 2 years ago

Can we reuse fix_func from generate.py?

https://github.com/lstein/stable-diffusion/blob/development/ldm/generate.py#L45
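From memory, fix_func monkey-patches torch's random entry points (randn, randn_like, rand, etc.) so they run on CPU and move the result back to MPS; roughly along these lines (paraphrased, see the linked line for the real code):

import torch

def fix_func(orig):
    # only patch when running on MPS
    if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        def new_func(*args, **kw):
            device = kw.get('device', 'mps')
            kw['device'] = 'cpu'                 # generate on CPU...
            return orig(*args, **kw).to(device)  # ...then move back
        return new_func
    return orig

torch.randn = fix_func(torch.randn)
torch.randn_like = fix_func(torch.randn_like)
# (the real block patches rand, rand_like, randint, bernoulli, multinomial, etc. as well)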

Any-Winter-4079 commented 2 years ago

Yes, model_output is what goes to -inf/inf, but it's affected by the noise. We'd need new noise to try again (if this fix works).

Update: the issue with the random generation was that default was being imported from ldm/util.py, and for some reason, invoking the function several times gave the same result. So:

from ldm.util import (
    default,
)
noise = default(noise, lambda: torch.randn_like(x_start))
noise2 = default(noise, lambda: torch.randn_like(x_start))
print(noise == noise2) # True

but

noise = torch.randn_like(x_start)
noise2 = torch.randn_like(x_start)
print(noise == noise2) # False

Oh, I know why it gave the same result. Because I'm overwriting noise, and it's no longer None. Duh. So for the second noise2 = default(noise, lambda: torch.randn_like(x_start)), noise is no longer None, and the function never gets called. Well, both options should work then, within the loop.

Any-Winter-4079 commented 2 years ago

Can we reuse fix_func from generate.py?

https://github.com/lstein/stable-diffusion/blob/development/ldm/generate.py#L45

fix_func was introduced because -S (seed) was not working for k_euler_a and another sampler, if I remember correctly. I guess you can re-use it, yes. PS: What do you plan to use it for?

Any-Winter-4079 commented 2 years ago

I stopped training (Ctrl+C) at 5120 steps [Epoch 17, 7%]. Best result: 'val/loss_simple_ema' reached 0.01681 (best 0.01681) at Epoch 6. As suggested, I let it run for 5000+ steps.

tmm1 commented 2 years ago

I tried some while loops to generate random numbers on MPS, and didn't get inf out of the random function. so probably no need to try my random-on-CPU idea.

Ah okay then fix_func also makes no sense.

I'm more keen on whether the model is going to learn and not get loss nan

Very interested to see the results!

tmm1 commented 2 years ago

I copy-pasted the entire fix_func block (which includes randn_like) to the top of main.py, and so far I am at Epoch 3 with no nan-bomb.

This is the furthest I have gotten, so it seems the issue is indeed with the mps rand?

(I also merged development earlier and rebuilt my conda env. I had to update tensorboard to allow pytorch-lightning 1.7.5. see https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1)

Epoch 3: 23%|███████████████████████████████▍ | 92/404 [01:56<06:34, 1.26s/it, loss=0.127, v_num=0, train/loss_simple_step=0.184, train/loss_vlb_step=0.000784, train/loss_step=0.184, global_step=1291.0, train/loss_simple_epoch=0.107, train/loss_vlb_epoch=0.00197, train/loss_epoch=0.107]

cc https://github.com/lstein/stable-diffusion/issues/397#issuecomment-1240679294

EDIT: Up to Epoch 5. The images in the training logs are also not black anymore!

Any-Winter-4079 commented 2 years ago

Here are my results.

I trained with the following 3 images for 17 Epochs (0 through 16).

Screenshot 2022-09-18 at 23 59 46

Every Epoch consists of 303 steps (101 × the number of images in the training dataset), plus some (varying per Epoch) DDIM sampling runs (200 iterations each). The last 3 steps of each Epoch (for 3 images) might be dropped, because I stopped at step 5120, which is 300 × 17 + 20 = 5120 steps. Those 20 steps are from Epoch 17, where I stopped.

Epochs

Epoch | Global step | Time (MM:SS) | s/it | DDIM
------|-------------|--------------|------|-----
0 | 300 | 08:38 | 1.71 |
1 | 600 | 18:54 | 3.74 | Yes
2 | 900 | 09:33 | 1.89 |
3 | 1200 | 12:52 | 2.55 | Yes
4 | 1500 | 23:28 | 4.65 | Yes
5 | 1800 | 09:56 | 1.97 |
6 | 2100 | 13:28 | 2.67 | Yes
7 | 2400 | 10:06 | 2.00 |
8 | 2700 | 13:51 | 2.75 | Yes
9 | 3000 | 27:18 | 5.41 | Yes
10 | 3300 | 26:32 | 5.26 |
11 | 3600 | 16:32 | 3.28 | Yes
12 | 3900 | 10:15 | 2.03 |
13 | 4200 | 14:10 | 2.81 | Yes
14 | 4500 | 23:44 | 4.72 | Yes
15 | 4800 | 11:22 | 2.25 |
16 | 5100 | 16:53 | 3.35 |

Time per epoch mostly depends on the number of DDIM samplings in that epoch. Overheating and running other applications (especially if streaming) are also a factor. From what I observed, peak RAM usage may have been about 50GB. It didn't look very problematic, though.

Screenshot 2022-09-18 at 21 56 26

Results

The embeddings files seem to be saved in a repeating 600, 600, 300 step pattern. These are all of the created files:

embeddings_gs-600.pt
embeddings_gs-1200.pt
embeddings_gs-1800.pt
embeddings_gs-2100.pt
embeddings_gs-2700.pt
embeddings_gs-3300.pt
embeddings_gs-3600.pt
embeddings_gs-4200.pt
embeddings_gs-4800.pt
embeddings_gs-5100.pt
embeddings.pt

Running the following command a photo of * -m k_euler -s 10 -n3 and loading each of these files, we get:

embeddings_gs-600.pt

Screenshot 2022-09-19 at 01 08 37

embeddings_gs-1200.pt

Screenshot 2022-09-19 at 01 11 28

embeddings_gs-1800.pt

Screenshot 2022-09-19 at 01 17 02

embeddings_gs-2100.pt

Screenshot 2022-09-19 at 01 21 22

embeddings_gs-2700.pt

Screenshot 2022-09-19 at 01 23 16

embeddings_gs-3300.pt

Screenshot 2022-09-19 at 01 27 15

embeddings_gs-3600.pt

Screenshot 2022-09-19 at 01 32 54

embeddings_gs-4200.pt

Screenshot 2022-09-19 at 01 34 50

embeddings_gs-4800.pt

Screenshot 2022-09-19 at 01 37 15

embeddings_gs-5100.pt

Screenshot 2022-09-19 at 01 39 39

embeddings.pt

Screenshot 2022-09-19 at 01 41 04

Yes, it did run (no nan), but it did not seem to learn, which is a bit surprising because in images/train and images/val there are a bunch of burger images (others are black):

Screenshot 2022-09-19 at 01 47 38

I will have to investigate tomorrow. @tmm1 let us know how your training goes

PS: I have to add that my best epoch is the 7th (of 17), which is not a great sign. Anyway, the best val/loss_simple_ema was 0.016806211322546005, around step 1800.

Birch-san commented 2 years ago

any idea whether the newfound success is because of CPU rand, or because you updated dependencies?

tmm1 commented 2 years ago

I got to the 4400-step embeddings and stopped. No nan, and I'm seeing no black images in train/ or val/.

I am re-running now without the fix_func to make sure it still fails.

Any-Winter-4079 commented 2 years ago

Epoch 3: 23%|███████████████████████████████▍ | 92/404 [01:56<06:34, 1.26s/it, loss=0.127, v_num=0, train/loss_simple_step=0.184, train/loss_vlb_step=0.000784, train/loss_step=0.184, global_step=1291.0, train/loss_simple_epoch=0.107, train/loss_vlb_epoch=0.00197, train/loss_epoch=0.107]

1.26s/it is fast! Do you have the 128GB RAM M1?

tmm1 commented 2 years ago

64GB M1 Ultra

I hit nan at Epoch 1: 99/404 without the fix_funcs

Any-Winter-4079 commented 2 years ago

I have a 64GB M1 Max. That may explain the speed difference. I'll have to try tomorrow with fix_func. Do you get good results using your embeddings.pt files?


Edit: Okay, about my results above: I was using a photo of * but it was trained on a photo of a *. At least it DOES seem to be producing burgers (image 001122 2616763588).

I'll update my results on the comment above tomorrow then.

What I wonder is whether we can remove a lot of these phrases. Would it work the same (and be faster) if we only trained with 'a photo of *'? Also, I wonder if it can be used to learn your face, or whether that is too specific and would create a random person, since these burgers are NOT the same ones I trained it on.

Screenshot 2022-09-19 at 03 05 58

Any-Winter-4079 commented 2 years ago

a close-up photo of a * in the style of Van Gogh -s 15 001125 1380320324

tmm1 commented 2 years ago

I'm getting strange results with embeddings.pt also.

It almost seems like the * does nothing. When I was using Hugging Face embeddings, I used a different placeholder phrase like <ugly-sonic>, which worked better.

As an example I tried the same seed for one prompt and removed '*', and got the exact same image back.

tmm1 commented 2 years ago

These are almost identical:

[25] outputs/img-samples/000369.1.png: "a photo of a *" -s 50 -W 512 -H 512 -C 7.5 -A k_lms -S 1

[26] outputs/img-samples/000370.1.png: "a photo of a" -s 50 -W 512 -H 512 -C 7.5 -A k_lms -S 1

tmm1 commented 2 years ago

The same prompt and seed gives a different result when --embedding_path is omitted, so maybe '*' maps to some type of base token that is already included in the prompt?

Any-Winter-4079 commented 2 years ago

Yes, you are right: "a close-up photo of a * in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324 vs. "a close-up photo of a in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

Screenshot 2022-09-19 at 03 31 21

Also without 'a' "close-up photo of a in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

Screenshot 2022-09-19 at 03 33 15

And if we remove other parts, like 'close-up' or 'photo' "a photo of * in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

Screenshot 2022-09-19 at 03 35 23

"close-up of a in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

Screenshot 2022-09-19 at 03 36 37

Even in the extreme case, where we remove everything ('a close-up photo of a *'), it has some resemblance. "in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

Screenshot 2022-09-19 at 03 39 59

I guess it's mostly about the seed (to get similar results), even if parts of the prompt are missing. And yes, there seems to be a starting point already, e.g. "A beautiful waterfall" -s50 -W512 -H512 -C7.5 -Ak_lms -S4100999182 gives me

Screenshot 2022-09-19 at 04 00 58

so it seems everything revolves around your training (in my case, burgers/food)? That is when it behaves correctly. Other times, it seems to be just a random result.

Hmm, I'll try again tomorrow. So far it seems it loses context (e.g. a photo of a in a swimming pool, in New York, etc. None of that works: no swimming pool, no New York. Even with people, like Emma Watson, it outputs nothing similar. Just either burgers (not exactly the ones it was trained on) or random images). I guess the training is the problem?

Otherwise we could just use the current model and say a burger in New York, which gives

Screenshot 2022-09-19 at 04 12 40

which is a burger, presumably in New York.

The only point I see in training is if it learns a new thing (e.g. your face) and can at least merge it with some context (e.g. in New York). If it can only output something similar (e.g. another face) and loses context, I don't see the usefulness.

But I must say the ugly-sonic embedding worked much, much better, so I'm hoping it's just a training issue.

Any-Winter-4079 commented 2 years ago

@i3oc9i @EliasOenal @Vargol @heurihermilab @krummrey Just a heads up that Textual Inversion (kinda) works on M1, in case you want to train and share your results to help improve it. We have finally moved past the nan loss problem.

Now it's all about how to train properly (number of images, learning rate, number of epochs, sampler, etc.)

i3oc9i commented 2 years ago

@Any-Winter-4079

@i3oc9i @EliasOenal @Vargol @heurihermilab @krummrey Just a heads up that Textual Inversion (kinda) works on M1, in case you want to train and share your results to help improve it. We have finally moved past the nan loss problem.

Thank you a lot for this information. I will give it a try during this week.

tmm1 commented 2 years ago

FYI, I trained to 18k steps overnight without any nan issues.

Epoch 46: 46%|███▏ | 187/404 [12:55<14:59, 4.15s/it, loss=0.083, v_num=0, train/loss_simple_step=0.102, train/loss_vlb_step=0.000348, train/loss_step=0.102, global_step=18586.0, train/loss_simple_epoch=0.111, train/loss_vlb_epoch=0.00233, train/loss_epoch=0.111]

tmm1 commented 2 years ago

Also I found an implementation of a different paper which offers much better textual inversion: https://github.com/lstein/stable-diffusion/issues/107#issuecomment-1250545275

Any-Winter-4079 commented 2 years ago

@tmm1 I tried your fix_func solution for the rand issue (vs. the while-loop solution of re-generating the noise), and while these are only preliminary results, I tend to prefer yours.

I tried training for 4 epochs while in class (until my battery almost died), and I got better times per epoch (~2s/it) than yesterday on average. My best val/loss_simple_ema was better than yesterday (0.00699... vs. 0.01680...), although that is probably pure chance/luck. But most importantly, no black images.

So it seemed faster + no black images.

About the results: a bit mixed again. Sometimes I find a good seed, like -S3320183151, and almost no matter the prompt, it produces burgers. "in the style of Van Gogh" -s10 -W512 -H512 -C7.5 -Ak_lms -S3320183151

Screenshot 2022-09-19 at 18 36 44

"in the style of Van Gogh a * painting" -s10 -W512 -H512 -C7.5 -Ak_lms -S3320183151

Screenshot 2022-09-19 at 18 38 02

And then other seeds seem to produce unrelated content, no matter the prompt.

What val/loss_simple_ema did you obtain after 18k steps? And the results... are they good/better?

About

Also I found an implementation of a different paper which offers much better textual inversion: https://github.com/lstein/stable-diffusion/issues/107#issuecomment-1250545275

I will try to test this because it's literally what I've been trying to get.

heurihermilab commented 2 years ago

@Any-Winter-4079 Training on my work is definitely a goal of mine, and textual inversion is the closest I've seen, so definitely interested in testing. Everything upthread is a lot to grep quickly, though, and to me the inner workings are a bunch of black boxes.

So please tell me, can testing be done with the current development branch, or is there another commit (or patch etc) I should test off of instead? As long as I know I'm starting correctly I can puzzle through the command line and see what comes out.

Also, should I limit training input to photographic imagery? I've got a lot of abstract mathematical visual work that is relatively unique and may offer an easily-detectable signal.

Any-Winter-4079 commented 2 years ago

https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1 That is the code. You can see there are 4 files changed, e.g. in ldm/data/personalized.py (red is removed, green is added)

Screenshot 2022-09-19 at 19 25 18

You can apply those changes to your local code. One thing: make sure to have (or update to) pytorch-lightning==1.7.5.

About photographic imagery, I don't know. It's all very new, and we're finding most things out by trial and error. I suggest you try it and report your findings/discoveries!

Any-Winter-4079 commented 2 years ago

I get this error TypeError: __init__() got an unexpected keyword argument 'reg' trying to adapt https://github.com/XavierXiao/Dreambooth-Stable-Diffusion to this repo.

@tmm1 not sure if you get the same.

PS: They use pytorch-lightning==1.5.9, which is not good for us (we need pytorch-lightning==1.7.5 for MPS). I hope we can adapt it just like we did for Textual Inversion.


Update: Okay, about the error: I forgot to update ldm/data/personalized.py.

tmm1 commented 2 years ago

But most importantly, no black images.

Awesome, so it seems we need more than just randn_like fixed for proper operation, and copying all the fix_func overrides is the right solution until pytorch figures out the MPS rand issues upstream.

Any-Winter-4079 commented 2 years ago

I got this repo (https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/blob/main/ldm/data/personalized.py) to run for Epoch 0, even if with dummy data (3 training and 3 testing images). 60GB RAM at peak, 23.76s/it, 13 minutes total, 29 steps completed. To get to the 800 steps they recommend, that would translate to roughly 358 minutes, or about 6 hours. Now, we may need a lot more images in the training and testing sets.


I didn't clone the repo though. Simply brought some files to my local version, like main.py, personalized.py... I might've even missed something. I just didn't want to re-do all the MPS changes we have in this repo...

tmm1 commented 2 years ago

Nice!

Did you create some regularization images too? Seems like that is a big part of how it learns what is different in your training set compared to generic versions of that same thing.

Any-Winter-4079 commented 2 years ago

I created 6 burger images and split them, 3 in training_data and 3 in reg_data.

Birch-san commented 2 years ago

so you copied every fix_func'd function?

looks like it came from here:
https://github.com/lstein/stable-diffusion/pull/579

sounds like the original intention was to improve determinism. but seems it has the happy side-effect of preventing ±Inf? it's Inf and not NaN?

copying all the fix_func is the right solution until pytorch figures out mps rand issues upstream.

there's no issue or minimal repro currently; the pytorch team don't currently know that randomness sometimes returns ±Inf:
https://github.com/pytorch/pytorch/issues?q=is%3Aissue+MPS

can we wrangle a minimal repro for them? they're pretty responsive but I think the MPS specialists are a small team and really benefit from any investigation we can do.
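A possible starting point for such a repro (an assumption on my part; we haven't actually confirmed the RNG itself emits non-finite values): hammer the MPS RNG in a loop and verify each draw on the CPU, where isinf/isnan are known to behave.

import torch

torch.manual_seed(0)
for step in range(100_000):
    x = torch.randn(1, 77, 768, device='mps')   # roughly the shape of the conditioning tensor
    x_cpu = x.cpu()                              # check on CPU to dodge the MPS isinf bug
    if not torch.isfinite(x_cpu).all():
        print(f'non-finite value at step {step}: min={x_cpu.min()}, max={x_cpu.max()}')
        break
else:
    print('no non-finite values observed')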