prefer torch.isinf(c).any().item()
if you're looking for ±Inf.
@Birch-san Thanks! I've also updated my changes above with the latest error (at 100% of Epoch 0). By the way, if anyone is going to try this out, keep only the prints you really need, as they slow each iteration down.
Update: After the DDIM sampling, I get this warning
UserWarning: `ModelCheckpoint(monitor='val/loss_simple_ema')` could not find the monitored key in the returned metrics: ['train/loss_simple', 'train/loss_simple_step', 'train/loss_vlb', 'train/loss_vlb_step', 'train/loss', 'train/loss_step', 'global_step', 'epoch', 'step']. HINT: Did you call `log('val/loss_simple_ema', value)` in the `LightningModule`?
warning_cache.warn(m)
Epoch 0, global step 500: 'val/loss_simple_ema' was not in top 1
But at least it's finished Epoch 0
Average Epoch time: 1434.72 seconds
Average Peak memory 0.00MiB
As a curiosity / something strange, since I've added the prints, I'm not getting nan
nearly as often (yesterday I was getting them pretty soon in Epoch 0). Already in Epoch 1 now. In case it does happen, I added the while inf
loop.
Update 2:
Finally encountered inf/-inf in Epoch 1.
Neither torch.isnan(c).any().item() nor torch.isinf(c).any().item() seems to catch inf/-inf.
Will try with
inf = torch.isnan(c).any().item() or torch.isinf(c).any().item() or \
torch.isinf(torch.max(c)).item() or torch.isinf(torch.min(c)).item() or \
torch.isnan(torch.max(c)).item() or torch.isnan(torch.min(c)).item()
Success! inf/-inf were caught.
I'm surprised the min/max checks are necessary.
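If the min/max trick keeps holding up, it could be wrapped in a small helper so the check reads cleanly at each call site. Just a sketch based on the expression above (the helper name is made up):
import torch

def has_nonfinite(t: torch.Tensor) -> bool:
    # Same checks as above: the plain elementwise isnan/isinf calls plus the
    # min/max reductions, since on MPS the elementwise ones alone missed ±Inf here.
    mn, mx = torch.min(t), torch.max(t)
    return bool(
        torch.isnan(t).any().item() or torch.isinf(t).any().item()
        or torch.isnan(mn).item() or torch.isnan(mx).item()
        or torch.isinf(mn).item() or torch.isinf(mx).item()
    )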
oh.
torch.isinf(torch.tensor([1])/0).any().item()
# True
torch.isinf(torch.tensor([1], device='mps')/0).any().item()
# False
…what.
Issue created: https://github.com/pytorch/pytorch/issues/85106
Okay, so back to this. In def forward(self, x, c, *args, **kwargs):
c enters as ['a cropped photo of a *']
as an example, and exits as a tensor. On some rare occasions, values in c go to nan
c cond_stage_trainable tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='mps:0',
grad_fn=<NativeLayerNormBackward0>)
That print comes from
if self.cond_stage_trainable:
c = self.get_learned_conditioning(c_orig)
print('c cond_stage_trainable', c)
I've tried adding a while inf loop and simply repeating the call, but it seems like every time, nan values are returned. Here you can see it at 8824 (I fell asleep for a while) and it still has nan.
So, next I'll try exploring this function
def get_learned_conditioning(self, c):
if self.cond_stage_forward is None:
if hasattr(self.cond_stage_model, 'encode') and callable(
self.cond_stage_model.encode
):
c = self.cond_stage_model.encode(
c, embedding_manager=self.embedding_manager
)
if isinstance(c, DiagonalGaussianDistribution):
c = c.mode()
else:
c = self.cond_stage_model(c)
else:
assert hasattr(self.cond_stage_model, self.cond_stage_forward)
c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
return c
to see why it always returns nan in c, even though its input parameter c seems to be the same value, e.g. ['a cropped photo of the *'], which has worked in previous iterations. There may be other variables inside the function that have changed, I guess.
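One way to narrow down where inside cond_stage_model the values first blow up, instead of only inspecting the final c, is to hang forward hooks on its submodules. This is only a debugging sketch (the function name is made up), assuming cond_stage_model is an ordinary nn.Module:
import torch

def attach_nonfinite_watch(module, label='cond_stage_model'):
    # Print the name of any submodule whose output contains NaN/Inf.
    # The first name printed in a given forward pass is the earliest offender.
    handles = []

    def make_hook(name):
        def hook(mod, inputs, output):
            if isinstance(output, torch.Tensor) and output.is_floating_point():
                mn, mx = output.min(), output.max()  # min/max trick from above, for MPS
                if (torch.isnan(mn).item() or torch.isnan(mx).item()
                        or torch.isinf(mn).item() or torch.isinf(mx).item()):
                    print(f'{label}.{name}: non-finite output')
        return hook

    for name, sub in module.named_modules():
        handles.append(sub.register_forward_hook(make_hook(name)))
    return handles  # later: for h in handles: h.remove()
e.g. call attach_nonfinite_watch(self.cond_stage_model) once before training, and the first line printed in a step points at the earliest layer producing bad values.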
Do we reckon that an important part of the repro is to run it for many steps (about an epoch.. and epochs are 606 images?)
like, if you saved a checkpoint near 1 epoch: would you expect to encounter the problem shortly after loading?
I'm trying to work out "is this a 1% chance and we just need to roll the dice enough times" or is it a "running for longer is an important precondition".
if it is a "running for longer" problem, then I wonder whether it's some kind of accumulated floating-point inaccuracy, such as is described here:
https://github.com/pytorch/pytorch/issues/84936
I had 6 images (they suggest 3-5; I didn't realise there was an extra one), so it was 606 steps. Now I've cut it to 3 images, so I'm completing an epoch in 303 steps.
I'm encountering the error either way, so I'm not sure at this point whether it's nondeterministic behavior (e.g. a small chance of nan per iteration, and it's just a matter of probability) or the nan comes for other reasons (e.g. an error in the inputs, gradients, losses, etc.).
like, if you saved a checkpoint near 1 epoch: would you expect to encounter the problem shortly after loading?
That might be something to try!
Here for example, I'm encountering nan at 34% of Epoch 0 Epoch 0: 34%|██████████████████████▉ | 104/303 [03:08<06:00, 1.81s/it, loss=nan, v_num=0, train/loss_simple_step=nan.0, train/loss_vlb_step=nan.0, train/loss_step=nan.0, global_step=103.0]
This time I removed the prints. I have the feeling (but it might be just pure bias) that when I add prints, it lasts longer (completes the first epoch more likely than not, vs. vice versa). Could it be that a print causes a value to go to the CPU, or some other thing that makes things 'better'? Or slows down the GPU and makes it less error-prone? Idk.
Guessing here, but sounds completely plausible to me that printing (i.e. transferring a copy of the tensor to CPU) could have side-effects, yes.
Although.. all you're transferring to CPU is a single Boolean. But maybe computing isinf() over the tensor has some kind of effect. I think every operation adds a node to the computational graph, which is utilised by the backward pass for gradient backpropagation. But I don't know enough about how it works to know whether this would be consequential.
As for whether it has a beneficial slowing effect.. I think I saw MPS issues about concurrency, or how a computation can be influenced by what's cached beforehand. so maybe that matters.
To report my last attempt before I go to bed (3:15 here), it seems to be learning something at least?
I'm not very familiar with how it works internally, but I would call this progress.
In logs/burger2022-09-16T02-31-52_my_burger/images/train
I see
samples_gs-000500_e-000001_b-000199.png
samples_scaled_gs-000500_e-000001_b-000199.png
None of those are in my training set, which is the following:
I also have the checkpoint but I haven't tried @Birch-san 's suggestion of loading from there to continue training.
metrics.csv
is:
All of this was up to step 752 (which, now with 3 images, is Epoch 2 at 50%, so it completed Epochs 0 and 1 and half of Epoch 2). I did have a bunch of prints, which may or may not have made a difference.
If someone is feeling adventurous, I'd encourage you to try it. The more people test this, the more bugs/weird behavior we can probably find (like the inf/-inf and torch.isinf ones)!
Currently trying to load from the .ckpt to resume training as suggested, but the dict seems to be empty ({}). I am getting this warning at the end of Epoch 0:
UserWarning: `ModelCheckpoint(monitor='val/loss_simple_ema')` could not find the monitored key in the returned metrics: ['train/loss_simple', 'train/loss_simple_step', 'train/loss_vlb', 'train/loss_vlb_step', 'train/loss', 'train/loss_step', 'global_step', 'epoch', 'step']. HINT: Did you call `log('val/loss_simple_ema', value)` in the `LightningModule`?
warning_cache.warn(m)
Epoch 0, global step 500: 'val/loss_simple_ema' was not in top 1
so I asked here in case someone with CUDA (outside M1) experiences the same
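(For what it's worth, that warning just means 'val/loss_simple_ema' hadn't been logged by the time the checkpoint callback ran, presumably because no validation pass had happened yet, so it has nothing to rank checkpoints by. If anyone wants to silence it, a hedged sketch; the monitored key below is only illustrative, pick one that actually appears in the metrics list from the warning:)
from pytorch_lightning.callbacks import ModelCheckpoint

# sketch only: point the callback at a key that is actually logged every epoch
checkpoint_cb = ModelCheckpoint(
    monitor='train/loss_simple_epoch',
    mode='min',
    save_top_k=1,
)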
Also, there was a previous embeddings.pt (not the latest one, which outputs random images in black and white), which upon
a photo of * in high quality, detailed picture, 8k, artstation, vibrant colors -s 20
outputs burgers
But something like
a photo of * in the hands of a Tom Cruise, high detail, 8k -s 20
fails and outputs
It only trained for like 1-2 epochs, so training for longer may produce more coherence. In any case, I'm posting an update because we are not far off, and even though it's a bit buggy, seeing at least some results is encouraging.
The version of PyTorch used seems to make a big difference. On the nightlies I get NaN on step two reliably. @Any-Winter-4079 Further, I've noticed the speeds I'm seeing are much lower than what your screenshots indicate. I am getting ~12s/it on an M1 Max; have you tweaked anything in order to achieve ~2s/it? (I see high GPU load, so MPS is active.)
@EliasOenal I'm using 1.12.1 as nightly is slower, as you say. Also, there is the latest update for speed, not yet merged (https://github.com/lstein/stable-diffusion/pull/582, which by the way @Birch-san you may want to add to your repo).
I also have num_workers=10 and self.num_workers = 10 (my CPU cores). But I'm not sure if that makes a difference.
@EliasOenal You mean on the second epoch or on the second step of the first epoch? If you are getting NaN on the second epoch (Epoch 1), that's where I'm seeing it the most too (sometimes at Epoch 2, rarely at Epoch 0).
By the way, it seems pred is what goes to NaN, and it is not normalised? At least not to the -1 to 1 range:
tensor([[[[-0.9070, 1.1422, 0.2337, ..., 0.6926, 0.5076, 0.2312],
[ 1.2821, -0.7480, -1.6530, ..., 0.3883, 0.5181, -1.8114],
[ 0.9122, 0.4990, -0.2845, ..., 1.0343, -0.1174, -0.0139],
...,
[-2.7908, 0.1058, -0.1103, ..., 0.3809, 1.9895, 0.5667],
[ 0.4922, -0.2873, 1.9048, ..., -0.9527, -0.0040, 1.4782],
[ 0.0279, -1.5981, 1.5414, ..., 0.9938, 0.3461, 0.2506]],
[[ 0.7946, -0.4964, -1.5475, ..., 1.1144, 0.4206, 1.5213],
[-1.2461, -0.5346, -0.6677, ..., -0.4309, 0.8820, -0.7152],
[-0.4116, 0.5359, 0.7523, ..., -0.9546, 0.0519, -0.9330],
...,
[-0.6324, 0.2504, -0.1679, ..., 0.9425, -1.3993, -0.9232],
[ 1.9189, 0.0851, -0.1664, ..., 1.2863, 0.7146, 0.5905],
[-1.0338, 0.8190, 1.4619, ..., 0.0362, -0.0131, -1.1003]],
[[-0.1265, -1.0799, 0.3885, ..., 0.6771, -1.6883, -0.7425],
[-1.0996, 0.4505, -0.3360, ..., -0.8754, -0.3665, 0.9793],
[-0.0369, -0.4248, 0.6339, ..., -1.1220, -0.0533, 0.1543],
...,
[-0.4716, 0.2988, 0.8327, ..., 0.0877, -0.2676, -1.5864],
[-0.9548, 0.2204, -2.1214, ..., -0.8743, -1.5195, -0.8521],
[-0.7534, 0.6483, 0.2687, ..., -0.5459, 0.1746, -1.0746]],
[[-0.2135, 0.8470, -1.5916, ..., -1.4197, -1.7272, 0.4620],
[-1.3449, -0.4242, -0.2954, ..., -0.1218, 0.7973, -0.1709],
[ 1.9218, 0.6341, -0.3088, ..., 0.0626, -0.0719, 2.3299],
...,
[-0.1524, 0.2463, -0.4012, ..., -0.0048, -0.3533, -0.5027],
[-1.2391, 0.3282, -0.9266, ..., 1.2407, -0.5316, -0.7290],
[ 1.3435, 0.4594, 0.5614, ..., 0.5130, 0.4320, -1.5459]]]],
device='mps:0', grad_fn=<ConvolutionBackward0>)
If that is the case, then pred can make the loss NaN via loss = (target - pred).abs(), and then it extends to other variables such as c (which enters def get_learned_conditioning(self, c): as ['a photo of a dirty *'] and exits as a tensor). The operation where it must change to NaN is
c = self.cond_stage_model.encode(
    c, embedding_manager=self.embedding_manager
)
I'm not sure how that works internally, but it must access loss or pred (for example, there is a line self.embedding_manager.embedding_to_coarse_loss().mean(); I haven't tested whether it gets called, but it may relate the embedding_manager to the loss).
Anyway, I'll be looking to correct the NaN in pred.
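(Just to illustrate the ripple effect on the loss side: a single non-finite element in pred is enough to drag the reduced loss to NaN.)
import torch

target = torch.zeros(4)
pred = torch.tensor([0.1, float('nan'), 0.2, 0.3])
loss = (target - pred).abs().mean()
print(loss)  # tensor(nan): one bad element poisons the whole mean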
btw, if you're running on a nightly build: beware that there's a bug with einsum()
which will make cross-attention return the wrong result the first time it's invoked.
https://github.com/pytorch/pytorch/issues/85224
Also, there is the latest update for speed, not yet merged (#582, which by the way @Birch-san you may want to add to your repo).
thanks very much for this tip! sounds like it's faster (uses cache better), so I'll definitely take a look.
but I'm currently keeping my branch close to original CompVis implementation, because it makes it easier to investigate problems like https://github.com/pytorch/pytorch/issues/85224.
It looks like model_out contains infinity. It is then fed to get_loss as the prediction, to be compared against the target. Hence, the loss in get_loss also becomes infinity, and so on. It's a ripple effect.
In case it was a noise problem (who knows), I tried setting up a while loop: while the prediction contains infinity, repeat it. Yet the noise seems to always be the same (hence, the same prediction).
The noise is set here:
noise = default(noise, lambda: torch.randn_like(x_start))
and the default function comes from ldm/util. Now, since val is None (I checked), I would expect the function to be called and return random values. But comparing two calls gives == tensor(True, device='mps:0'), so there may be a problem there.
Very preliminary results. Still, I managed to correct model_output having inf by re-generating the noise!
Code (with no comments):
while inf:
    if noise is None:
        noise2 = torch.randn_like(x_start)
    else:
        noise2 = noise
    x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise2)
    model_output = self.apply_model(x_noisy, t, cond)
    inf = torch.isinf(torch.min(model_output)).item() or torch.isinf(torch.max(model_output)).item()
We can clean the code up, but I'm more keen on whether the model is going to learn and not get a nan loss.
good sleuthing.
we know randn is quirky on MPS. randn_like
will generate the random numbers using the same device as the input tensor, so yeah we're exposed to that quirkiness here.
I wonder what would happen if — instead of that while loop — you replaced torch.randn_like(x_start)
with torch.randn_like(x_start, device='cpu')
…
could use torch.randn_like(x_start, device='cpu' if x_start.device.type == 'mps' else x_start.device)
to be considerate to CUDA users.
nevermind, sounds like you're saying the inf lies within the model_output, not the random noise?
I tried some while loops to generate random numbers on MPS, and didn't get inf out of the random function. so probably no need to try my random-on-CPU idea.
Can we reuse fix_func from generate.py?
https://github.com/lstein/stable-diffusion/blob/development/ldm/generate.py#L45
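For context, fix_func at that line basically monkey-patches the torch RNG entry points so the numbers are generated on the CPU and then moved to the requested device. Roughly this shape, from memory; see the linked line for the real code:
import torch

def fix_func(orig):
    # Only patch when MPS is available: generate on CPU, then move to the requested device.
    if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        def new_func(*args, **kw):
            device = kw.get('device', 'mps')
            kw['device'] = 'cpu'
            return orig(*args, **kw).to(device)
        return new_func
    return orig

torch.rand = fix_func(torch.rand)
torch.randn = fix_func(torch.randn)
torch.randn_like = fix_func(torch.randn_like)
# ...and the rest of the rand*/bernoulli/multinomial functions in that block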
Yes, model_output goes to -inf/inf, but it's affected by the noise. We'd need new noise to just try again (if this fix works).
Update: the issue with the random generation was that it was being imported from ldm/util.py, and for some reason, if you invoked the function several times, it gave the same result. So:
from ldm.util import (
default,
)
noise = default(noise, lambda: torch.randn_like(x_start))
noise2 = default(noise, lambda: torch.randn_like(x_start))
print(noise == noise2) # True
but
noise = torch.randn_like(x_start)
noise2 = torch.randn_like(x_start)
print(noise == noise2) # False
Oh, I know why it gave the same result. Because I'm overwriting noise, and it's no longer None. Duh.
So for the second noise2 = default(noise, lambda: torch.randn_like(x_start))
, noise is no longer None, and the function never gets called.
Well, both options should work then, within the loop.
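For reference, the default helper in ldm/util.py is roughly just this, which is why the lambda is only evaluated when val is None:
from inspect import isfunction

def exists(x):
    return x is not None

def default(val, d):
    # return val unless it's None; otherwise call d if it's a function, else return d
    if exists(val):
        return val
    return d() if isfunction(d) else d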
Can we reuse fix_func from generate.py?
https://github.com/lstein/stable-diffusion/blob/development/ldm/generate.py#L45
fix_func
was introduced because -S
(seed) was not working for k_euler_a
and another sampler, if I remember correctly.
I guess you can re-use it, yes.
PS: What do you plan to use it for?
I stopped training (Control + C) at 5120 steps [Epoch 17 7%]
Best result: 'val/loss_simple_ema' reached 0.01681 (best 0.01681)
-> Epoch 6
As suggested, I let it run for 5,000+ steps.
I tried some while loops to generate random numbers on MPS, and didn't get inf out of the random function. so probably no need to try my random-on-CPU idea.
Ah okay then fix_func
also makes no sense.
I'm more keen on whether the model is going to learn and not get loss
nan
Very interested to see the results!
I copy-pasted the entire fix_func block (which includes randn_like
) to the top of main.py, and so far I am at Epoch 3 with no nan-bomb.
This is the furthest I have gotten, so it seems the issue is indeed with the mps rand?
(I also merged development
earlier and rebuilt my conda env. I had to update tensorboard to allow pytorch-lightning 1.7.5. see https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1)
Epoch 3: 23%|███████████████████████████████▍ | 92/404 [01:56<06:34, 1.26s/it, loss=0.127, v_num=0, train/loss_simple_step=0.184, train/loss_vlb_step=0.000784, train/loss_step=0.184, global_step=1291.0, train/loss_simple_epoch=0.107, train/loss_vlb_epoch=0.00197, train/loss_epoch=0.107]
cc https://github.com/lstein/stable-diffusion/issues/397#issuecomment-1240679294
EDIT: Up to Epoch 5. The images in the training logs are also not black anymore!
Here are my results.
I trained with the following 3 images for 17 Epochs (0 through 16).
Every Epoch consists of 303 steps (101 × the number of images in the training dataset), plus some (varying per Epoch) DDIM sampling runs (200 iterations each). The last 3 steps of each Epoch (for 3 images) might be dropped, because I stopped at step 5120, which is 300 × 17 + 20 = 5120 steps. Those 20 steps are from Epoch 17, where I stopped.
Epoch | Global step | Time (MM:SS) | s/it | DDIM |
---|---|---|---|---|
0 | 300 | 08:38 | 1.71 | |
1 | 600 | 18:54 | 3.74 | Yes |
2 | 900 | 09:33 | 1.89 | |
3 | 1200 | 12:52 | 2.55 | Yes |
4 | 1500 | 23:28 | 4.65 | Yes |
5 | 1800 | 09:56 | 1.97 | |
6 | 2100 | 13:28 | 2.67 | Yes |
7 | 2400 | 10:06 | 2.00 | |
8 | 2700 | 13:51 | 2.75 | Yes |
9 | 3000 | 27:18 | 5.41 | Yes |
10 | 3300 | 26:32 | 5.26 | |
11 | 3600 | 16:32 | 3.28 | Yes |
12 | 3900 | 10:15 | 2.03 | |
13 | 4200 | 14:10 | 2.81 | Yes |
14 | 4500 | 23:44 | 4.72 | Yes |
15 | 4800 | 11:22 | 2.25 | |
16 | 5100 | 16:53 | 3.35 | |
Time per epoch mostly depends on the number of DDIM samplings in that epoch. Overheating and running other applications (especially if streaming) are also a factor. From what I observed, peak RAM usage may have been about 50GB. It didn't look very problematic, though.
The embedding files seem to be saved in a repeating 600, 600, 300-step pattern. These are all of the created files:
embeddings_gs-600.pt
embeddings_gs-1200.pt
embeddings_gs-1800.pt
embeddings_gs-2100.pt
embeddings_gs-2700.pt
embeddings_gs-3300.pt
embeddings_gs-3600.pt
embeddings_gs-4200.pt
embeddings_gs-4800.pt
embeddings_gs-5100.pt
embeddings.pt
Running the following command a photo of * -m k_euler -s 10 -n3
and loading each of these files, we get:
embeddings_gs-600.pt
embeddings_gs-1200.pt
embeddings_gs-1800.pt
embeddings_gs-2100.pt
embeddings_gs-2700.pt
embeddings_gs-3300.pt
embeddings_gs-3600.pt
embeddings_gs-4200.pt
embeddings_gs-4800.pt
embeddings_gs-5100.pt
embeddings.pt
Yes, it did run (no nan), but it did not seem to learn. Which is a bit surprising, because in images/train and images/val there are a bunch of burger images (others are black).
I will have to investigate tomorrow. @tmm1 let us know how your training goes
PS: I have to add that my best epoch was the 7th (of 17), which is not a great sign. Anyway, the best val/loss_simple_ema was 0.016806211322546005, around step 1800.
any idea whether the newfound success is because of CPU rand, or because you updated dependencies?
I got to 4400 embeddings and stopped. No nan, and I'm seeing no black images in train/ or val/.
I am re-running now without the fix_func to make sure it still fails.
Epoch 3: 23%|███████████████████████████████▍ | 92/404 [01:56<06:34, 1.26s/it, loss=0.127, v_num=0, train/loss_simple_step=0.184, train/loss_vlb_step=0.000784, train/loss_step=0.184, global_step=1291.0, train/loss_simple_epoch=0.107, train/loss_vlb_epoch=0.00197, train/loss_epoch=0.107]
1.26s/it is fast! Do you have the 128GB RAM M1?
64GB M1 Ultra
I hit nan at Epoch 1: 99/404 without the fix_funcs.
I have 64GB M1 Max. That may explain the speed difference.
I'll have to try tomorrow with fix_func. Do you get good results using your embeddings.pt files?
Edit: Okay, about my results above: I was using a photo of *, but it was trained on a photo of a *. At least it DOES seem to be producing burgers.
I'll update my results on the comment above tomorrow then.
What I wonder is if we can remove a lot of these phrases. Would it work the same (and be faster) if we only trained with 'a photo of *'? Also, I wonder if it can be used to learn your face, or if that is too specific and would create a random person, since these burgers are NOT the same ones I trained it on.
a close-up photo of a * in the style of Van Gogh -s 15
I'm getting strange results with embeddings.pt also.
It almost seems like the * does nothing. When I was using huggingface embeddings, I used a different placeholder phrase, like <ugly-sonic>, which worked better.
As an example I tried the same seed for one prompt and removed '*', and got the exact same image back.
These are almost identical:
[25] outputs/img-samples/000369.1.png: "a photo of a *" -s 50 -W 512 -H 512 -C 7.5 -A k_lms -S 1
[26] outputs/img-samples/000370.1.png: "a photo of a" -s 50 -W 512 -H 512 -C 7.5 -A k_lms -S 1
The same prompt and seed give a different result when --embedding_path is omitted, so maybe '*' maps to some type of base that is already included in the prompt?
Yes, you are right
"a close-up photo of a * in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324
and
"a close-up photo of a in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324
Also without 'a'
"close-up photo of a in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324
And if we remove other parts, like 'close-up' or 'photo'
"a photo of * in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324
"close-up of a in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324
Even in the extreme case, where we remove everything ('a close-up photo of a *'), it has some resemblance.
"in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324
I guess it's mostly about the seed (to get similar results) even if parts of the prompt are missing.
And yes, there seems to be a starting point already
e.g.
"A beautiful waterfall" -s50 -W512 -H512 -C7.5 -Ak_lms -S4100999182
gives me
So it seems everything revolves around your training (in my case, burgers/food)? That is, when it behaves correctly. Other times, it seems to be just a random result.
Hmm, I'll try again tomorrow. So far it seems it loses context (e.g. a photo of a in a swimming pool, in New York, etc. None of that works: no swimming pool, no New York. Even with people, like Emma Watson, it outputs nothing similar, just either burgers, not exactly the ones it was trained on, or random images). I guess the training is the problem?
Otherwise we could just use the current model and say a burger in New York, which gives
which is a burger, presumably in New York.
The only point I see of training is if it learns a new thing (e.g. your face) and can at least merge it with some context (e.g. in New York). If it can only output something similar (e.g. another face) and loses context, I don't see the usefulness.
But I must say the ugly-sonic embedding worked much, much better, so I'm hoping it's just the training.
@i3oc9i @EliasOenal @Vargol @heurihermilab @krummrey Just a heads up that Textual Inversion (kinda) works on M1, in case you want to train and share your results to help improve it. We have finally moved past the nan loss problem.
Now it's all about how to train properly (number of images, learning rate, number of epochs, sampler, etc.).
@Any-Winter-4079
@i3oc9i @EliasOenal @Vargol @heurihermilab @krummrey Just a heads up that Textual Inversion (kinda) works on M1. In case you want to train and share your results, to help improve it. We finally have moved past a
nan
loss problem.
Thank you a lot for this information, I will give it a try during this week.
FYI, I trained to 18k steps overnight without any nan issues.
Epoch 46: 46%|███▏ | 187/404 [12:55<14:59, 4.15s/it, loss=0.083, v_num=0, train/loss_simple_step=0.102, train/loss_vlb_step=0.000348, train/loss_step=0.102, global_step=18586.0, train/loss_simple_epoch=0.111, train/loss_vlb_epoch=0.00233, train/loss_epoch=0.111]
Also I found an implementation of a different paper which offers much better textual inversion: https://github.com/lstein/stable-diffusion/issues/107#issuecomment-1250545275
@tmm1 I tried your fix_func solution to the rand issue (vs. the while-loop solution of re-generating the noise), and these are preliminary results, but I tend to prefer yours.
I tried training for 4 epochs while in class (until my battery almost died), and I got better times per epoch (~2s/it) than yesterday on average. My best val/loss_simple_ema
was better than yesterday (0.00699... vs. 0.01680...), although that is probably pure chance/luck. But most importantly, no black images.
So it seemed faster + no black images.
About results, a bit mixed again. Sometimes I find a good seed, like -S3320183151 and almost no matter the prompt, it produces burgers.
"in the style of Van Gogh" -s10 -W512 -H512 -C7.5 -Ak_lms -S3320183151
"in the style of Van Gogh a * painting" -s10 -W512 -H512 -C7.5 -Ak_lms -S3320183151
And then other seeds seem to produce unrelated content, no matter the prompt.
What val/loss_simple_ema
did you obtain after 18k steps? And the results... are they good/better?
About
Also I found an implementation of a different paper which offers much better textual inversion: https://github.com/lstein/stable-diffusion/issues/107#issuecomment-1250545275
I will try to test this because it's literally what I've been trying to get.
@Any-Winter-4079 Training on my work is definitely a goal of mine, and textual inversion is the closest I've seen, so definitely interested in testing. Everything upthread is a lot to grep quickly, though, and to me the inner workings are a bunch of black boxes.
So please tell me, can testing be done with the current development branch, or is there another commit (or patch etc) I should test off of instead? As long as I know I'm starting correctly I can puzzle through the command line and see what comes out.
Also, should I limit training input to photographic imagery? I've got a lot of abstract mathematical visual work that is relatively unique and may offer an easily-detectable signal.
That is the code: https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1
You can see there are 4 files changed, e.g. in ldm/data/personalized.py
(red is removed, green is added)
You can add those changes to your local code. One thing: make sure to have (or update to) pytorch-lightning==1.7.5.
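For example, something along these lines should do it (adjust to your own environment):
pip install --upgrade pytorch-lightning==1.7.5 tensorboard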
About photographic imagery, I don't know. It's all very green and new and most things we are finding out by trial and error. I suggest you try and report your findings/discoveries!
I get this error TypeError: __init__() got an unexpected keyword argument 'reg'
trying to adapt https://github.com/XavierXiao/Dreambooth-Stable-Diffusion to this repo.
@tmm1 not sure if you get the same.
PS: They use pytorch-lightning==1.5.9, which is not good for us (we need pytorch-lightning==1.7.5 for MPS). Hope we can adapt it just like we did for Textual Inversion.
Update: Okay, about the error: I forgot to update ldm/data/personalized.py.
But most importantly, no black images.
Awesome, so it seems we need more than just randn_like fixed for proper operation, and copying all the fix_func is the right solution until pytorch figures out mps rand issues upstream.
I got this repo https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/blob/main/ldm/data/personalized.py to run for Epoch 0, even if with dummy data (3 training and 3 testing images). 60GB RAM at peak, 23.76s/it, 13 minutes total, 29 steps completed. To get to the 800 steps they recommend, that'd translate to 358 minutes, or about 6 hours. Now, we may need a lot more images in the training and testing sets.
I didn't clone the repo though. Simply brought some files to my local version, like main.py
, personalized.py
... I might've even missed something. I just didn't want to re-do all the MPS changes we have in this repo...
Nice!
Did you create some regularization images too? Seems like that is a big part of how it learns what is different in your training set compared to generic versions of that same thing.
I created 6 burger images and split them, 3 in training_data
and 3 in reg_data
.
so you copied every fix_func'd function?
looks like it came from here:
https://github.com/lstein/stable-diffusion/pull/579
sounds like the original intention was to improve determinism. but seems it has the happy side-effect of preventing ±Inf? it's Inf and not NaN?
copying all the fix_func is the right solution until pytorch figures out mps rand issues upstream.
there's no issue or minimal repro currently; the pytorch team don't currently know that randomness sometimes returns ±Inf:
https://github.com/pytorch/pytorch/issues?q=is%3Aissue+MPS
can we wrangle a minimal repro for them? they're pretty responsive but I think the MPS specialists are a small team and really benefit from any investigation we can do.
WIP HERE: https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1
I started experimenting with running main.py on M1 and wanted to document some immediate issues.
Looks like we need a newer pytorch-lightning for MPS. We're currently using 1.6.5, but the latest is 1.7.5. However, bumping it causes an error, which is because TestTubeLogger was deprecated: https://github.com/Lightning-AI/lightning/issues/13958#issuecomment-1200780456
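The usual workaround once TestTubeLogger is gone is to swap in a logger that still ships with pytorch-lightning 1.7.x. A sketch only; the trainer_kwargs/logdir names are the ones used in the CompVis-style main.py, so the exact wiring may differ:
from pytorch_lightning.loggers import CSVLogger  # or TensorBoardLogger

# sketch: replace the deprecated TestTubeLogger with a logger that exists in PL 1.7.x
trainer_kwargs['logger'] = CSVLogger(save_dir=logdir, name='csvlogs')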