Open DidiD1 opened 2 months ago
Honestly, none of the weighting tricks really seem relevant to finetuning SD3; not using the timestep weighting gives better results.
Could you give some more details? Thanks a lot.
Yes, if you look at the timestep selection distribution under SD3-style training, it effectively never trains the 900-1000 or 0-100 range of timesteps; they are just ignored:
Ignoring the gaps in the chart here (wandb was having issues), the timestep selection at the end is where I switched to uniform sampling, and the model started learning composition and details properly.
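As an illustration of why the extremes get starved, here is a minimal numpy sketch of logit-normal timestep sampling (assuming location 0 and scale 1; the actual parameters in any given training config may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_normal_timesteps(n, loc=0.0, scale=1.0, num_train_timesteps=1000):
    """Sample timesteps by squashing normal draws through a sigmoid."""
    u = rng.normal(loc, scale, size=n)
    t = 1.0 / (1.0 + np.exp(-u))  # sigmoid -> values in (0, 1)
    return (t * num_train_timesteps).astype(int)

ts = logit_normal_timesteps(100_000)
frac_low = np.mean(ts < 100)                  # low-noise tail
frac_high = np.mean(ts >= 900)                # high-noise tail
frac_mid = np.mean((ts >= 400) & (ts < 600))  # bulk of the mass
```

With these parameters each tail receives only a percent or two of the samples, so over a finite finetuning run the 0-100 and 900-1000 ranges are effectively never visited.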
This phenomenon was mentioned in the SD3 paper; it may be why they proposed the 'mode sampling with heavy tails' time-sampling method. However, it's strange that in their experimental results 'log-norm' is much better than 'mode' and uniform sampling. So I guess each sampling method has its own advantages, and experiments are needed to validate which one suits your own task.
it just needs an absolutely enormous batch size for these to make sense.
edit: also worth noting these parameters are likely dependent on model size, the same way LR scales with model size when not using microsoft/mup
Thanks a lot. And for my question 3: "when we use logit_normal, it is based on the RF setting, so the weight of the loss should be t/(1-t), but the code doesn't compute that weight — it uses torch.ones_like(sigmas) instead?" Do I need to modify the loss weight?
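For concreteness, here is what the two weightings look like side by side (a numpy sketch; `sigmas` just stands in for the sampled t values, and whether the RF weight should actually be applied is exactly the open question):

```python
import numpy as np

sigmas = np.array([0.1, 0.5, 0.9])  # example noise levels t in (0, 1)

# What the code reportedly uses: a uniform weight of 1 for every timestep.
uniform_weight = np.ones_like(sigmas)

# The RF-style weight the question proposes: t / (1 - t), which grows
# without bound as t -> 1, so high-noise timesteps dominate the loss.
rf_weight = sigmas / (1.0 - sigmas)
```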
Thanks a lot
@bghira Hi bghira~ I'd like to know: when you tried "SD3-style training (lognorm sampling)" versus "uniform sampling", what was the difference in training loss? When you switched to uniform sampling, did it help lower the loss curve? In my uniform-sampling training there are still some artifacts in the generated images, so I wonder which part of the noise sampling matters for fixing this. I'd like to hear your insights, thanks~
Currently we're using sigmoid sampling for timesteps, which seems fine, but no one has really ablated whether it leaves fine details out.
Actually, sigmoid and lognorm are mathematically equivalent. But I'm curious why existing open-source training implementations don't use timeshift during training, while the SD3 paper does.
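The equivalence is essentially by definition: a logit-normal variable is exactly a normal draw passed through a sigmoid, so "sigmoid sampling" and "lognorm sampling" produce the same distribution. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=10_000)  # standard normal draws

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

# "Sigmoid sampling": squash the normal draws into (0, 1).
t = sigmoid(u)

# Applying the inverse map (logit) recovers the normal draws, which is
# precisely the statement that t is logit-normally distributed.
recovered = logit(t)
assert np.allclose(recovered, u)
```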
In fact, the diffusers version for SD3 does use timeshifting. You can see it in the init config of FlowMatchEulerDiscreteScheduler:
{
  "_class_name": "FlowMatchEulerDiscreteScheduler",
  "_diffusers_version": "0.29.0.dev0",
  "num_train_timesteps": 1000,
  "shift": 3.0
}
sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)
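Plugging in shift = 3.0 from that config, a quick sketch of what the remapping does to a sigma schedule:

```python
import numpy as np

shift = 3.0  # value from the SD3 scheduler config
sigmas = np.linspace(0.0, 1.0, 5)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# The timeshift map: sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)
shifted = shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

# The endpoints 0 and 1 are fixed points of the map; every interior sigma
# moves toward 1, so more of the schedule sits at high noise levels.
```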
Thanks to Rafie Walker's code we can try to train SD3 models with flow-matching! But some places don't seem to match what's in the paper. Rafie Walker's code is below:
My question is below:
So I think some modifications are needed to correctly compute the SD3 loss! Thanks for discussing this together!
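Since the code itself isn't reproduced here, for discussion's sake here is a generic rectified-flow training step in numpy (a sketch only; the names and shapes are illustrative stand-ins, not SD3's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

latents = rng.normal(size=(2, 4))  # stand-in for clean VAE latents
noise = rng.normal(size=(2, 4))    # stand-in for Gaussian noise
t = rng.uniform(size=(2, 1))       # per-sample timestep in (0, 1)

# Linear interpolation between data and noise (the flow-matching path):
noisy = (1.0 - t) * latents + t * noise

# The rectified-flow target is the constant velocity along that path:
target = noise - latents

model_out = target  # a hypothetical perfect model, for illustration
loss = np.mean((model_out - target) ** 2)  # 0.0 for the perfect model
```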