Hi, it doesn't look like the correct behavior. Which part of the model are you training? Is it the autoencoder or the U-Net?
The LDM step
The autoencoder came out quite well
I also found this odd: from step 1 of LDM training the loss was about 0.01, and it is now down to roughly 0.0009, but with the above results.
Looking at the reconstruction, it seems that the autoencoder weights are not loaded correctly. Do the console logs look correct when starting or resuming the training?
Here is the full training start log:
https://pastecode.io/s/jd5h5pw1
The validation sanity check does come up, but it says 0/2, even though when training the autoencoder it said 0/63.
Would that be the problem?
And you can see that it does detect the 63 validation samples:
```
#### Data #####
train, MatFuseDataset, 173319
validation, MatFuseDataset, 63
```
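(Aside: the 0/2 readout is most likely just PyTorch Lightning's pre-training sanity check, which by default runs only two validation batches regardless of the dataset size. A minimal sketch of the relevant flag, assuming MatFuse uses a stock Lightning `Trainer`:)

```python
from pytorch_lightning import Trainer

# Before training starts, Lightning runs a short "sanity check" over the
# validation set. By default it evaluates only 2 batches, which is why the
# progress bar reads 0/2 even though the validation split has 63 samples.
trainer = Trainer(num_sanity_val_steps=2)  # default; set to -1 to check the full val set
```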
Ah dang, it says missing keys for all of those; I thought those were found keys.
I assume that's the problem, haha.
It's just the raw .ckpt file that the autoencoder training step produced.
Yes, that's the issue! It looks like the checkpoint loading is not correctly looking for the weights under "state_dict". Fixing it right away! In the meantime, you can look at the init_from_checkpoint method in the base VQModel.
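For reference, a minimal sketch of what such a loader typically looks like; the function name, signature, and prints here are illustrative rather than the exact MatFuse code, and it assumes a standard PyTorch Lightning .ckpt layout:

```python
import torch

def init_from_checkpoint(model, ckpt_path):
    """Load weights from a Lightning checkpoint into `model`.

    Lightning nests the weights under the "state_dict" key; loading the
    raw checkpoint dict directly makes every key show up as missing.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # unwrap if nested
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {missing}")
    print(f"unexpected keys: {unexpected}")
```

A long "missing keys" readout at startup, like the one above, is the telltale sign that the wrapper dict was loaded instead of the nested weights.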
Ok, should be fixed in f0e773c!
OK, it started without any key readout. I'll let this run and get back to you.
Thank you!
OK, training looks correct now, thanks again.
I'm on epoch 11 now with a 130000-sample dataset; all the validation outputs haven't changed since the first epochs, but the loss is still going down.
[image: reconstruction]
[image: samples quantized]
I'm of course fine with waiting longer for training, but is this correct?