Hi, it doesn't look like the correct behavior. Which part of the model are you training? Is it the autoencoder or the U-Net?
The LDM step
The autoencoder came out quite well
I also found this odd: from step 1 of LDM training the loss was about 0.01, and it is now down to roughly 0.0009, but with the above results.
Looking at the reconstruction, it seems that the autoencoder weights are not loaded correctly. Do the console logs look correct when starting or resuming the training?
Here is the full training start log:
https://pastecode.io/s/jd5h5pw1
The validation sanity check does come up, but it says 0/2, even though when training the autoencoder it said 0/63.
Would that be the problem?
And you can see that it does detect the 63 validation samples:
```
#### Data #####
train, MatFuseDataset, 173319
validation, MatFuseDataset, 63
```
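(Aside: the 0/2 readout is most likely just PyTorch Lightning's pre-training sanity check, which by default runs only two validation batches regardless of the dataset size. A minimal sketch of the relevant flag, assuming MatFuse uses a stock Lightning `Trainer`:)

```python
from pytorch_lightning import Trainer

# Before training starts, Lightning runs a short "sanity check" over the
# validation set. By default it evaluates only 2 batches, which is why the
# progress bar reads 0/2 even though the validation split has 63 samples.
trainer = Trainer(num_sanity_val_steps=2)  # default; set to -1 to check the full val set
```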
Ah dang, it says missing keys for all of those; I thought those were found keys.
I assume that's the problem, haha.
It's just the raw .ckpt file that the autoencoder training step produced.
Yes, that's the issue! It looks like the checkpoint loading is not correctly looking for the weights under "state_dict". Fixing it right away! In the meantime, you can look at the init_from_checkpoint method in the base VQModel.
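For reference, a minimal sketch of what such a loader typically looks like; the function name, signature, and prints here are illustrative rather than the exact MatFuse code, and it assumes a standard PyTorch Lightning .ckpt layout:

```python
import torch

def init_from_checkpoint(model, ckpt_path):
    """Load weights from a Lightning checkpoint into `model`.

    Lightning nests the weights under the "state_dict" key; loading the
    raw checkpoint dict directly makes every key show up as missing.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # unwrap if nested
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {missing}")
    print(f"unexpected keys: {unexpected}")
```

A long "missing keys" readout at startup, like the one above, is the telltale sign that the wrapper dict was loaded instead of the nested weights.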
Ok, should be fixed in f0e773c!
OK, it started without any key readout. I'll let this run and get back to you.
Thank you!
OK, training looks correct now, thanks again.
I'm on epoch 11 now with a 130000-sample dataset; all the validation outputs haven't changed since the first epochs, but the loss is still going down.
[image: reconstruction]
[image: samples quantized]
I'm of course fine with waiting longer for training, but is this correct?