lucidrains / magvit2-pytorch

Implementation of MagViT2 Tokenizer in Pytorch
MIT License

train on video dataset #7

Closed Y-ichen closed 12 months ago

Y-ichen commented 1 year ago

Thanks a lot for your implementation! Can this tokenizer be trained on a video dataset in the current version? I found that its recon_loss is very large and does not converge, and discr_loss does not converge either.

Here are the losses on the video dataset:

[screenshot: recon_loss and discr_loss curves on the video dataset]
lucidrains commented 1 year ago

your x-axis, is that number of steps? you only did 600 training steps?

Y-ichen commented 1 year ago

Yes, these are the results after only 600 training steps. I trained magvit in an unconditional manner on the UCF101 dataset.

During training, I noticed the initial recon_loss value was very large (around 2e+4), so I checked the tensor value ranges used when computing recon_loss between the input video and the reconstructed video. The input video values were in the range 0.0 to 255.0, while the reconstructed video values were roughly in the range -1.0 to 1.0.
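A minimal sketch of the rescaling I applied when loading videos, assuming frames come in as uint8 tensors in [0, 255] (normalize_video is my own helper name, not something from the repo):

```python
import torch

def normalize_video(video: torch.Tensor) -> torch.Tensor:
    # video: (channels, frames, height, width), uint8 values in [0, 255];
    # map linearly so 0 -> -1.0 and 255 -> 1.0, matching the decoder's output range
    return video.float() / 127.5 - 1.0
```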

Therefore, I additionally normalized the data when loading videos, rescaling the tensor range to -1.0 to 1.0. With this, the initial recon_loss is around 0.3, but discr_loss is still around 2.0, much larger than recon_loss. I'm not sure whether this imbalance affects training, so I shrank discr_loss a bit by setting a discr_weight of 0.1 to balance it against recon_loss (see the sketch after the plot below). The initial losses then become roughly recon_loss=0.3 and discr_loss=0.2. Here are my new results after 3k training steps with these settings:

[screenshot: loss curves after 3k training steps with normalization and discr_weight=0.1]
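For reference, the balancing I mean amounts to something like this (an assumed sketch of the combined objective, not the repo's exact code; discr_weight may not be the trainer's actual argument name):

```python
# weight the discriminator term down so it starts at roughly the same
# magnitude as the reconstruction term (2.0 * 0.1 = 0.2 vs recon_loss ~ 0.3)
discr_weight = 0.1
total_loss = recon_loss + discr_weight * discr_loss
```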

I'm retraining as above now - should I increase the training steps to at least 20k? And should I apply this normalization of the loaded video tensor range?

Y-ichen commented 12 months ago

Fixed

xesdiny commented 8 months ago

how did you do it?