Loss flattens out after 50k training steps

I'm currently training Imagen on two similar medical video datasets (one with over 10,000 videos and the other with around 500) to generate 32x32 videos. I've noticed that both models start out with a very high average loss of 1000 or more, which then begins to flatten out around 50.

Below are two of my ongoing experiments:
https://wandb.ai/alif-munim/imagen-echonet
https://wandb.ai/alif-munim/imagen-uhn

I previously ran similar experiments for over 100k steps and saw very similar results. I would love to hear from anyone who has successfully trained a text-to-video model. Is this the expected behavior? How many steps does it typically take before Imagen can generate reasonable videos?

lucidrains / imagen-pytorch