I saw your config. I normally use:
total_samples_per_epoch=256, total_batch_size=128
I think you are using much lower numbers for these; can you try setting them to the values above?
If you have made any other changes in the config, let me know.
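For concreteness, here is a minimal sketch of what those two numbers imply; this is just an illustration of the ratio, not the repo's exact config layout:

```python
# Values from the discussion above; the actual AlignProp config
# structure may differ -- this only illustrates the relationship.
total_samples_per_epoch = 256
total_batch_size = 128

# With these values each epoch performs 256 / 128 = 2 optimizer steps.
steps_per_epoch = total_samples_per_epoch // total_batch_size
print(steps_per_epoch)  # 2
```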
The other configs are the same; I reduced these to run on a T4 machine.
Let me try with total_samples_per_epoch=256 and total_batch_size=128.
Reducing the batch size will probably work, but I think you should also reduce the learning rate along with it.
This issue might be happening due to a learning rate that is too high.
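If you do train at a smaller batch size, one common heuristic is the linear scaling rule (Goyal et al., 2017): shrink the learning rate by the same factor as the batch size. A quick sketch, where `base_lr` and the batch sizes are placeholders for whatever is in your config:

```python
# Linear scaling rule: learning rate scales with batch size.
# base_lr and base_batch_size are placeholders, not the repo's defaults.
base_lr = 1e-4          # hypothetical LR tuned for the base batch size
base_batch_size = 128   # batch size that base_lr was tuned for
new_batch_size = 32     # e.g. what fits on a single T4

scaled_lr = base_lr * new_batch_size / base_batch_size
print(scaled_lr)  # 2.5e-05
```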
I have started another run with a batch size of 128, as you suggested. All the other settings are the same except the capacity per GPU.
I will know the result in a couple of hours.
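For what it's worth, keeping total_batch_size fixed while lowering the per-GPU capacity usually means the effective batch is rebuilt through gradient accumulation; whether this trainer does that automatically is an assumption on my part. Roughly:

```python
# Hypothetical relationship -- assumes the trainer accumulates gradients
# so that: total_batch_size = per_gpu_capacity * num_gpus * accum_steps.
total_batch_size = 128
per_gpu_capacity = 8   # placeholder: what fits in T4 memory
num_gpus = 1

accum_steps = total_batch_size // (per_gpu_capacity * num_gpus)
print(accum_steps)  # 16 forward/backward passes per optimizer step
```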
Hi,
I did some experiments to reproduce your results, but the model seems to lose all context after a certain number of epochs.
I am attaching the report here https://wandb.ai/sachin931350/align-prop/runs/ngkluhfs/overview
Please let me know what I am doing wrong.