ruotianluo / self-critical.pytorch

Unofficial PyTorch implementation of Self-critical Sequence Training for Image Captioning, and other methods.
MIT License

Version compatibility of pytorch-lightning #282

Open kaelsunkiller opened 1 year ago

kaelsunkiller commented 1 year ago

May I ask which version of pl you used to develop this codebase?

I tried the newest 2.0 but hit lots of bugs: deprecated params and functions, etc. So I downgraded to 1.5, with the compatible torch 1.8.0 and torchmetrics, but training still gets stuck at step 1770/1850 of epoch 0, which is very confusing.
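(For reference, a quick check like the one below shows which versions are actually loaded; nothing repo-specific, just a sanity check of the pins mentioned above.)

```python
# Quick sanity check of the versions discussed above (nothing repo-specific).
import torch
import pytorch_lightning as pl
import torchmetrics

print("torch:", torch.__version__)
print("pytorch-lightning:", pl.__version__)
print("torchmetrics:", torchmetrics.__version__)
```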

I think it may have entered the validation step, judging by the following warning from pl:

/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the 'batch_size' from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use 'self.log(..., batch_size=batch_size)'.

The batch size changed to 1, and this warning is new in pl 1.5. I don't know whether it causes any error in the computation.
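(The warning itself suggests the fix: pass the batch size explicitly to `self.log`. A minimal sketch of what that would look like in a LightningModule; the module name and loss here are placeholders, not this repo's actual code.)

```python
import pytorch_lightning as pl
import torch

class CaptioningModule(pl.LightningModule):
    """Hypothetical module; only the explicit batch_size logging matters."""

    def validation_step(self, batch, batch_idx):
        images, targets = batch          # assumes (images, targets) batches
        loss = torch.tensor(0.0)         # stand-in for the real captioning loss
        # Passing batch_size explicitly stops Lightning from trying to infer
        # it from an ambiguous collection, which is what the warning reports.
        self.log("val_loss", loss, batch_size=images.size(0))
        return loss
```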

Back to the stuck issue: I waited for more than 30 minutes, which is much longer than the ETA for training one epoch. Still stuck, no errors or warnings, desperate...

There are too many uncertain issues with pl training, so I have to ask which version works with this codebase. Thanks a lot!

ruotianluo commented 1 year ago

I should have something working for 2.0. Let me push.

kaelsunkiller commented 1 year ago

> I should have something working for 2.0. Let me push.

That would be great!

BTW, I found the problem. It's probably caused by the size change of the last batch when using multiple GPUs in pl. My batch size is set to 64 with 8 GPUs, so the last batch fed to the GPUs has only 7 samples (which is incompatible with 8 GPUs). I then added drop_last=True to the train dataloader, but it still got stuck at the last step of validation (with 5000 validation images the last batch should have 8 samples, which is compatible with 8 GPUs: batch size 1 per GPU). So I think my environment may have an issue with batch size 1, or just with the batch size changing mid-epoch. A sketch of the drop_last change is below.
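(For reference, the drop_last change looks roughly like this; the dataset is a dummy stand-in, not the repo's actual COCO loader.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in dataset; the repo builds its own COCO dataloader.
train_dataset = TensorDataset(torch.randn(200, 3, 8, 8))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,     # matches the setup described above
    shuffle=True,
    num_workers=4,
    drop_last=True,    # drop the short final batch so every GPU sees a full one
)
```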