hmorimitsu / ptlflow

PyTorch Lightning Optical Flow models, scripts, and pretrained weights.
Apache License 2.0

training is not working for craft, flowformer, gmflownet, gmflow #46

Closed nihalgupta84 closed 1 year ago

nihalgupta84 commented 1 year ago
self._result = self.closure(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
    step_output = self._step_fn()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/models/base_model/base_model.py", line 229, in training_step
    loss = self.loss_fn(preds, batch)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/models/gmflow/gmflow.py", line 40, in forward
    flow_loss += i_weight * (valid[:, None] * i_loss).mean()
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 2
Epoch 0: 0%| | 0/30712 [00:08<?, ?it/s]

Can you please check the code?
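For reference, this is a minimal sketch of the RAFT/GMFlow-style sequence loss that the failing line in gmflow.py implements, together with the tensor shapes it expects. The shapes and names here are assumptions for illustration, not necessarily what ptlflow passes in:

```python
import torch

def sequence_loss(flow_preds, flow_gt, valid, gamma=0.8):
    # flow_preds: list of [B, 2, H, W] predictions, flow_gt: [B, 2, H, W],
    # valid: [B, H, W] mask of pixels with valid ground truth.
    n_predictions = len(flow_preds)
    flow_loss = 0.0
    for i, pred in enumerate(flow_preds):
        i_weight = gamma ** (n_predictions - i - 1)
        i_loss = (pred - flow_gt).abs()  # [B, 2, H, W]
        # valid[:, None] has shape [B, 1, H, W] and broadcasts over the 2 flow
        # channels; if valid or flow_gt arrive with different shapes, the
        # "size of tensor a must match tensor b" error above appears.
        flow_loss += i_weight * (valid[:, None] * i_loss).mean()
    return flow_loss

# Example with consistent shapes: runs without error.
preds = [torch.zeros(1, 2, 8, 8) for _ in range(3)]
gt = torch.zeros(1, 2, 8, 8)
valid = torch.ones(1, 8, 8)
print(sequence_loss(preds, gt, valid))
```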

hmorimitsu commented 1 year ago

Thank you for reporting, I'll check it later.

Just please note that the training stage has not been tested, so there's no guarantee that the trained models will produce good results in the end.

Best,

nihalgupta84 commented 1 year ago

I'll try to check every stage and push the results.

But you need to check the flow estimator part.

nihalgupta84 commented 1 year ago

It would be greatly appreciated if you could add some documentation about training.

nihalgupta84 commented 1 year ago

I have tried training craft with the batch_size set manually, and it worked.

The PL Trainer's built-in auto_scale_batch_size is also not working.
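For reference, PL's batch-size finder is normally enabled like this (a sketch assuming the 1.6/1.7 Trainer API and a LightningModule that exposes a batch_size hyperparameter; ptlflow's train.py may wire it up differently):

```python
import pytorch_lightning as pl

# `model` is assumed to be a LightningModule whose dataloaders read
# `self.hparams.batch_size` (or `self.batch_size`); the tuner mutates it.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    auto_scale_batch_size="power",  # doubles the batch size until it runs out of memory
)
trainer.tune(model)  # runs the batch-size finder before training
trainer.fit(model)
```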

hmorimitsu commented 1 year ago

> It would be greatly appreciated if you could add some documentation about training.

There is documentation at https://ptlflow.readthedocs.io/en/latest/starting/training.html.

Is there anything specific that you think is missing?

hmorimitsu commented 1 year ago

I have pushed a fix for the losses in those models you mentioned.

I hope it is working now, but if not, let me know.

Best,

nihalgupta84 commented 1 year ago

While resuming the training, make_grid throws an error:

img_grid = self._make_image_grid(self.train_images)

File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid grid = make_grid(imgs, len(imgs)//len(dl_images)) ZeroDivisionError: integer division or modulo by zero

And one more error occurs while resuming training of the VCN model, because of an optimizer weights issue:

(ptlflow) anil@anil-gpu2:/media/anil/New Volume1/Nihal/ptlflow$ python3 train.py vcn --logger --enable_checkpointing --gpus 2 --log_every_n_steps 100 --enable_progress_bar True --max_steps 100000 --train_batch_size 1 --train_dataset chairs-train --val_dataset chairs-val --accelerator gpu --strategy ddp_sharded --resume_from_checkpoint "/media/anil/New Volume1/Nihal/ptlflow/ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_last_epoch=6_step=77812.ckpt"
05/04/2023 13:30:49 - INFO: Loading faiss with AVX2 support.
05/04/2023 13:30:49 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/04/2023 13:30:49 - INFO: Created a temporary directory at /tmp/tmps09nib89
05/04/2023 13:30:49 - INFO: Writing /tmp/tmps09nib89/_remote_module_non_scriptable.py
05/04/2023 13:31:09 - INFO: Loading 640 samples from FlyingChairs dataset.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
05/04/2023 13:31:12 - INFO: Loading faiss with AVX2 support.
05/04/2023 13:31:12 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/04/2023 13:31:12 - INFO: Created a temporary directory at /tmp/tmp_bzotlcs
05/04/2023 13:31:12 - INFO: Writing /tmp/tmp_bzotlcs/_remote_module_non_scriptable.py
05/04/2023 13:31:35 - INFO: Loading 640 samples from FlyingChairs dataset.
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
05/04/2023 13:31:35 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
05/04/2023 13:31:35 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
05/04/2023 13:31:35 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/04/2023 13:31:35 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

Restoring states from the checkpoint path at /media/anil/New Volume1/Nihal/ptlflow/ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_last_epoch=6_step=77812.ckpt
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
05/04/2023 13:31:41 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
05/04/2023 13:31:41 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
05/04/2023 13:31:44 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/04/2023 13:31:44 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/04/2023 13:31:46 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters
05/04/2023 13:31:46 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters

| Name | Type | Params

0 | loss_fn | VCNLoss | 0
1 | train_metrics | FlowMetrics | 0
2 | val_metrics | FlowMetrics | 0
3 | pspnet | pspnet | 1.8 M
4 | f6 | butterfly4D | 49.4 K
5 | p6 | sepConv4d | 4.6 K
6 | f5 | butterfly4D | 49.4 K
7 | p5 | sepConv4d | 4.6 K
8 | f4 | butterfly4D | 49.4 K
9 | p4 | sepConv4d | 4.6 K
10 | f3 | butterfly4D | 48.4 K
11 | p3 | sepConv4d | 4.6 K
12 | flow_reg64 | flow_reg | 0
13 | flow_reg32 | flow_reg | 0
14 | flow_reg16 | flow_reg | 0
15 | flow_reg8 | flow_reg | 0
16 | warp5 | WarpModule | 0
17 | warp4 | WarpModule | 0
18 | warp3 | WarpModule | 0
19 | warpx | WarpModule | 0
20 | dc6_conv1 | Sequential | 221 K
21 | dc6_conv2 | Sequential | 147 K
22 | dc6_conv3 | Sequential | 147 K
23 | dc6_conv4 | Sequential | 110 K
24 | dc6_conv5 | Sequential | 55.5 K
25 | dc6_conv6 | Sequential | 18.5 K
26 | dc6_conv7 | Conv2d | 9.2 K
27 | dc5_conv1 | Sequential | 295 K
28 | dc5_conv2 | Sequential | 147 K
29 | dc5_conv3 | Sequential | 147 K
30 | dc5_conv4 | Sequential | 110 K
31 | dc5_conv5 | Sequential | 55.5 K
32 | dc5_conv6 | Sequential | 18.5 K
33 | dc5_conv7 | Conv2d | 18.5 K
34 | dc4_conv1 | Sequential | 369 K
35 | dc4_conv2 | Sequential | 147 K
36 | dc4_conv3 | Sequential | 147 K
37 | dc4_conv4 | Sequential | 110 K
38 | dc4_conv5 | Sequential | 55.5 K
39 | dc4_conv6 | Sequential | 18.5 K
40 | dc4_conv7 | Conv2d | 27.7 K
41 | dc3_conv1 | Sequential | 369 K
42 | dc3_conv2 | Sequential | 147 K
43 | dc3_conv3 | Sequential | 147 K
44 | dc3_conv4 | Sequential | 110 K
45 | dc3_conv5 | Sequential | 55.5 K
46 | dc3_conv6 | Sequential | 18.5 K
47 | dc3_conv7 | Conv2d | 37.0 K
48 | dc6_convo | Sequential | 702 K
49 | dc5_convo | Sequential | 776 K
50 | dc4_convo | Sequential | 849 K
51 | dc3_convo | Sequential | 849 K
52 | f2 | butterfly4D | 27.9 K
53 | p2 | sepConv4d | 2.6 K
54 | flow_reg4 | flow_reg | 0
55 | warp2 | WarpModule | 0
56 | dc2_conv1 | Sequential | 424 K
57 | dc2_conv2 | Sequential | 147 K
58 | dc2_conv3 | Sequential | 147 K
59 | dc2_conv4 | Sequential | 110 K
60 | dc2_conv5 | Sequential | 55.5 K
61 | dc2_conv6 | Sequential | 18.5 K
62 | dc2_conv7 | Conv2d | 43.9 K
63 | dc2_convo | Sequential | 905 K

10.3 M    Trainable params
0         Non-trainable params
10.3 M    Total params
41.243    Total estimated model params size (MB)
Traceback (most recent call last):
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in <module>
    train(args)
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 112, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1233, in _run
    self._checkpoint_connector.restore_training_state()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 204, in restore_training_state
    self.restore_optimizers_and_schedulers()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 306, in restore_optimizers_and_schedulers
    raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    train(args)
  File "train.py", line 112, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1233, in _run
    self._checkpoint_connector.restore_training_state()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 204, in restore_training_state
    self.restore_optimizers_and_schedulers()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 306, in restore_optimizers_and_schedulers
    raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'

hmorimitsu commented 1 year ago

Thank you. I'll take a look at make_grid later.

The resuming problem happens because your example is trying to resume from the "last" checkpoint, which does not contain the training states. To solve it, you should resume from the "train" checkpoint.

Hope it helps.
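For illustration, here is a minimal sketch of why the two kinds of checkpoint behave differently, using the standard PL 1.6/1.7 ModelCheckpoint API; the filename patterns are only examples, not ptlflow's exact settings:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# "last"-style checkpoint: only the model weights are stored, so
# --resume_from_checkpoint cannot restore the optimizer/scheduler from it
# and raises the KeyError shown above.
weights_only_ckpt = ModelCheckpoint(
    filename="vcn_last_{epoch}_{step}",
    save_weights_only=True,
)

# "train"-style checkpoint: the full training state (optimizer, scheduler,
# loop counters) is stored, so resuming from it works.
full_state_ckpt = ModelCheckpoint(
    filename="vcn_train_{epoch}_{step}",
    save_weights_only=False,
)
```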

nihalgupta84 commented 1 year ago

Thanks for the quick reply.

I'm still facing the issue while resuming the training, for all models.

The error comes from the make_grid function, so I tried to add an exception to handle it, but got more errors. Can you look into this?

(ptlflow) anil@anil-gpu2:/media/anil/New Volume1/Nihal/ptlflow$ python3 train.py vcn --logger --enable_checkpointing --gpus 2 --log_every_n_steps 1000 --enable_progress_bar True --max_steps 100000 --train_batch_size 2 --train_dataset chairs-train --val_dataset chairs-val --accelerator gpu --strategy ddp_sharded --resume_from_checkpoint "ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt"
05/05/2023 19:25:24 - INFO: Loading faiss with AVX2 support.
05/05/2023 19:25:24 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/05/2023 19:25:24 - INFO: Created a temporary directory at /tmp/tmpyl682fg1
05/05/2023 19:25:24 - INFO: Writing /tmp/tmpyl682fg1/_remote_module_non_scriptable.py
05/05/2023 19:25:43 - INFO: Loading 640 samples from FlyingChairs dataset.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
05/05/2023 19:25:46 - INFO: Loading faiss with AVX2 support.
05/05/2023 19:25:46 - INFO: Successfully loaded faiss with AVX2 support.
Global seed set to 1234
05/05/2023 19:25:46 - INFO: Created a temporary directory at /tmp/tmpwrz_g7s4
05/05/2023 19:25:46 - INFO: Writing /tmp/tmpwrz_g7s4/_remote_module_non_scriptable.py
05/05/2023 19:26:04 - INFO: Loading 640 samples from FlyingChairs dataset.
Global seed set to 1234
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
05/05/2023 19:26:04 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
05/05/2023 19:26:04 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
05/05/2023 19:26:04 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
05/05/2023 19:26:04 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

Restoring states from the checkpoint path at ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
05/05/2023 19:26:38 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
05/05/2023 19:26:38 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
05/05/2023 19:26:40 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:40 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:41 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters
05/05/2023 19:26:41 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters

| Name | Type | Params

0 | loss_fn | VCNLoss | 0
1 | train_metrics | FlowMetrics | 0
2 | val_metrics | FlowMetrics | 0
3 | pspnet | pspnet | 1.8 M
4 | f6 | butterfly4D | 49.4 K
5 | p6 | sepConv4d | 4.6 K
6 | f5 | butterfly4D | 49.4 K
7 | p5 | sepConv4d | 4.6 K
8 | f4 | butterfly4D | 49.4 K
9 | p4 | sepConv4d | 4.6 K
10 | f3 | butterfly4D | 48.4 K
11 | p3 | sepConv4d | 4.6 K
12 | flow_reg64 | flow_reg | 0
13 | flow_reg32 | flow_reg | 0
14 | flow_reg16 | flow_reg | 0
15 | flow_reg8 | flow_reg | 0
16 | warp5 | WarpModule | 0
17 | warp4 | WarpModule | 0
18 | warp3 | WarpModule | 0
19 | warpx | WarpModule | 0
20 | dc6_conv1 | Sequential | 221 K
21 | dc6_conv2 | Sequential | 147 K
22 | dc6_conv3 | Sequential | 147 K
23 | dc6_conv4 | Sequential | 110 K
24 | dc6_conv5 | Sequential | 55.5 K
25 | dc6_conv6 | Sequential | 18.5 K
26 | dc6_conv7 | Conv2d | 9.2 K
27 | dc5_conv1 | Sequential | 295 K
28 | dc5_conv2 | Sequential | 147 K
29 | dc5_conv3 | Sequential | 147 K
30 | dc5_conv4 | Sequential | 110 K
31 | dc5_conv5 | Sequential | 55.5 K
32 | dc5_conv6 | Sequential | 18.5 K
33 | dc5_conv7 | Conv2d | 18.5 K
34 | dc4_conv1 | Sequential | 369 K
35 | dc4_conv2 | Sequential | 147 K
36 | dc4_conv3 | Sequential | 147 K
37 | dc4_conv4 | Sequential | 110 K
38 | dc4_conv5 | Sequential | 55.5 K
39 | dc4_conv6 | Sequential | 18.5 K
40 | dc4_conv7 | Conv2d | 27.7 K
41 | dc3_conv1 | Sequential | 369 K
42 | dc3_conv2 | Sequential | 147 K
43 | dc3_conv3 | Sequential | 147 K
44 | dc3_conv4 | Sequential | 110 K
45 | dc3_conv5 | Sequential | 55.5 K
46 | dc3_conv6 | Sequential | 18.5 K
47 | dc3_conv7 | Conv2d | 37.0 K
48 | dc6_convo | Sequential | 702 K
49 | dc5_convo | Sequential | 776 K
50 | dc4_convo | Sequential | 849 K
51 | dc3_convo | Sequential | 849 K
52 | f2 | butterfly4D | 27.9 K
53 | p2 | sepConv4d | 2.6 K
54 | flow_reg4 | flow_reg | 0
55 | warp2 | WarpModule | 0
56 | dc2_conv1 | Sequential | 424 K
57 | dc2_conv2 | Sequential | 147 K
58 | dc2_conv3 | Sequential | 147 K
59 | dc2_conv4 | Sequential | 110 K
60 | dc2_conv5 | Sequential | 55.5 K
61 | dc2_conv6 | Sequential | 18.5 K
62 | dc2_conv7 | Conv2d | 43.9 K
63 | dc2_convo | Sequential | 905 K

10.3 M    Trainable params
0         Non-trainable params
10.3 M    Total params
41.243    Total estimated model params size (MB)
Restored all states from the checkpoint file at ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt
05/05/2023 19:26:45 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:45 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:27:02 - INFO: Loading 640 samples from FlyingChairs dataset.
05/05/2023 19:27:03 - INFO: Loading 640 samples from FlyingChairs dataset.
Epoch 10: 95%|█████████▍| 5558/5878 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    train(args)
  File "train.py", line 112, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
    self.fit_loop.run()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1637, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
    img_grid = self._make_image_grid(self.train_images)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
    grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero
Epoch 10: 95%|█████████▍| 5558/5878 [00:27<?, ?it/s]
Traceback (most recent call last):
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in <module>
    train(args)
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 112, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
    self.fit_loop.run()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1637, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
    img_grid = self._make_image_grid(self.train_images)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
    grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero

hmorimitsu commented 1 year ago

Which versions of pytorch and pytorch-lightning are you using?

nihalgupta84 commented 1 year ago

pytorch-lightning  1.6.0
torch              1.12.0
torch-scatter      2.1.1
torchmetrics       0.9.0
torchvision        0.13.0

hmorimitsu commented 1 year ago

Could you upgrade pytorch-lightning to version 1.7.7 and try to resume again?

As you can see from the error, it is trying to resume from the end of an epoch, instead of the beginning. If I remember correctly, this was related to the lightning version.

However, do not try to install the latest pytorch-lightning either, as I have not tested with newer versions yet.

nihalgupta84 commented 1 year ago

Upgraded pytorch-lightning to 1.7.7; now training starts from the beginning of the epoch, but the error is still the same:

Epoch 10: 0%| | 0/5878 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    train(args)
  File "train.py", line 112, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
    img_grid = self._make_image_grid(self.train_images)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
    grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in <module>
    train(args)
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 112, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 199, in on_train_epoch_end
    img_grid = self._make_image_grid(self.train_images)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
    grid = make_grid(imgs, len(imgs)//len(dl_images))
ZeroDivisionError: integer division or modulo by zero
Epoch 10: 0%| | 0/5878 [00:21<?, ?it/s]

hmorimitsu commented 1 year ago

Pushed a fix in #49 to check whether the outputs are empty; it should solve this problem.

Please pull the new version and try again.
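For reference, a guard along these lines is enough to avoid the ZeroDivisionError when nothing was collected for the grid (a sketch only, using the names from the traceback; the actual change in #49 may differ):

```python
from torchvision.utils import make_grid

def _make_image_grid_safe(imgs, dl_images):
    # Right after resuming, the logger callback can reach on_train_epoch_end
    # without having collected any images, so guard against empty inputs
    # before computing the number of grid rows.
    if len(imgs) == 0 or len(dl_images) == 0:
        return None
    return make_grid(imgs, nrow=max(1, len(imgs) // len(dl_images)))
```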

nihalgupta84 commented 1 year ago

Traceback (most recent call last):
  File "train.py", line 167, in <module>
    train(args)
  File "train.py", line 113, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 289, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/fairscale/optim/oss.py", line 232, in step
    loss = self.optim.step(closure=closure, **kwargs)  # type: ignore
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
    adamw(params_with_grad,
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
    func(params,
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tensor_adamw
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
Traceback (most recent call last):
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 167, in <module>
    train(args)
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 113, in train
    trainer.fit(model)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 289, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/fairscale/optim/oss.py", line 232, in step
    loss = self.optim.step(closure=closure, **kwargs)  # type: ignore
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 161, in step
    adamw(params_with_grad,
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 218, in adamw
    func(params,
  File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/optim/adamw.py", line 259, in _single_tensor_adamw
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
Epoch 11: 0%| | 0/5878 [00:23<?, ?it/s, loss=nan, v_num=4]

This error occurs only when we load a checkpoint to resume training or fine-tune.

hmorimitsu commented 1 year ago

It seems that this is a problem with pytorch 1.12.0. The solution seems to be to upgrade to 1.12.1. See more here:

https://github.com/pytorch/pytorch/issues/80809
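If upgrading torch is not an option, the workaround most often suggested in that issue is to flag the restored AdamW param groups as capturable after the checkpoint has been loaded (a sketch only; it has to run after the optimizer state is restored, for example from an on_train_start hook):

```python
def allow_cuda_state_steps(optimizer):
    # Workaround from pytorch/pytorch#80809 for torch 1.12.0: with
    # capturable=True, AdamW accepts the CUDA state_steps restored from the
    # checkpoint instead of tripping the assertion (at a small speed cost).
    for group in optimizer.param_groups:
        group["capturable"] = True

# Example placement inside the LightningModule (hypothetical hook body):
#     def on_train_start(self):
#         for opt in self.trainer.optimizers:
#             allow_cuda_state_steps(opt)
```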