Thank you for reporting, I'll check it later.
Just please notice that the training stage has not been tested, so there's no guarantee that the trained models will generate good results in the end.
Best,
I'll try to check every stage and will push the results.
But you need to check the flow estimator part.
It would be much appreciated if you could add some documentation about training.
I have tried training CRAFT by setting the batch_size manually, and it worked.
The PyTorch Lightning Trainer's built-in auto_batch_size is also not working.
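(For context, this is how the stock PL 1.x batch-size finder is driven; it only runs if the training script calls `trainer.tune()`, which may be why the flag has no effect here. `ToyModel` below is a hypothetical stand-in, not ptlflow code.)

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(pl.LightningModule):
    # Minimal stand-in model; the point is only the tune() call below.
    def __init__(self):
        super().__init__()
        self.batch_size = 2  # the batch-size finder mutates this attribute
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
        return DataLoader(ds, batch_size=self.batch_size)

# "power" doubles batch_size until the run goes out of memory.
# tune() is what actually executes the search; if a training script only
# calls trainer.fit(), the auto_scale_batch_size flag has no effect.
trainer = pl.Trainer(auto_scale_batch_size="power", max_epochs=1)
trainer.tune(ToyModel())
```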
> It would be much appreciated if you could add some documentation about training.
There is documentation at https://ptlflow.readthedocs.io/en/latest/starting/training.html.
Is there anything specific that you think is missing?
I have pushed a fix for the losses in those models you mentioned.
I hope it is working now, but if not, let me know.
Best,
While resuming the training, make_grid is throwing an error:
```
    img_grid = self._make_image_grid(self.train_images)
  File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/utils/callbacks/logger.py", line 446, in _make_image_grid
    grid = make_grid(imgs, len(imgs) // len(dl_images))
ZeroDivisionError: integer division or modulo by zero
```
One more error occurs while resuming the VCN model, because of an optimizer-weights issue:
```
Restoring states from the checkpoint path at /media/anil/New Volume1/Nihal/ptlflow/ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_last_epoch=6_step=77812.ckpt
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
05/04/2023 13:31:41 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
05/04/2023 13:31:41 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
05/04/2023 13:31:44 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/04/2023 13:31:44 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/04/2023 13:31:46 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters
05/04/2023 13:31:46 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters

10.3 M    Trainable params
0         Non-trainable params
10.3 M    Total params
41.243    Total estimated model params size (MB)
```
```
Traceback (most recent call last):
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in <module>
    [...]
    ModelCheckpoint.save_weights_only being set to True.'
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    [...]
    ModelCheckpoint.save_weights_only being set to True.'
```
Thank you. I'll take a look at make_grid later.
The resuming problem happens because your example is trying to resume from the "last" checkpoint, which does not contain the training states. To solve it, resume from the "train" checkpoint instead.
Hope it helps.
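(A quick way to verify which checkpoint is resumable: a full Lightning checkpoint carries the optimizer state, while a weights-only one does not. This is just a diagnostic sketch, reusing the two checkpoint filenames from the logs in this thread.)

```python
import torch

# A weights-only checkpoint (like the "last" one here) has no optimizer
# state, so Lightning cannot restore training from it; the "train"
# checkpoint should contain the full state.
for name in [
    "vcn_last_epoch=6_step=77812.ckpt",
    "vcn_train_epoch=10_step=61138.ckpt",
]:
    ckpt = torch.load(name, map_location="cpu")
    print(name, "->", "optimizer_states" in ckpt)
```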
Thanks for the quick reply.
I'm still facing the issue while resuming the training, for all models.
For the error from the make_grid function, I tried adding an exception handler, but then got more errors. Can you look into this?
```
Restoring states from the checkpoint path at ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
05/05/2023 19:26:38 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
05/05/2023 19:26:38 - WARNING: --train_crop_size is not set. It will be set as (320, 448).
05/05/2023 19:26:40 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:40 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:41 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters
05/05/2023 19:26:41 - INFO: ShardedDDP bucket size: 0.00M parameters, model size 9.83M parameters

10.3 M    Trainable params
0         Non-trainable params
10.3 M    Total params
41.243    Total estimated model params size (MB)

Restored all states from the checkpoint file at ptlflow_logs/vcn-chairs/lightning_logs/version_0/checkpoints/vcn_train_epoch=10_step=61138.ckpt
05/05/2023 19:26:45 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:26:45 - INFO: Loading 22232 samples from FlyingChairs dataset.
05/05/2023 19:27:02 - INFO: Loading 640 samples from FlyingChairs dataset.
05/05/2023 19:27:03 - INFO: Loading 640 samples from FlyingChairs dataset.
Epoch 10:  95%|█████████▌| 5558/5878 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 153, in <module>
Traceback (most recent call last):
  File "/media/anil/New Volume1/Nihal/ptlflow/train.py", line 153, in <module>
```
Which versions of pytorch and pytorch-lightning are you using?
```
pytorch-lightning  1.6.0
torch              1.12.0
torch-scatter      2.1.1
torchmetrics       0.9.0
torchvision        0.13.0
```
Could you upgrade pytorch-lightning to version 1.7.7 and try to resume again?
As you can see from the error, it is trying to resume from the end of an epoch, instead of the beginning. If I remember correctly, this was related to the lightning version.
However, do not try to install the latest pytorch-lightning either, as I have not tested with newer versions yet.
I upgraded pytorch-lightning to 1.7.7. Now training starts from the beginning of the epoch, but the error is still the same:
```
Epoch 10:   0%|          | 0/5878 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 153, in <module>
```
I pushed a fix in #49 that checks whether the outputs are empty; it should solve this problem.
Please pull the new version and try again.
```
Traceback (most recent call last):
  File "train.py", line 167, in <module>
```
This error originates only when we are loading checkpoints to resume the training or to fine-tune.
It seems that this is a problem with pytorch 1.12.0. The solution seems to be to upgrade to 1.12.1. See more here:
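(A small sanity check for the environment, using only the version numbers suggested in this thread; just a sketch.)

```python
import pytorch_lightning as pl
import torch

# Mid-epoch resume behavior was tied to the lightning version above, and
# the checkpoint-loading failure to torch 1.12.0; check both at runtime.
assert pl.__version__.startswith("1.7"), pl.__version__  # suggested: 1.7.7
assert not torch.__version__.startswith("1.12.0"), torch.__version__  # use 1.12.1
```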
File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure step_output = self._step_fn() File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 427, in _training_step training_step_output = self.trainer._call_strategy_hook("training_step", step_kwargs.values()) File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1766, in _call_strategy_hook output = fn(args, kwargs) File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 333, in training_step return self.model.training_step(*args, *kwargs) File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/models/base_model/base_model.py", line 229, in training_step loss = self.loss_fn(preds, batch) File "/home/anil/miniconda3/envs/ptlflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(input, kwargs) File "/media/anil/New Volume1/Nihal/ptlflow/ptlflow/models/gmflow/gmflow.py", line 40, in forward flow_loss += i_weight (valid[:, None] i_loss).mean() RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 2 Epoch 0: 0%| | 0/30712 [00:08<?, ?it/s]
Can you please check the code?
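(For reference, the RuntimeError is a plain broadcasting mismatch at the `valid[:, None] * i_loss` line. The shapes below are illustrative placeholders chosen only to reproduce the same message; they are not the actual gmflow tensor shapes.)

```python
import torch

# Illustrative shapes only: valid[:, None] becomes (2, 1, 4, 4) and cannot
# broadcast against a loss of shape (2, 2, 2, 4), since dim 2 is 4 vs 2.
valid = torch.ones(2, 4, 4)      # stand-in validity mask (B, H, W)
i_loss = torch.ones(2, 2, 2, 4)  # stand-in per-scale loss tensor
i_weight = 0.8

flow_loss = i_weight * (valid[:, None] * i_loss).mean()
# RuntimeError: The size of tensor a (4) must match the size of tensor b (2)
# at non-singleton dimension 2
```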