hmorimitsu / ptlflow

PyTorch Lightning Optical Flow models, scripts, and pretrained weights.

CCMR #70

Open. Harsh188 opened this issue 2 weeks ago

Harsh188 commented 2 weeks ago

The implementation of CCMR in ptlflow takes up significantly more GPU memory than what is stated in the CCMR paper.

For 160x120 images, training requires more than 24 GB of VRAM, which is inconsistent with the paper.
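
For reference, a minimal sketch of how peak VRAM can be measured around a single training step (model and batch here are placeholders for your own objects, not ptlflow API):

import torch

torch.cuda.reset_peak_memory_stats()
# run one forward/backward pass here, e.g.:
# loss = model.training_step(batch, 0)["loss"]; loss.backward()
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")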

Log:

[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 169, in step
[rank0]:     step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 280, in optimizer_step
[rank0]:     optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 234, in optimizer_step
[rank0]:     return self.precision_plugin.optimizer_step(
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 119, in optimizer_step
[rank0]:     return optimizer.step(closure=closure, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/optim/adamw.py", line 165, in step
[rank0]:     loss = closure()
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 105, in _wrap_closure
[rank0]:     closure_result = closure()
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/optimizer_loop.py", line 149, in __call__
[rank0]:     self._result = self.closure(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/optimizer_loop.py", line 135, in closure
[rank0]:     step_output = self._step_fn()
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/optimizer_loop.py", line 419, in _training_step
[rank0]:     training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1494, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 351, in training_step
[rank0]:     return self.model(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/lightning/pytorch/overrides/base.py", line 98, in forward
[rank0]:     output = self._forward_module.training_step(*inputs, **kwargs)
[rank0]:   File "/home/hmohan/Programming/T-FlowModels/ptlflow/ptlflow/models/base_model/base_model.py", line 410, in training_step
[rank0]:     preds = self(batch)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/hmohan/Programming/T-FlowModels/ptlflow/ptlflow/models/ccmr/ccmr.py", line 301, in forward
[rank0]:     corr = corr_fn(coords1)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/hmohan/Programming/T-FlowModels/ptlflow/ptlflow/utils/correlation.py", line 505, in forward
[rank0]:     corr = iter_translated_spatial_correlation_sample(
[rank0]:   File "/home/hmohan/Programming/T-FlowModels/ptlflow/ptlflow/utils/correlation.py", line 345, in iter_translated_spatial_correlation_sample
[rank0]:     corr[:, i // dilation_patch[0], j // dilation_patch[1]] = (input1 * p2).sum(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 
Epoch 0:   0%|          | 0/993 [00:02<?, ?it/s]  
hmorimitsu commented 2 weeks ago

Hello, thank you for raising this issue.

First, please note that training is not well supported in PTLFlow, and I cannot guarantee you will get good results with our generic training script. Moreover, I believe CCMR did not open-source their official training code, so I do not know the specific settings they used for training.

Regarding your memory problem, I think the main issue is that you are using the correlation block from ptlflow.utils.correlation.IterativeCorrBlock, which is just a simple, non-optimized iterative implementation. To match the original model, you should first compile the alt_cuda_corr package (see ptlflow/utils/external/alt_cuda_corr). With alt_cuda_corr, you should get faster speeds and lower memory consumption. I hope this solves your problem, but again, I cannot guarantee you will match the paper.
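
As a quick sanity check, a minimal sketch for verifying that the compiled extension is visible to Python (the compile step itself is typically something like python setup.py install run inside ptlflow/utils/external/alt_cuda_corr; check the docs for the exact command):

try:
    import alt_cuda_corr  # the compiled CUDA extension
    print("alt_cuda_corr available at:", alt_cuda_corr.__file__)
except ImportError:
    print("alt_cuda_corr not found; the slower iterative correlation will be used.")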

If there are still other issues, just let me know.

Best.

Harsh188 commented 2 weeks ago

I was able to compile alt_cuda_corr, and I checked to ensure that I can import it as well.

It's not just that the memory usage doesn't match the paper; it's nowhere near the results demonstrated in the paper.

(ptlflow) hmohan@WAN:~/Programming/T-FlowModels/ptlflow$ python train.py ccmr --train_dataset kaist --lr 0.0001 --train_batch_size 2 --max_epochs 5 --gpus 2 --no_alternate_corr 
###########################################################################
# WARNING, please read!                                                   #
#                                                                         #
# This training script has not been tested!                               #
# Therefore, there is no guarantee that a model trained with this script  #
# will produce good results after the training!                           #
#                                                                         #
# You can find more information at                                        #
# https://ptlflow.readthedocs.io/en/latest/starting/training.html         #
###########################################################################
Global seed set to 1234
Cross-att XCA
Cross-att XCA
Cross-att XCA
Self-att XCiT
Self-att XCiT
Self-att XCiT
> /home/hmohan/Programming/T-FlowModels/ptlflow/ptlflow/models/ccmr/ccmr.py(165)__init__()
-> if self.args.alternate_corr and alt_cuda_corr is None:
(Pdb) alt_cuda_corr
<module 'alt_cuda_corr' from '/home/hmohan/miniconda3/envs/ptlflow/lib/python3.10/site-packages/correlation-0.0.0-py3.10-linux-x86_64.egg/alt_cuda_corr.cpython-310-x86_64-linux-gnu.so'>
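
For context, a hedged paraphrase of the guard shown in the pdb session above (the Args class and the flag behavior are assumptions for illustration; the real ccmr.py code may differ). If --no_alternate_corr sets args.alternate_corr to False, the compiled extension would not be used even though it imports successfully:

try:
    import alt_cuda_corr
except ImportError:
    alt_cuda_corr = None  # extension not compiled

class Args:
    # assumption: passing --no_alternate_corr sets this to False
    alternate_corr = False

args = Args()
# mirrors the guard seen at ccmr.py line 165 in the pdb session
if args.alternate_corr and alt_cuda_corr is None:
    print("alternate_corr requested, but alt_cuda_corr is not compiled")
use_fast_kernel = args.alternate_corr and alt_cuda_corr is not None
print("using compiled alt_cuda_corr kernel:", use_fast_kernel)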