openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0

[Task]: Allow arbitrary image sizes for EfficientAD #1528

Closed holzweber closed 9 months ago

holzweber commented 11 months ago

What is the motivation for this task?

As I need to keep the aspect ratio of my input images, I would like to feed images of arbitrary size into the EfficientAD model. Currently, only images with width = height seem to work, and even then not all of them. These are the sizes I have tried so far (width = height):

- (1024, 1024) → works
- (1000, 1000) → works
- (512, 512) → works
- (912, 912) → does not work
- (1063, 1063) → does not work

Describe the solution you'd like

Set an arbitrary input size for images in the configuration.yaml file for EfficientAD

Additional context

Error message when using shapes that did not work:

8.1 M     Trainable params
0         Non-trainable params
8.1 M     Total params
32.235    Total estimated model params size (MB)

Epoch 0:   0%  0/12 [00:00<?, ?it/s]
Calculate teacher channel mean & std: 100% [00:13<00:00, 1.34s/it]

RuntimeError                              Traceback (most recent call last)
Cell In[7], line 2
      1 trainer = Trainer(**config.trainer, logger=experiment_logger, callbacks=callbacks)
----> 2 trainer.fit(model=model, datamodule=datamodule)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:608, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
--> 608 call._call_and_handle_interrupt(self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\call.py:38, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
--> 38 return trainer_fn(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:650, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
--> 650 self._run(model, ckpt_path=self.ckpt_path)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:1112, in Trainer._run(self, model, ckpt_path)
--> 1112 results = self._run_stage()

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:1191, in Trainer._run_stage(self)
--> 1191 self._run_train()

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:1214, in Trainer._run_train(self)
--> 1214 self.fit_loop.run()

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\loop.py:199, in Loop.run(self, *args, **kwargs)
--> 199 self.advance(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\fit_loop.py:267, in FitLoop.advance(self)
--> 267 self._outputs = self.epoch_loop.run(self._data_fetcher)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\loop.py:199, in Loop.run(self, *args, **kwargs)
--> 199 self.advance(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py:213, in TrainingEpochLoop.advance(self, data_fetcher)
--> 213 batch_output = self.batch_loop.run(kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\loop.py:199, in Loop.run(self, *args, **kwargs)
--> 199 self.advance(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py:88, in TrainingBatchLoop.advance(self, kwargs)
--> 88 outputs = self.optimizer_loop.run(optimizers, kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\loop.py:199, in Loop.run(self, *args, **kwargs)
--> 199 self.advance(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:202, in OptimizerLoop.advance(self, optimizers, kwargs)
--> 202 result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:249, in OptimizerLoop._run_optimization(self, kwargs, optimizer)
--> 249 self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:370, in OptimizerLoop._optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
--> 370 self.trainer._call_lightning_module_hook("optimizer_step", self.trainer.current_epoch, batch_idx, optimizer, opt_idx, train_step_and_backward_closure, on_tpu=isinstance(self.trainer.accelerator, TPUAccelerator), **kwargs, using_lbfgs=is_lbfgs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:1356, in Trainer._call_lightning_module_hook(self, hook_name, *args, pl_module=None, **kwargs)
--> 1356 output = fn(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\core\module.py:1742, in LightningModule.optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_lbfgs)
--> 1742 optimizer.step(closure=optimizer_closure)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\core\optimizer.py:169, in LightningOptimizer.step(self, closure, **kwargs)
--> 169 step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\strategies\strategy.py:234, in Strategy.optimizer_step(self, optimizer, opt_idx, closure, model, **kwargs)
--> 234 return self.precision_plugin.optimizer_step(optimizer, model=model, optimizer_idx=opt_idx, closure=closure, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\plugins\precision\precision_plugin.py:119, in PrecisionPlugin.optimizer_step(self, optimizer, model, optimizer_idx, closure, **kwargs)
--> 119 return optimizer.step(closure=closure, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\torch\optim\lr_scheduler.py:68, in LRScheduler.__init__.<locals>.with_counter.<locals>.wrapper(*args, **kwargs)
--> 68 return wrapped(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\torch\optim\optimizer.py:373, in Optimizer.profile_hook_step.<locals>.wrapper(*args, **kwargs)
--> 373 out = func(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\torch\optim\optimizer.py:76, in _use_grad_for_differentiable.<locals>._use_grad(self, *args, **kwargs)
--> 76 ret = func(self, *args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\torch\optim\adam.py:143, in Adam.step(self, closure)
--> 143 loss = closure()

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\plugins\precision\precision_plugin.py:105, in PrecisionPlugin._wrap_closure(self, model, optimizer, optimizer_idx, closure)
--> 105 closure_result = closure()

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:149, in Closure.__call__(self, *args, **kwargs)
--> 149 self._result = self.closure(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:135, in Closure.closure(self, *args, **kwargs)
--> 135 step_output = self._step_fn()

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:419, in OptimizerLoop._training_step(self, kwargs)
--> 419 training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\trainer\trainer.py:1494, in Trainer._call_strategy_hook(self, hook_name, *args, **kwargs)
--> 1494 output = fn(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\pytorch_lightning\strategies\strategy.py:378, in Strategy.training_step(self, *args, **kwargs)
--> 378 return self.model.training_step(*args, **kwargs)

File D:\anomalib\src\anomalib\models\efficient_ad\lightning_model.py:246, in EfficientAd.training_step(failed resolving arguments)
    243 self.imagenet_iterator = iter(self.imagenet_loader)
    244 batch_imagenet = next(self.imagenet_iterator)[0]["image"].to(self.device)
--> 246 loss_st, loss_ae, loss_stae = self.model(batch=batch["image"], batch_imagenet=batch_imagenet)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
--> 1518 return self._call_impl(*args, **kwargs)

File C:\Applications\Miniconda\envs\epp_next\lib\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
--> 1527 return forward_call(*args, **kwargs)

File D:\anomalib\src\anomalib\models\efficient_ad\torch_model.py:328, in EfficientAdModel.forward(self, batch, batch_imagenet, normalize)
    324 teacher_output_aug = (teacher_output_aug - self.mean_std["mean"]) / self.mean_std["std"]
    326 student_output_ae_aug = self.student(aug_img)[:, self.teacher_out_channels :, :, :]
--> 328 distance_ae = torch.pow(teacher_output_aug - ae_output_aug, 2)

RuntimeError: The size of tensor a (283) must match the size of tensor b (282) at non-singleton dimension 3

blaz-r commented 11 months ago

Hello. Are you using the latest version of Anomalib? This seems a lot like #1352, which I thought was fixed by #1355. I think it is still possible to hit dimensions that don't work: the shapes are repeatedly halved, which causes issues as soon as an odd number appears.

holzweber commented 11 months ago

Hey! Yes, I just cloned the main branch and it did not work. I was hoping that #1355 would have solved this, but somehow it still does not. I also thought of using only even numbers because of the factor-of-two issue; however, even sizes like (912, 912) do not work for me either 😕

blaz-r commented 11 months ago

That's unfortunate. Regarding odd numbers, it's not only the initial dimension that matters, but what you get further down the line: 912 → 456 → 228 → 114 → 57, and 57 is no longer divisible by 2. That means the fifth convolution can cause problems in the autoencoder.
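For reference, here is a tiny standalone sketch (plain Python, not anomalib code) that counts how many times a dimension can be halved before it turns odd:

```python
def halvings_before_odd(size: int) -> int:
    """Count how often `size` can be divided by 2 before it becomes odd."""
    count = 0
    while size % 2 == 0:
        size //= 2
        count += 1
    return count

for s in (512, 912, 1000, 1024, 1063):
    print(s, "->", halvings_before_odd(s))
# 512 -> 9, 912 -> 4, 1000 -> 3, 1024 -> 10, 1063 -> 0
```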

holzweber commented 11 months ago

> That's unfortunate. Regarding odd numbers, it's not only the initial dimension that matters, but what you get further down the line: 912 → 456 → 228 → 114 → 57, and 57 is no longer divisible by 2. That means the fifth convolution can cause problems in the autoencoder.

Ah yes, true! Unfortunately, even if we used only powers of two, the images would get too large: sometimes we need at least, e.g., 1160 pixels, which would then be rounded up to a size of 2048.

blaz-r commented 11 months ago

Yeah, that is quite unfortunate with models that require matching feature shapes from different architectures (the AE and the CNN in this case). But I think it doesn't necessarily need to be a power of 2; it just needs to be divisible by 2 as many times as there are downscaling convolutions in the model. E.g. 1216 is divisible by 2 six times, which should suffice given that the EfficientAD AE has 6 convolutions (the last one even has stride 1, so I'd figure 5 divisions would be enough).
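That rule is easy to state as code. A minimal sketch, assuming each downscaling convolution halves the spatial size exactly (an assumption the next comments call into question):

```python
def enough_halvings(size: int, n_downscale: int) -> bool:
    """True if `size` stays even through `n_downscale` halvings."""
    return size % (2 ** n_downscale) == 0

print(enough_halvings(1216, 6))  # True:  1216 = 2**6 * 19
print(enough_halvings(912, 5))   # False: 912  = 2**4 * 57
```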

holzweber commented 11 months ago

> Yeah, that is quite unfortunate with models that require matching feature shapes from different architectures (the AE and the CNN in this case). But I think it doesn't necessarily need to be a power of 2; it just needs to be divisible by 2 as many times as there are downscaling convolutions in the model. E.g. 1216 is divisible by 2 six times, which should suffice given that the EfficientAD AE has 6 convolutions (the last one even has stride 1, so I'd figure 5 divisions would be enough).

I get your point, but I also tried size (1000, 1000), which is only divisible by 2 three times before reaching 125, and it worked.

blaz-r commented 11 months ago

Yeah, looking at the code, the CNN does two stride-2 average poolings, meaning the image is downscaled by a factor of 4. Since the decoder part of the AE uses interpolation to upsample and sets the size on the last layer to image / 4, what I said shouldn't cause problems. I'm then not sure what the cause would be; maybe the padding is wrong, but I can't check that right now.
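Padding could indeed explain it. A small sketch of that suspicion, where the kernel/stride/padding values are assumptions rather than the actual PDN parameters: a stride-2 layer with "same"-style padding outputs ceil(H / 2) instead of floor(H / 2), while a floor-based target size like int(H / 4) rounds the other way, so the two paths can disagree by one whenever H is not divisible by 4:

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    # Standard convolution output-size formula; with k=3, s=2, p=1 it equals ceil(size / 2).
    return (size + 2 * padding - kernel) // stride + 1

for h in (1000, 1063, 1131):
    encoder_size = conv_out(conv_out(h))  # two downscaling stages -> ceil(h / 4)
    decoder_size = int(h / 4)             # floor-based target size
    print(h, encoder_size, decoder_size)
# 1000 -> 250 vs 250 (match); 1063 -> 266 vs 265; 1131 -> 283 vs 282,
# the same off-by-one as in the traceback above (illustrative sizes only)
```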

holzweber commented 11 months ago

> Yeah, looking at the code, the CNN does two stride-2 average poolings, meaning the image is downscaled by a factor of 4. Since the decoder part of the AE uses interpolation to upsample and sets the size on the last layer to image / 4, what I said shouldn't cause problems. I'm then not sure what the cause would be; maybe the padding is wrong, but I can't check that right now.

I checked the self.last_upsample calculation and did some tests, comparing the output sizes for arbitrary image shapes. If you add 3 to the image height/width in this method, the shapes match. I created a pull request for this, maybe you want to check it.
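A minimal sketch of that change, with the function and variable names taken from the discussion rather than copied from the actual PR:

```python
def last_upsample_size(image_size: int) -> int:
    # Proposed fix: add 3 before the floor division so the result rounds up.
    return int((image_size + 3) / 4)  # previously: int(image_size / 4)
```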

blaz-r commented 11 months ago

Interesting. But that all seems a bit hacky, doesn't it?

holzweber commented 11 months ago

> Interesting. But that all seems a bit hacky, doesn't it?

Yep. However, I think the root of the problem is this calculation in the decoder network. I would not make any changes to the PDN network.

Another solution, which I still need to test, is to replace the int() (i.e. floor) operation with a ceiling. I think this should give the same result as the +3 fix, since int() essentially performs a floor operation on positive numbers. I will do some checks later.

Edit: I checked it - all we need to change in the self.last_upsample calculation is to round up to the next integer (ceil) instead of rounding down (floor) after the division by 4.
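The two fixes agree because, for a positive integer h, ceil(h / 4) == (h + 3) // 4. A quick check in plain Python:

```python
import math

for h in range(1, 10_000):
    assert math.ceil(h / 4) == (h + 3) // 4  # the "+3" trick is ceil in disguise
```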