mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Ketos -r (deterministic behavior) returns RuntimeError #479

Closed: alix-tz closed this issue 1 year ago

alix-tz commented 1 year ago

On Kraken 4.3.7, using this command:

I get this:

Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ROCQ/almanach/achague/.local/bin/ketos:8 in <module>                                       │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/click/core.py:1130 in __call__    │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/click/core.py:1055 in main        │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/click/core.py:1657 in invoke      │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/click/core.py:1404 in invoke      │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/click/core.py:760 in invoke       │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/click/decorators.py:26 in         │
│ new_func                                                                                         │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/kraken/ketos/recognition.py:313   │
│ in train                                                                                         │
│                                                                                                  │
│   310 │   │   │   │   │   │   │   log_dir=log_dir,                                               │
│   311 │   │   │   │   │   │   │   **val_check_interval)                                          │
│   312 │   try:                                                                                   │
│ ❱ 313 │   │   trainer.fit(model)                                                                 │
│   314 │   except KrakenInputException as e:                                                      │
│   315 │   │   if e.args[0].startswith('Training data and model codec alphabets mismatch') and    │
│   316 │   │   │   raise click.BadOptionUsage('resize', 'Mismatched training data for loaded mo   │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/kraken/lib/train.py:113 in fit    │
│                                                                                                  │
│    110 │   │   with warnings.catch_warnings():                                                   │
│    111 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,                │
│    112 │   │   │   │   │   │   │   │   │   message='The dataloader,')                            │
│ ❱  113 │   │   │   super().fit(*args, **kwargs)                                                  │
│    114                                                                                           │
│    115                                                                                           │
│    116 class KrakenFreezeBackbone(BaseFinetuning):                                               │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:608 in fit                                                                                   │
│                                                                                                  │
│    605 │   │   """                                                                               │
│    606 │   │   model = self._maybe_unwrap_optimized(model)                                       │
│    607 │   │   self.strategy._lightning_module = model                                           │
│ ❱  608 │   │   call._call_and_handle_interrupt(                                                  │
│    609 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    610 │   │   )                                                                                 │
│    611                                                                                           │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py │
│ :38 in _call_and_handle_interrupt                                                                │
│                                                                                                  │
│   35 │   │   if trainer.strategy.launcher is not None:                                           │
│   36 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,     │
│   37 │   │   else:                                                                               │
│ ❱ 38 │   │   │   return trainer_fn(*args, **kwargs)                                              │
│   39 │                                                                                           │
│   40 │   except _TunerExitException:                                                             │
│   41 │   │   trainer._call_teardown_hook()                                                       │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:650 in _fit_impl                                                                             │
│                                                                                                  │
│    647 │   │   │   model_provided=True,                                                          │
│    648 │   │   │   model_connected=self.lightning_module is not None,                            │
│    649 │   │   )                                                                                 │
│ ❱  650 │   │   self._run(model, ckpt_path=self.ckpt_path)                                        │
│    651 │   │                                                                                     │
│    652 │   │   assert self.state.stopped                                                         │
│    653 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:1112 in _run                                                                                 │
│                                                                                                  │
│   1109 │   │                                                                                     │
│   1110 │   │   self._checkpoint_connector.resume_end()                                           │
│   1111 │   │                                                                                     │
│ ❱ 1112 │   │   results = self._run_stage()                                                       │
│   1113 │   │                                                                                     │
│   1114 │   │   log.detail(f"{self.__class__.__name__}: trainer tearing down")                    │
│   1115 │   │   self._teardown()                                                                  │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:1191 in _run_stage                                                                           │
│                                                                                                  │
│   1188 │   │   │   return self._run_evaluate()                                                   │
│   1189 │   │   if self.predicting:                                                               │
│   1190 │   │   │   return self._run_predict()                                                    │
│ ❱ 1191 │   │   self._run_train()                                                                 │
│   1192 │                                                                                         │
│   1193 │   def _pre_training_routine(self) -> None:                                              │
│   1194 │   │   # wait for all to join if on distributed                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:1214 in _run_train                                                                           │
│                                                                                                  │
│   1211 │   │   self.fit_loop.trainer = self                                                      │
│   1212 │   │                                                                                     │
│   1213 │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                     │
│ ❱ 1214 │   │   │   self.fit_loop.run()                                                           │
│   1215 │                                                                                         │
│   1216 │   def _run_evaluate(self) -> _EVALUATE_OUTPUT:                                          │
│   1217 │   │   assert self.evaluating                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py:1 │
│ 99 in run                                                                                        │
│                                                                                                  │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│   198 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 199 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   200 │   │   │   │   self.on_advance_end()                                                      │
│   201 │   │   │   │   self._restarting = False                                                   │
│   202 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop. │
│ py:267 in advance                                                                                │
│                                                                                                  │
│   264 │   │   assert self._data_fetcher is not None                                              │
│   265 │   │   self._data_fetcher.setup(dataloader, batch_to_device=batch_to_device)              │
│   266 │   │   with self.trainer.profiler.profile("run_training_epoch"):                          │
│ ❱ 267 │   │   │   self._outputs = self.epoch_loop.run(self._data_fetcher)                        │
│   268 │                                                                                          │
│   269 │   def on_advance_end(self) -> None:                                                      │
│   270 │   │   # inform logger the batch loop has finished                                        │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py:1 │
│ 99 in run                                                                                        │
│                                                                                                  │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│   198 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 199 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   200 │   │   │   │   self.on_advance_end()                                                      │
│   201 │   │   │   │   self._restarting = False                                                   │
│   202 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/tra │
│ ining_epoch_loop.py:213 in advance                                                               │
│                                                                                                  │
│   210 │   │   │   self.batch_progress.increment_started()                                        │
│   211 │   │   │                                                                                  │
│   212 │   │   │   with self.trainer.profiler.profile("run_training_batch"):                      │
│ ❱ 213 │   │   │   │   batch_output = self.batch_loop.run(kwargs)                                 │
│   214 │   │                                                                                      │
│   215 │   │   self.batch_progress.increment_processed()                                          │
│   216                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py:1 │
│ 99 in run                                                                                        │
│                                                                                                  │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│   198 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 199 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   200 │   │   │   │   self.on_advance_end()                                                      │
│   201 │   │   │   │   self._restarting = False                                                   │
│   202 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/batch/tra │
│ ining_batch_loop.py:88 in advance                                                                │
│                                                                                                  │
│    85 │   │   │   optimizers = _get_active_optimizers(                                           │
│    86 │   │   │   │   self.trainer.optimizers, self.trainer.optimizer_frequencies, kwargs.get(   │
│    87 │   │   │   )                                                                              │
│ ❱  88 │   │   │   outputs = self.optimizer_loop.run(optimizers, kwargs)                          │
│    89 │   │   else:                                                                              │
│    90 │   │   │   outputs = self.manual_loop.run(kwargs)                                         │
│    91 │   │   if outputs:                                                                        │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py:1 │
│ 99 in run                                                                                        │
│                                                                                                  │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│   198 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 199 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   200 │   │   │   │   self.on_advance_end()                                                      │
│   201 │   │   │   │   self._restarting = False                                                   │
│   202 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/optimizat │
│ ion/optimizer_loop.py:202 in advance                                                             │
│                                                                                                  │
│   199 │   def advance(self, optimizers: List[Tuple[int, Optimizer]], kwargs: OrderedDict) -> N   │
│   200 │   │   kwargs = self._build_kwargs(kwargs, self.optimizer_idx, self._hiddens)             │
│   201 │   │                                                                                      │
│ ❱ 202 │   │   result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.opt   │
│   203 │   │   if result.loss is not None:                                                        │
│   204 │   │   │   # automatic optimization assumes a loss needs to be returned for extras to b   │
│   205 │   │   │   # would be skipped otherwise                                                   │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/optimizat │
│ ion/optimizer_loop.py:249 in _run_optimization                                                   │
│                                                                                                  │
│   246 │   │   # gradient update with accumulated gradients                                       │
│   247 │   │   else:                                                                              │
│   248 │   │   │   # the `batch_idx` is optional with inter-batch parallelism                     │
│ ❱ 249 │   │   │   self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure   │
│   250 │   │                                                                                      │
│   251 │   │   result = closure.consume_result()                                                  │
│   252                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/optimizat │
│ ion/optimizer_loop.py:370 in _optimizer_step                                                     │
│                                                                                                  │
│   367 │   │   │   │   " return True."                                                            │
│   368 │   │   │   )                                                                              │
│   369 │   │   │   kwargs["using_native_amp"] = isinstance(self.trainer.precision_plugin, Mixed   │
│ ❱ 370 │   │   self.trainer._call_lightning_module_hook(                                          │
│   371 │   │   │   "optimizer_step",                                                              │
│   372 │   │   │   self.trainer.current_epoch,                                                    │
│   373 │   │   │   batch_idx,                                                                     │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:1356 in _call_lightning_module_hook                                                          │
│                                                                                                  │
│   1353 │   │   pl_module._current_fx_name = hook_name                                            │
│   1354 │   │                                                                                     │
│   1355 │   │   with self.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{ho  │
│ ❱ 1356 │   │   │   output = fn(*args, **kwargs)                                                  │
│   1357 │   │                                                                                     │
│   1358 │   │   # restore current_fx when nested context                                          │
│   1359 │   │   pl_module._current_fx_name = prev_fx_name                                         │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/kraken/lib/train.py:639 in        │
│ optimizer_step                                                                                   │
│                                                                                                  │
│    636 │   │   │   │   │      optimizer_closure, on_tpu=False, using_native_amp=False,           │
│    637 │   │   │   │   │      using_lbfgs=False):                                                │
│    638 │   │   # update params                                                                   │
│ ❱  639 │   │   optimizer.step(closure=optimizer_closure)                                         │
│    640 │   │                                                                                     │
│    641 │   │   # linear warmup between 0 and the initial learning rate `lrate` in `warmup`       │
│    642 │   │   # steps.                                                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/core/optimizer. │
│ py:169 in step                                                                                   │
│                                                                                                  │
│   166 │   │   │   raise MisconfigurationException("When `optimizer.step(closure)` is called, t   │
│   167 │   │                                                                                      │
│   168 │   │   assert self._strategy is not None                                                  │
│ ❱ 169 │   │   step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx   │
│   170 │   │                                                                                      │
│   171 │   │   self._on_after_step()                                                              │
│   172                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/stra │
│ tegy.py:234 in optimizer_step                                                                    │
│                                                                                                  │
│   231 │   │   model = model or self.lightning_module                                             │
│   232 │   │   # TODO(fabric): remove assertion once strategy's optimizer_step typing is fixed    │
│   233 │   │   assert isinstance(model, pl.LightningModule)                                       │
│ ❱ 234 │   │   return self.precision_plugin.optimizer_step(                                       │
│   235 │   │   │   optimizer, model=model, optimizer_idx=opt_idx, closure=closure, **kwargs       │
│   236 │   │   )                                                                                  │
│   237                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/precisi │
│ on/precision_plugin.py:119 in optimizer_step                                                     │
│                                                                                                  │
│   116 │   ) -> Any:                                                                              │
│   117 │   │   """Hook to run the optimizer step."""                                              │
│   118 │   │   closure = partial(self._wrap_closure, model, optimizer, optimizer_idx, closure)    │
│ ❱ 119 │   │   return optimizer.step(closure=closure, **kwargs)                                   │
│   120 │                                                                                          │
│   121 │   def _track_grad_norm(self, trainer: "pl.Trainer") -> None:                             │
│   122 │   │   if trainer.track_grad_norm == -1:                                                  │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/torch/optim/optimizer.py:140 in   │
│ wrapper                                                                                          │
│                                                                                                  │
│   137 │   │   │   │   obj, *_ = args                                                             │
│   138 │   │   │   │   profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)     │
│   139 │   │   │   │   with torch.autograd.profiler.record_function(profile_name):                │
│ ❱ 140 │   │   │   │   │   out = func(*args, **kwargs)                                            │
│   141 │   │   │   │   │   obj._optimizer_step_code()                                             │
│   142 │   │   │   │   │   return out                                                             │
│   143                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/torch/optim/optimizer.py:23 in    │
│ _use_grad                                                                                        │
│                                                                                                  │
│    20 │   │   prev_grad = torch.is_grad_enabled()                                                │
│    21 │   │   try:                                                                               │
│    22 │   │   │   torch.set_grad_enabled(self.defaults['differentiable'])                        │
│ ❱  23 │   │   │   ret = func(self, *args, **kwargs)                                              │
│    24 │   │   finally:                                                                           │
│    25 │   │   │   torch.set_grad_enabled(prev_grad)                                              │
│    26 │   │   return ret                                                                         │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/torch/optim/adam.py:183 in step   │
│                                                                                                  │
│   180 │   │   loss = None                                                                        │
│   181 │   │   if closure is not None:                                                            │
│   182 │   │   │   with torch.enable_grad():                                                      │
│ ❱ 183 │   │   │   │   loss = closure()                                                           │
│   184 │   │                                                                                      │
│   185 │   │   for group in self.param_groups:                                                    │
│   186 │   │   │   params_with_grad = []                                                          │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/precisi │
│ on/precision_plugin.py:105 in _wrap_closure                                                      │
│                                                                                                  │
│   102 │   │   The closure (generally) runs ``backward`` so this allows inspecting gradients in   │
│   103 │   │   consistent with the ``PrecisionPlugin`` subclasses that cannot pass ``optimizer.   │
│   104 │   │   """                                                                                │
│ ❱ 105 │   │   closure_result = closure()                                                         │
│   106 │   │   self._after_closure(model, optimizer, optimizer_idx)                               │
│   107 │   │   return closure_result                                                              │
│   108                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/optimizat │
│ ion/optimizer_loop.py:149 in __call__                                                            │
│                                                                                                  │
│   146 │   │   return step_output                                                                 │
│   147 │                                                                                          │
│   148 │   def __call__(self, *args: Any, **kwargs: Any) -> Optional[Tensor]:                     │
│ ❱ 149 │   │   self._result = self.closure(*args, **kwargs)                                       │
│   150 │   │   return self._result.loss                                                           │
│   151                                                                                            │
│   152                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/optimizat │
│ ion/optimizer_loop.py:144 in closure                                                             │
│                                                                                                  │
│   141 │   │   │   self._zero_grad_fn()                                                           │
│   142 │   │                                                                                      │
│   143 │   │   if self._backward_fn is not None and step_output.closure_loss is not None:         │
│ ❱ 144 │   │   │   self._backward_fn(step_output.closure_loss)                                    │
│   145 │   │                                                                                      │
│   146 │   │   return step_output                                                                 │
│   147                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/loops/optimizat │
│ ion/optimizer_loop.py:305 in backward_fn                                                         │
│                                                                                                  │
│   302 │   │   │   return None                                                                    │
│   303 │   │                                                                                      │
│   304 │   │   def backward_fn(loss: Tensor) -> None:                                             │
│ ❱ 305 │   │   │   self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)         │
│   306 │   │                                                                                      │
│   307 │   │   return backward_fn                                                                 │
│   308                                                                                            │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer │
│ .py:1494 in _call_strategy_hook                                                                  │
│                                                                                                  │
│   1491 │   │   │   return                                                                        │
│   1492 │   │                                                                                     │
│   1493 │   │   with self.profiler.profile(f"[Strategy]{self.strategy.__class__.__name__}.{hook_  │
│ ❱ 1494 │   │   │   output = fn(*args, **kwargs)                                                  │
│   1495 │   │                                                                                     │
│   1496 │   │   # restore current_fx when nested context                                          │
│   1497 │   │   pl_module._current_fx_name = prev_fx_name                                         │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/stra │
│ tegy.py:207 in backward                                                                          │
│                                                                                                  │
│   204 │   │   assert self.lightning_module is not None                                           │
│   205 │   │   closure_loss = self.precision_plugin.pre_backward(closure_loss, self.lightning_m   │
│   206 │   │                                                                                      │
│ ❱ 207 │   │   self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, o   │
│   208 │   │                                                                                      │
│   209 │   │   closure_loss = self.precision_plugin.post_backward(closure_loss, self.lightning_   │
│   210 │   │   self.post_backward(closure_loss)                                                   │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/precisi │
│ on/precision_plugin.py:67 in backward                                                            │
│                                                                                                  │
│    64 │   │   │   │   :meth:`~torch.Tensor.backward`.                                            │
│    65 │   │   │   \**kwargs: Keyword arguments for the same purpose as ``*args``.                │
│    66 │   │   """                                                                                │
│ ❱  67 │   │   model.backward(tensor, optimizer, optimizer_idx, *args, **kwargs)                  │
│    68 │                                                                                          │
│    69 │   def post_backward(self, tensor: Tensor, module: "pl.LightningModule") -> Tensor:  #    │
│    70 │   │   # once backward has been applied, release graph                                    │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/pytorch_lightning/core/module.py: │
│ 1486 in backward                                                                                 │
│                                                                                                  │
│   1483 │   │   if self._fabric:                                                                  │
│   1484 │   │   │   self._fabric.backward(loss, *args, **kwargs)                                  │
│   1485 │   │   else:                                                                             │
│ ❱ 1486 │   │   │   loss.backward(*args, **kwargs)                                                │
│   1487 │                                                                                         │
│   1488 │   def toggle_optimizer(self, optimizer: Union[Optimizer, LightningOptimizer], optimize  │
│   1489 │   │   """Makes sure only the gradients of the current optimizer's parameters are calcu  │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/torch/_tensor.py:488 in backward  │
│                                                                                                  │
│    485 │   │   │   │   create_graph=create_graph,                                                │
│    486 │   │   │   │   inputs=inputs,                                                            │
│    487 │   │   │   )                                                                             │
│ ❱  488 │   │   torch.autograd.backward(                                                          │
│    489 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    490 │   │   )                                                                                 │
│    491                                                                                           │
│                                                                                                  │
│ /home/ROCQ/almanach/achague/.local/lib/python3.8/site-packages/torch/autograd/__init__.py:197 in │
│ backward                                                                                         │
│                                                                                                  │
│   194 │   # The reason we repeat same the comment below is that                                  │
│   195 │   # some Python versions print out the first line of a multi-line function               │
│   196 │   # calls in the traceback and some print out the last line                              │
│ ❱ 197 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   198 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   199 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   200                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: ctc_loss_backward_gpu does not have a deterministic implementation, but you set
'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the
'warn_only=True' option, if that's acceptable for your application. You can also file an issue at
https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

As mentioned in #302, it seems that using ketos's -r is preferable to using -s 134 alone.

Do you have any recommendations for making training reproducible? And, incidentally, is there anything I can do to make my command work?

alix-tz commented 1 year ago

Just a follow-up: for now I'm using -s (as mentioned in #397).

mittagessen commented 1 year ago

I've set the deterministic mode to 'warn' now. If you want truly deterministic training for recognition, you'll have to skip CUDA, as the CTC loss doesn't have a deterministic implementation there, but the differences between two runs are probably negligible.
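
For anyone who wants to see what the 'warn' setting amounts to at the PyTorch level, here is a minimal sketch. It is not kraken's actual code: the helper name `ctc_grad` and the tensor sizes are made up for illustration. It relies on `torch.use_deterministic_algorithms(True, warn_only=True)`, the option named in the error message above, and runs the CTC loss on the CPU, where the backward pass does not hit the `ctc_loss_backward_gpu` restriction.

```python
import torch
import torch.nn.functional as F


def ctc_grad(seed: int, device: str = "cpu") -> torch.Tensor:
    """Hypothetical helper: gradient of a CTC loss on seeded random data."""
    torch.manual_seed(seed)
    # T = time steps, N = batch size, C = classes (index 0 is the blank),
    # S = target length. All sizes are arbitrary illustration values.
    T, N, C, S = 20, 2, 5, 6
    logits = torch.randn(T, N, C, device=device, requires_grad=True)
    targets = torch.randint(1, C, (N, S), device=device)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)
    loss = F.ctc_loss(logits.log_softmax(2), targets, input_lengths, target_lengths)
    loss.backward()
    return logits.grad.detach().cpu()


# Ask for deterministic kernels, but only warn (instead of raising a
# RuntimeError) when an operation has no deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)

g1 = ctc_grad(42)
g2 = ctc_grad(42)
print(torch.equal(g1, g2))
```

With `device="cuda"` the same call would be expected to emit a warning rather than the error above, and the gradients of two runs may then differ slightly, which matches the "probably negligible" differences mentioned in the reply.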