mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

unexpected class mappings mismatch after roadd #549

Closed Svetlana-Yatsyk closed 11 months ago

Svetlana-Yatsyk commented 1 year ago

Hello,

I am discovering the reading order models and I have a couple of questions about their training.

First, I trained a RO model on the data exported from eScriptorium. "default" lines are not present in my ontology. I double checked the PAGE files: tag "default" is nowhere to be found.

However, after trying to add the RO model to the segmentation model, I get this error:

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Adding /content/gdrive/MyDrive/reading_order_models/RO_242.mlmodel reading order model to /content/gdrive/MyDrive/yaltai/segm_baselines_medieval_Thibault.mlmodel.
Line classes known to RO model: DefaultLine 1 > default 2
Line classes known to segmentation model: DefaultLine 2
Usage: ketos roadd [OPTIONS]
Try 'ketos roadd --help' for help.

Error: Model /content/gdrive/MyDrive/yaltai/segm_baselines_medieval_Thibault.mlmodel and /content/gdrive/MyDrive/reading_order_models/RO_242.mlmodel class mappings mismatch.

Where does "default" line class come from in my RO model?

And the second question: why does the training process, once stopped by early stopping, give me this error: TypeError: '>=' not supported between instances of 'int' and 'str'?

mittagessen commented 1 year ago

On 23/10/19 01:33AM, Svetlana Yatsyk wrote:

> Where does "default" line class come from in my RO model?

Any line that doesn't have a class is assigned to default. There's probably one or more stray lines that aren't more specifically annotated.
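One way to look for such stray lines is to scan the PAGE files for `TextLine` elements whose `custom` attribute carries no `structure {type:...;}` entry. The following is a minimal sketch, not kraken's own parser: the function name `line_types`, the regular expression, and the sample document are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Assumed pattern for eScriptorium-style annotations such as
# custom="structure {type:DefaultLine;}"
TYPE_RE = re.compile(r"structure\s*\{[^}]*type:\s*([^;}]+)")


def line_types(page_xml: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (line id, type or None) for every TextLine in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    result = []
    for elem in root.iter():
        # Compare local names only, so the namespace version does not matter.
        if elem.tag.rsplit('}', 1)[-1] != 'TextLine':
            continue
        match = TYPE_RE.search(elem.get('custom', ''))
        result.append((elem.get('id'), match.group(1).strip() if match else None))
    return result


# Hypothetical two-line page: one typed line, one line without any type.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1" custom="structure {type:DefaultLine;}"/>
      <TextLine id="l2"/>
    </TextRegion>
  </Page>
</PcGts>"""

untyped = [line_id for line_id, line_type in line_types(SAMPLE) if line_type is None]
print(untyped)
```

If the list of untyped lines comes back empty for every file, stray annotations are unlikely to be the source of the extra class.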

> And the second question: why does the training process, once stopped by early stopping, give me this error: TypeError: '>=' not supported between instances of 'int' and 'str'?

That's probably a bug. Could you give me the whole traceback? It should say more specifically where the error occurred.

Svetlana-Yatsyk commented 1 year ago

Here it is:

stage 534/∞ ━━━━━━━━━━━━━━━━━ 23/23 0:00:17 • 0:00:00 1.35it/s val_spearman: 0.059 val_loss: 0.183 early_stopping: 300/300 0.05816
Traceback (most recent call last)

/usr/local/bin/ketos:8 in <module>

    5  from kraken.ketos import cli
    6  if __name__ == '__main__':
    7      sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱ 8      sys.exit(cli())
    9

/usr/local/lib/python3.9/dist-packages/click/core.py:1157 in __call__
/usr/local/lib/python3.9/dist-packages/click/core.py:1078 in main
/usr/local/lib/python3.9/dist-packages/click/core.py:1688 in invoke
/usr/local/lib/python3.9/dist-packages/click/core.py:1434 in invoke
/usr/local/lib/python3.9/dist-packages/click/core.py:783 in invoke
/usr/local/lib/python3.9/dist-packages/click/decorators.py:33 in new_func

/usr/local/lib/python3.9/dist-packages/kraken/ketos/ro.py:257 in rotrain

    254                              *val_check_interval)
    255
    256      with threadpool_limits(limits=threads):
  ❱ 257          trainer.fit(model)
    258
    259      if quit == 'early':
    260          message('Moving best model {0}_{1}.mlmodel ({2}) to {0}_best.mlmodel'.format(

/usr/local/lib/python3.9/dist-packages/kraken/lib/train.py:126 in fit

    123          with warnings.catch_warnings():
    124              warnings.filterwarnings(action='ignore', category=UserWarning,
    125                                      message='The dataloader,')
  ❱ 126              super().fit(*args, **kwargs)
    127
    128
    129  class KrakenFreezeBackbone(BaseFinetuning):

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py:532 in fit

    529          model = _maybe_unwrap_optimized(model)
    530          self.strategy._lightning_module = model
    531          _verify_strategy_supports_compile(model, self.strategy)
  ❱ 532          call._call_and_handle_interrupt(
    533              self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,
    534          )
    535

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/call.py:43 in _call_and_handle_interrupt

    40      try:
    41          if trainer.strategy.launcher is not None:
    42              return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,
  ❱ 43          return trainer_fn(*args, **kwargs)
    44
    45      except _TunerExitException:
    46          _call_teardown_hook(trainer)

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py:571 in _fit_impl

    568              model_provided=True,
    569              model_connected=self.lightning_module is not None,
    570          )
  ❱ 571          self._run(model, ckpt_path=ckpt_path)
    572
    573          assert self.state.stopped
    574          self.training = False

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py:980 in _run

    977          # ----------------------------
    978          # RUN THE TRAINER
    979          # ----------------------------
  ❱ 980          results = self._run_stage()
    981
    982          # ----------------------------
    983          # POST-Training CLEAN UP

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py:1023 in _run_stage

    1020              with isolate_rng():
    1021                  self._run_sanity_check()
    1022              with torch.autograd.set_detect_anomaly(self._detect_anomaly):
  ❱ 1023                  self.fit_loop.run()
    1024              return None
    1025          raise RuntimeError(f"Unexpected state {self.state}")
    1026

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/loops/fit_loop.py:199 in run

    196              return
    197          self.reset()
    198          self.on_run_start()
  ❱ 199          while not self.done:
    200              try:
    201                  self.on_advance_start()
    202                  self.advance()

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/loops/fit_loop.py:180 in done

    177              rank_zero_info(f"Trainer.fit stopped: max_epochs={self.max_epochs!r} rea
    178              return True
    179
  ❱ 180          if self.trainer.should_stop and self._can_stop_early:
    181              rank_zero_debug("Trainer.fit stopped: trainer.should_stop was set.")
    182              return True
    183

/usr/local/lib/python3.9/dist-packages/pytorch_lightning/loops/fit_loop.py:147 in _can_stop_early

    144
    145      @property
    146      def _can_stop_early(self) -> bool:
  ❱ 147          met_min_epochs = self.epoch_progress.current.processed >= self.min_epochs if sel
    148          met_min_steps = self.epoch_loop.global_step >= self.min_steps if self.min_steps
    149          return met_min_epochs and met_min_steps
    150

TypeError: '>=' not supported between instances of 'int' and 'str'

mittagessen commented 1 year ago

Thanks. I'll see if I can reproduce it but it looks like a bug in pytorch-lightning.
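The last frame of the traceback compares a processed-epoch counter with `min_epochs` using `>=`. Python raises exactly this `TypeError` when the left operand is an `int` and the right one is a `str`. The sketch below only reproduces that comparison in isolation. The function `can_stop_early` is a hypothetical stand-in, the `else True` branch is an assumption because the original source line is truncated in the traceback, and where a string value would have come from is not established in this thread.

```python
def can_stop_early(processed_epochs: int, min_epochs) -> bool:
    """Stand-in for the comparison shown in the last traceback frame."""
    # Assumption: a falsy min_epochs means "no minimum".
    return processed_epochs >= min_epochs if min_epochs else True


# A minimum that arrives as a string breaks the comparison.
try:
    can_stop_early(534, "5")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Coercing the value to int before comparing avoids the error.
print(can_stop_early(534, int("5")))
```

Coercing the option to an integer where it is parsed would be one possible remedy under that assumption.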

Svetlana-Yatsyk commented 1 year ago

I am sorry for bothering you, but I still have the class mapping issue. I checked again, all the lines in my dataset are annotated (custom="structure {type:DefaultLine;}"). Here are the files I use: plutei_28_sin2.zip. However, the RO model trained on this data still knows two classes.

I ran a test with only 3 pages, and still got the same error.

mittagessen commented 1 year ago

> I ran a test with only 3 pages, and still got the same error.

OK, then there's probably an issue with a default class which I didn't test during development. I'm teaching today but will have some time to check it during the weekend.

Svetlana-Yatsyk commented 1 year ago

Dear Ben, do you by chance have any news on this subject?

rohanchn commented 1 year ago

This might be naive, but I'll still ask: wouldn't it be reasonable to map only those classes that are common to both ro_net and seg_net and ignore the rest in ro_net? Sometimes a user might exclude a few less frequent classes when training a segmentation model with -vb, but rotrain doesn't provide any such option, and I think I can also see why. In that case, roadd will always fail.
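The suggestion can be illustrated with plain dictionaries. This is a sketch of the idea only, not what `ketos roadd` does: the function `reconcile` is hypothetical, the mappings are copied from the log output quoted at the top of this thread, and whether a trained reading order model can safely drop a class this way is not addressed here.

```python
from typing import Dict, List, Tuple


def reconcile(ro_classes: Dict[str, int],
              seg_classes: Dict[str, int]) -> Tuple[Dict[str, int], List[str]]:
    """Keep RO classes the segmentation model also knows; report the rest."""
    shared = {name: seg_classes[name] for name in ro_classes if name in seg_classes}
    ignored = sorted(set(ro_classes) - set(seg_classes))
    return shared, ignored


# Class mappings as printed by roadd in the first comment.
ro_classes = {'DefaultLine': 1, 'default': 2}
seg_classes = {'DefaultLine': 2}

shared, ignored = reconcile(ro_classes, seg_classes)
print(shared)
print(ignored)
```

Reporting the ignored classes rather than dropping them silently would let the user see that `default` was never part of the segmentation model.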

mittagessen commented 1 year ago

On 23/10/24 02:23AM, Svetlana Yatsyk wrote:

> Dear Ben, do you by chance have any news on this subject?

Yes, sorry for the delay. I've found the bug, the fix is in the process of being merged.

Svetlana-Yatsyk commented 1 year ago

Thank you for looking at it!

I tried to train a model on 2 images to check whether the bug was gone. After reaching the set number of epochs, the training stopped with an error (the one you attributed to a bug in pytorch-lightning):

[Same traceback as in the earlier comment, ending in:]
TypeError: '>=' not supported between instances of 'int' and 'str'

However, the training did produce several models. I tried to add them to a segmentation model, but this failed again because of the class mapping mismatch.

Svetlana-Yatsyk commented 12 months ago

I am training a model on these 2 XML files: https://drive.google.com/drive/folders/1-hj_dO9EOLX20nSSp7DgD4MfoX6ZUtY6?usp=sharing. All the lines have {type:DefaultLine;}; there is not a single "default" line. However, when I launch the training, I see that "default" lines are present in the training data.

[Image: Screenshot 2023-12-01 at 15 47 36]

Please help me understand the reasoning behind it.

mittagessen commented 11 months ago

Sorry, I had screwed up the merge into the main branch and for some reason the earlier fix didn't get in there. You should now be able to train reading order models from the main branch of kraken without spurious line types.