mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0
751 stars, 131 forks

pretrain: UnboundLocalError #626

Status: Closed (katharinaost closed this issue 4 months ago)

katharinaost commented 4 months ago

When I run a minimal pretraining job without specifying a validation set or using forced splits, I get an error about an unassigned variable:

    root@local:/workspace# ketos pretrain -f binary training.arrow
    Traceback (most recent call last):

    /usr/local/bin/ketos:8 in <module>

         5 from kraken.ketos import cli
         6 if __name__ == '__main__':
         7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
      ❱  8     sys.exit(cli())
         9

    /usr/local/lib/python3.10/dist-packages/click/core.py:1157 in __call__
    /usr/local/lib/python3.10/dist-packages/click/core.py:1078 in main
    /usr/local/lib/python3.10/dist-packages/click/core.py:1688 in invoke
    /usr/local/lib/python3.10/dist-packages/click/core.py:1434 in invoke
    /usr/local/lib/python3.10/dist-packages/click/core.py:783 in invoke
    /usr/local/lib/python3.10/dist-packages/click/decorators.py:33 in new_func

    /usr/local/lib/python3.10/dist-packages/kraken/ketos/pretrain.py:261 in pretrain

        258                                  load_hyper_parameters=load_hyper_parameters,
        259                                  legacy_polygons=legacy_polygons)
        260
      ❱ 261     data_module = PretrainDataModule(batch_size=hyper_params.pop('batch_size'),
        262                                      pad=hyper_params.pop('pad'),
        263                                      augment=hyper_params.pop('augment'),
        264                                      training_data=ground_truth,

    /usr/local/lib/python3.10/dist-packages/kraken/lib/pretrain/model.py:207 in __init__

        204
        205         if format_type == 'binary':
        206             legacy_train_status = train_set.legacy_polygons_status
      ❱ 207             if val_set and val_set.legacy_polygons_status != legacy_train_status:
        208                 logger.warning('Train and validation set have different legacy '
        209                                f'polygon status: {legacy_train_status} and '
        210                                f'{val_set.legacy_polygons_status}. Train set '

    UnboundLocalError: local variable 'val_set' referenced before assignment

It seems to me that the issue is that line 207 uses the local name val_set rather than self.val_set. val_set is assigned in the code paths starting at lines 185 and 190, but it remains unassigned in the "else" code path starting at line 196.
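The failure mode described above can be reproduced outside kraken with a minimal sketch. Everything in it is hypothetical and for illustration only: `FakeDataset`, `build_buggy`, `build_fixed`, and the `mode` argument are stand-ins, not kraken's actual classes, branch conditions, or the fix that was committed. It only shows why a name bound in two of three branches raises `UnboundLocalError` on the third, and one common remedy (binding the name to `None` before branching).

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset object; not a kraken class."""

    def __init__(self, legacy_polygons_status):
        self.legacy_polygons_status = legacy_polygons_status


def build_buggy(mode):
    """Binds val_set in only two of three branches, like the reported code path."""
    train_set = FakeDataset(False)
    if mode == 'explicit':      # assumption: a validation set was passed in
        val_set = FakeDataset(True)
    elif mode == 'split':       # assumption: a forced split is used
        val_set = FakeDataset(False)
    else:
        pass                    # val_set is never assigned on this path
    # Raises UnboundLocalError when the else branch was taken.
    if val_set and val_set.legacy_polygons_status != train_set.legacy_polygons_status:
        return 'mismatch'
    return 'ok'


def build_fixed(mode):
    """Same logic, but val_set is bound to None before any branch runs."""
    train_set = FakeDataset(False)
    val_set = None
    if mode == 'explicit':
        val_set = FakeDataset(True)
    elif mode == 'split':
        val_set = FakeDataset(False)
    # val_set is always bound now; None short-circuits the comparison.
    if val_set and val_set.legacy_polygons_status != train_set.legacy_polygons_status:
        return 'mismatch'
    return 'ok'


def raises_unbound(func, mode):
    """Return True if calling func(mode) raises UnboundLocalError."""
    try:
        func(mode)
    except UnboundLocalError:
        return True
    return False
```

In this sketch, `build_buggy('none')` fails in the same way as the traceback, while `build_fixed('none')` falls through to the final return. Whether the real code should read a local variable or an attribute such as self.val_set depends on how the surrounding class stores its datasets, which this example does not model.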

The offending code was added in this commit about 4 months ago.

mittagessen commented 4 months ago

Sorry for the delay. It was a very stupid error introduced by the new line extractor implementation. It should be fixed in main now.