mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Added some tests to check that Arrows, XML, fine-tuning and codec work nicely together #476

Closed PonteIneptique closed 1 year ago

PonteIneptique commented 1 year ago

Hi there :) We were tracking down a bug the other day and prepared a dummy dataset to reproduce it. I thought it would be good to propose integration tests that check a suite of combinations (Unicode normalization, fine-tuning, Arrow or XML).

I left in the data used to generate the Arrow files, in case the Arrow format changes one day and we need to regenerate them.

These check things a little differently from test_train.py, so I hope you will find them useful.
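
As a rough illustration of the kind of combination suite described above, the sketch below enumerates a small test matrix and shows why Unicode normalization matters for the alphabet a codec sees. All names here (NORMALIZATIONS, DATA_FORMATS, TRAINING_MODES, alphabet, test_matrix) are hypothetical and are not taken from kraken or from the tests in this PR.

```python
import itertools
import unicodedata

# Hypothetical axes of the test matrix; illustrative names only, not kraken's.
NORMALIZATIONS = ("NFC", "NFD")
DATA_FORMATS = ("arrow", "xml")
TRAINING_MODES = ("from_scratch", "fine_tune")


def alphabet(lines, normalization):
    """Return the set of code points seen after Unicode normalization."""
    chars = set()
    for line in lines:
        chars.update(unicodedata.normalize(normalization, line))
    return chars


def test_matrix():
    """Return every (normalization, format, mode) combination to exercise."""
    return list(itertools.product(NORMALIZATIONS, DATA_FORMATS, TRAINING_MODES))


# The same text yields different alphabets depending on the normalization:
# NFC keeps U+00E9 as one code point, NFD splits it into "e" + U+0301.
sample = ["caf\u00e9"]
nfc_chars = alphabet(sample, "NFC")
nfd_chars = alphabet(sample, "NFD")
```

Because the alphabet changes with the normalization form, a model trained under one form and fine-tuned under another can end up with a mismatched codec, which is presumably why these axes are worth testing together.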

mittagessen commented 1 year ago

Thanks, I wanted to write some integration tests for the ptl modules but stopped after ending up with waay too many cases. It's a good start and I'll probably add some more in the future.

PonteIneptique commented 1 year ago

Thanks for the merge :) Just a quick note: I am still quite unsatisfied with the naming scheme for codec merging. Specifically, add and both read as near-synonyms to me (add adds the new characters; both sounds like it uses both sets of characters, while it actually only uses the new dataset's vocabulary). While it might be a breaking change, I would very much recommend moving to add and replace, or something like that.

PonteIneptique commented 1 year ago

To handle the change from both to something like replace cleanly, we could keep accepting the both argument, emit a DeprecationWarning, and map it to replace.
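
A minimal sketch of that deprecation path, assuming the mode arrives as a plain string. The function name normalize_codec_mode and the two lookup tables are hypothetical; only the values add, both and replace come from this thread, and this is not kraken's actual implementation.

```python
import warnings

# Hypothetical: the thread only proposes "add" and "replace" as the new names.
VALID_MODES = {"add", "replace"}
# Old spelling mapped to its proposed replacement.
DEPRECATED_ALIASES = {"both": "replace"}


def normalize_codec_mode(mode):
    """Map a deprecated mode name to its new name, warning the caller."""
    if mode in DEPRECATED_ALIASES:
        new_mode = DEPRECATED_ALIASES[mode]
        warnings.warn(
            f"codec mode '{mode}' is deprecated, use '{new_mode}' instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return new_mode
    if mode not in VALID_MODES:
        raise ValueError(f"unknown codec mode: {mode!r}")
    return mode
```

Existing scripts that pass both would keep working and only see a warning, so the rename would not have to be a hard breaking change.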

dstoekl commented 1 year ago

I support this, and would also support the possibility of training while excluding new characters. Something that would be clear and not break previous behavior could be:
- old_only
- new_only (corresponds to today's both)
- old_and_new (corresponds to today's add)
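
To make the three proposed names concrete, here is a small sketch of the vocabulary each one would select, treating alphabets as plain sets of characters. The function merge_alphabet is hypothetical and only models the set logic as described in this thread; whatever the real codec merging does to the network itself is not shown.

```python
def merge_alphabet(old, new, mode):
    """Return the characters kept under each proposed mode (hypothetical helper)."""
    old_chars, new_chars = set(old), set(new)
    if mode == "old_only":
        # Keep the existing model vocabulary, ignoring new characters.
        return old_chars
    if mode == "new_only":
        # Corresponds to today's "both": only the new dataset's vocabulary.
        return new_chars
    if mode == "old_and_new":
        # Corresponds to today's "add": existing vocabulary plus new characters.
        return old_chars | new_chars
    raise ValueError(f"unknown mode: {mode!r}")


# Example: the model knows "abc", the new dataset contains "bcd".
kept_old = merge_alphabet("abc", "bcd", "old_only")
kept_new = merge_alphabet("abc", "bcd", "new_only")
kept_all = merge_alphabet("abc", "bcd", "old_and_new")
```

Written this way, each name states which vocabulary survives, which is the ambiguity the add/both pair leaves open.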

PonteIneptique commented 1 year ago

Maybe we should open a specific issue for this ;) See https://github.com/mittagessen/kraken/issues/478