mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0
673 stars, 125 forks

finetune error on altos containing > or < as txt #616

Closed dstoekl closed 2 weeks ago

dstoekl commented 2 weeks ago

Fine-tuning on ALTO files whose text contains > or < fails with the following error: [image] [image]

dstoekl commented 2 weeks ago

On training from scratch it advances by one codepoint: [image]

mittagessen commented 2 weeks ago

On 24/06/17 01:43PM, Daniel Stoekl wrote:

on training from scratch it advances by one codepoint image

How are they encoded in the XML? As normal XML entities?

dstoekl commented 2 weeks ago

image

mittagessen commented 2 weeks ago

I can't reproduce it. When I train on other data with > it works as expected, and nothing treats those entities differently, in particular not the codec. If you give me access to your data I can take a look.

mittagessen commented 2 weeks ago

Thinking about it, this is probably caused by Unicode mirroring: > is a mirrored character. The codec is constructed while the text is still in logical order, but the actual encoding happens on lines already transformed into display order. That transformation maps U+003E (>) to U+003C (<), which isn't in the codec because your data doesn't contain any U+003C (<).
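
A minimal sketch of the mismatch described above, using only the Python standard library. It is not kraken's code: `toy_display_order`, `build_alphabet`, and the partial `MIRROR` table are hypothetical stand-ins, and the reordering is a crude approximation of the bidi algorithm that only holds for a purely right-to-left line.

```python
import unicodedata

# Hypothetical, partial mirror table for illustration only; the complete
# mapping is defined by Unicode in BidiMirroring.txt.
MIRROR = {">": "<", "<": ">", "(": ")", ")": "(", "[": "]", "]": "["}


def toy_display_order(line: str) -> str:
    """Crude stand-in for bidi reordering of a purely RTL line:
    reverse the characters and swap mirrored ones."""
    return "".join(MIRROR.get(ch, ch) for ch in reversed(line))


def build_alphabet(lines):
    """Collect every character that occurs in the given lines."""
    return {ch for line in lines for ch in line}


# The alphabet (the codec's input) is built from logical-order text.
logical_lines = ["אבג > דהו"]
alphabet = build_alphabet(logical_lines)

# Encoding then sees display-order text, where > has become <.
display_lines = [toy_display_order(line) for line in logical_lines]
missing = sorted(build_alphabet(display_lines) - alphabet)

is_mirrored = unicodedata.mirrored(">")  # 1 for characters with Bidi_Mirrored=Y
print(is_mirrored, missing)
```

Here `missing` contains the one character that the display-order line needs but the logical-order alphabet never saw.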

As a hotfix you can manually add a codec that includes < when training from scratch. I need to think a bit about whether just creating the codec from display-order lines is going to break anything else.
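
One way the hotfix could be prepared, as a sketch under assumptions: build a character-to-label mapping that also contains the mirrored partner of every angle bracket, then serialize it to JSON. The helper `codec_mapping` and the `EXTRA` table are hypothetical, and the layout (each character mapped to a list of integer labels starting at 1) is an assumption about what a codec definition looks like; check it against the kraken documentation before passing it to training.

```python
import json

# Hypothetical table of mirrored partners to force into the codec.
EXTRA = {">": "<", "<": ">"}


def codec_mapping(lines):
    """Map every character (plus mirrored partners) to a one-label sequence.

    Labels start at 1 on the assumption that 0 is reserved.
    """
    chars = {ch for line in lines for ch in line}
    chars |= {EXTRA[ch] for ch in chars if ch in EXTRA}
    return {ch: [idx] for idx, ch in enumerate(sorted(chars), start=1)}


# Training text that contains > but never <.
mapping = codec_mapping(["abc > def"])
codec_json = json.dumps(mapping, ensure_ascii=False, indent=2)
print(codec_json)
```

Because the mapping is computed from the logical-order text and then extended, it covers both brackets no matter which one the display-order lines end up containing.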

mittagessen commented 2 weeks ago

You can also disable the mirroring by wrapping the > in Left-to-Right markers, which makes the workaround compatible with fine-tuning as well.
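
A short sketch of that wrapping step, assuming "Left-to-Right markers" refers to U+200E LEFT-TO-RIGHT MARK. The helper `protect` is hypothetical, and how the marks are treated later in the training pipeline is not covered by this thread.

```python
import unicodedata
from xml.sax.saxutils import escape

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (assumed meaning of "Left-to-Right markers")


def protect(text: str, chars: str = "<>") -> str:
    """Surround each listed character with LRMs so it resolves as left-to-right."""
    return "".join(f"{LRM}{ch}{LRM}" if ch in chars else ch for ch in text)


wrapped = protect("אבג > דהו")
# Escape for use inside an XML attribute or text node, as in an ALTO file.
xml_ready = escape(wrapped)
lrm_class = unicodedata.bidirectional(LRM)  # bidi class of the mark itself
print(repr(xml_ready), lrm_class)
```

The marks are invisible in most viewers, so printing with `repr` is the easiest way to confirm they were inserted.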

dstoekl commented 2 weeks ago

It is unclear to users and will crash in eScr