finetune error on altos containing > or < as txt

mittagessen / kraken

OCR engine for all the languages

http://kraken.re

Apache License 2.0


finetune error on altos containing > or < as txt #616

Closed. dstoekl closed this issue 5 months ago

dstoekl commented 5 months ago

Fine-tuning on ALTO files whose text contains > or < fails with the following error:

dstoekl commented 5 months ago

When training from scratch, it advances by one codepoint.

mittagessen commented 5 months ago

On 24/06/17 01:43PM, Daniel Stoekl wrote:

on training from scratch it advances by one codepoint

How are they encoded in the XML? As normal XML entities?

dstoekl commented 5 months ago

mittagessen commented 5 months ago

I can't reproduce it. When I train on other data with > it works as expected and there's nothing that treats those entities differently, in particular not the codec. If you give me access to your data I can take a look.

mittagessen commented 5 months ago

Thinking about it, this is probably caused by Unicode mirroring: > is a mirrored character. The codec is constructed while the text is still in logical order, but the actual encoding happens on lines already transformed into display order. That transformation maps 0x003E (>) to 0x003C (<), which isn't in the codec because your data doesn't contain any 0x003C.
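A minimal sketch of the mismatch described above, using only the Python standard library. It is not kraken's code: `toy_display_order`, `MIRROR_PAIRS`, and `build_codec` are hypothetical helpers, and the reordering is a crude stand-in for the real bidirectional algorithm that is only meant for a purely right-to-left line.

```python
import unicodedata

# Hypothetical, tiny mirror table for illustration only; the complete
# mapping is defined by the Unicode Character Database.
MIRROR_PAIRS = {"<": ">", ">": "<", "(": ")", ")": "("}


def toy_display_order(line: str) -> str:
    """Rough stand-in for bidi reordering of a purely right-to-left line:
    reverse the characters and swap mirrored ones."""
    return "".join(MIRROR_PAIRS.get(ch, ch) for ch in reversed(line))


def build_codec(lines):
    """Collect the set of code points seen in the given lines."""
    return set("".join(lines))


# Both angle brackets carry the Bidi_Mirrored property.
is_mirrored = {ch: bool(unicodedata.mirrored(ch)) for ch in "<>a"}

# Hebrew alef, '>', Hebrew bet, stored in logical order.
logical_lines = ["\u05d0 > \u05d1"]

# The codec is built from logical-order text ...
codec = build_codec(logical_lines)

# ... but encoding sees display-order text, where '>' has become '<'.
display_lines = [toy_display_order(line) for line in logical_lines]
missing = sorted(set("".join(display_lines)) - codec)

print(is_mirrored)
print(missing)
```

In this toy setup the only code point missing from the codec is `<`, which matches the failure mode suspected in the comment.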

As a hotfix, you can manually add a codec that includes < when training from scratch. I need to think a bit about whether just creating the codec from display-order lines is going to break anything else.
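One way to prepare such a codec by hand could look like the sketch below. It assumes the codec can be supplied as a JSON object mapping each character to a list of integer labels, with label 0 left unused; that layout is an assumption here and should be checked against the kraken documentation. `codec_mapping` is a hypothetical helper, not part of kraken.

```python
import json


def codec_mapping(lines, extra_chars=""):
    """Map every character in the lines, plus any extra characters,
    to a single-label list. Labels start at 1 (0 is assumed reserved)."""
    chars = sorted(set("".join(lines)) | set(extra_chars))
    return {ch: [label] for label, ch in enumerate(chars, start=1)}


# Training text in logical order contains '>' but no '<'.
logical_lines = ["\u05d0 > \u05d1"]

# Add both angle brackets explicitly so the mirrored form is covered.
mapping = codec_mapping(logical_lines, extra_chars="<>")

# Serialized form that could be written to a file and handed to training.
codec_json = json.dumps(mapping, ensure_ascii=False, indent=2)
print(codec_json)
```

Adding both brackets rather than only `<` keeps the codec valid whichever of the two ends up in the display-order lines.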

mittagessen commented 5 months ago

You can also disable the mirroring by wrapping the > in Left-to-Right markers, which makes it compatible with fine-tuning as well.
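A small sketch of that workaround, assuming "Left-to-Right markers" refers to U+200E LEFT-TO-RIGHT MARK. `protect_angle_brackets` is a hypothetical helper that operates on an already extracted transcription string; applying it to ALTO files would additionally require reading and writing the XML.

```python
import re

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (assumed to be the marker meant above)

# Match an angle bracket together with any LRM already next to it, so that
# running the function twice does not stack markers.
_BRACKET = re.compile(f"{LRM}?([<>]){LRM}?")


def protect_angle_brackets(text: str) -> str:
    """Surround every '<' and '>' with LEFT-TO-RIGHT MARKs."""
    return _BRACKET.sub(f"{LRM}\\1{LRM}", text)


line = "\u05d0 > \u05d1"
protected = protect_angle_brackets(line)
print(repr(protected))
```

The markers are invisible but remain in the text as real code points, so they need to be applied consistently to all training and fine-tuning data.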

dstoekl commented 5 months ago

It is unclear to users and will crash in eScr
