mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Colon in target label for --merge-regions #532

Closed: rohanchn closed 9 months ago

rohanchn commented 11 months ago

Hi @mittagessen!

I am working on a segmentation model where I am trying to use --merge-regions. I understand that the format for --merge-regions is target:src.

But my target labels contain a colon, e.g. MainZone:column#1 (SegmOnto), and I am getting an Invalid value for '-mr' error.

Perhaps you have a suggestion to solve this?

rohanchn commented 11 months ago

So, my syntax was wrong as I was using -mr multiple times. This actually is not a problem! Closing this now.

My syntax was indeed wrong; I have corrected it and training now runs. But it still won't pick up a mapping like MarginTextZone:commentary:MarginTextZone:note because of the colon in the target (and src), which is what I wanted to do in the first place.

I guess this function handles it, but I am not sure that merely changing the separator would work: https://github.com/mittagessen/kraken/blob/c7562facd4b0f260d2b1dae7e1af1c34bd2dfca8/kraken/ketos/segmentation.py#L40-L53
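To make the ambiguity concrete, here is a minimal sketch of a target:src parser. It is not kraken's actual code: `parse_merge_arg` and the `=>` separator are hypothetical names chosen only for illustration. It shows why a label containing a colon cannot be split unambiguously, and that a different separator would only help if the parser itself were changed to expect it.

```python
from typing import Tuple


def parse_merge_arg(arg: str, sep: str = ":") -> Tuple[str, str]:
    """Hypothetical parser: split 'target<sep>src' into (target, src)."""
    parts = arg.split(sep)
    if len(parts) != 2:
        # More than one separator: no way to tell where target ends.
        raise ValueError(f"expected exactly one {sep!r} in {arg!r}")
    return parts[0], parts[1]


# A label without colons parses fine.
ok = parse_merge_arg("text:paragraph")

# SegmOnto-style labels contain colons, so the split yields four parts.
try:
    parse_merge_arg("MarginTextZone:commentary:MarginTextZone:note")
    ambiguous = False
except ValueError:
    ambiguous = True

# With an assumed alternative separator the same mapping is unambiguous.
alt = parse_merge_arg("MarginTextZone:commentary=>MarginTextZone:note", sep="=>")
```

Any separator that can never occur inside a label would work in this sketch, but that is a change to the parser, not something that can be passed on the existing command line.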

mittagessen commented 11 months ago

It isn't possible on the command line, as the parser for the mapping doesn't support escaping colons. But you can create arbitrary mappings when training with the API (or use dummy identifiers in your data and, after training, rename the identifiers in the model metadata).
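Both suggestions come down to working with a plain mapping instead of a colon-separated string. The sketch below is pure Python and does not call kraken; the dictionary layout (source label as key, target label as value), the helper `merge_label`, and the dummy names are assumptions for illustration, and the thread does not say which API parameter receives such a mapping.

```python
from typing import Dict, List

# Assumed layout: source label -> target label. Colons need no escaping here.
merge_regions: Dict[str, str] = {
    "MarginTextZone:note": "MarginTextZone:commentary",
}


def merge_label(label: str, mapping: Dict[str, str]) -> str:
    """Return the merged label, or the label unchanged if it is not mapped."""
    return mapping.get(label, label)


labels: List[str] = [
    "MainZone:column#1",
    "MarginTextZone:note",
    "MarginTextZone:commentary",
]
merged = [merge_label(label, merge_regions) for label in labels]

# Dummy-identifier variant: train with colon-free placeholder names,
# then rename them back afterwards. The placeholder names are hypothetical.
dummy_to_real: Dict[str, str] = {
    "MarginTextZone_commentary": "MarginTextZone:commentary",
}
restored = [merge_label(label, dummy_to_real) for label in ["MarginTextZone_commentary"]]
```

The first mapping is what the command-line string would have expressed; the second is the rename step applied to the identifiers stored with the trained model.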

rohanchn commented 11 months ago

I think I will try training with the API. I find the latter slightly tricky. Thank you!