mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

`No boundary given for line` and very long repolygonize #259

Closed PonteIneptique closed 3 years ago

PonteIneptique commented 3 years ago

Hey @mittagessen, currently it seems that some exports from eScriptorium are missing masks, or at least that something is preventing Kraken from using all the data directly. From what I understand, in this context, --repolygonize is the way to go.

My question has two or three parts:

  1. What's the best way to deal with `No boundary given for line`, from an end-user perspective?
  2. Could repolygonize profit from multithreading? Right now it's not parallelized; the easiest place to add it would probably be here: https://github.com/mittagessen/kraken/blob/0f6bfd21f60c6dbb39e86c56474b052d029cf332/kraken/lib/dataset.py#L299
  3. If we have repolygonization, could we save it somehow (connected to 1), just to avoid redoing it for the next training?
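
To make question 2 concrete, here is a minimal sketch of fanning per-line work out to a worker pool. It is not kraken's code: `fake_repolygonize` is a hypothetical stand-in for the real boundary computation, and `repolygonize_all` and its parameters are assumptions made for illustration only.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[int, int]
Baseline = List[Point]
Polygon = List[Point]


def fake_repolygonize(baseline: Baseline, height: int = 10) -> Polygon:
    """Hypothetical stand-in for the real per-line boundary computation.

    It only builds a band `height` pixels above and below the baseline.
    """
    upper = [(x, y - height) for x, y in baseline]
    lower = [(x, y + height) for x, y in reversed(baseline)]
    return upper + lower


def repolygonize_all(
    baselines: Sequence[Baseline],
    worker: Callable[[Baseline], Polygon] = fake_repolygonize,
    executor_cls=ProcessPoolExecutor,
    max_workers: Optional[int] = None,
) -> List[Polygon]:
    """Run `worker` on every baseline in parallel, preserving input order."""
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(worker, baselines))
```

With the default process pool, the call should sit under `if __name__ == "__main__":` in a script. Passing `executor_cls=ThreadPoolExecutor` only helps if the real computation releases the GIL.
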
PonteIneptique commented 3 years ago

BTW, if you want to test anything regarding this, we got a dataset there: https://github.com/HTR-United/cremma-medieval

mittagessen commented 3 years ago

Fix the bug in escriptorium. ;)

  2. Could repolygonize profit from multithreading? Right now it's not parallelized; the easiest place to add it would probably be here: https://github.com/mittagessen/kraken/blob/0f6bfd21f60c6dbb39e86c56474b052d029cf332/kraken/lib/dataset.py#L299

Hmm, I could move it from the XML parser to the dataset.

  3. If we have repolygonization, could we save it somehow (connected to 1), just to avoid redoing it for the next training?

There is (was actually) already a script in contrib/ called repolygonize.py that did exactly that. I've committed a broken version while rewriting it but will fix it before the end of the week.
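
For illustration only, the general idea behind question 3 (computing boundaries once and reusing them on the next training run) could look like the sketch below. This is not the contrib/ script, whose contents are not shown in this thread; `cached_boundaries`, the `compute` callback and the JSON cache layout are all assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List, Union

# Boundaries as JSON-friendly data: one list of [x, y] pairs per line.
Boundaries = List[List[List[int]]]


def cached_boundaries(
    xml_path: Union[str, Path],
    compute: Callable[[Path], Boundaries],
    cache_dir: Union[str, Path],
) -> Boundaries:
    """Return boundaries for `xml_path`, computing them only on a cache miss.

    `compute` is a hypothetical callback that does the expensive work.
    The cache key is the SHA-256 of the file content, so an edited
    transcription file is recomputed automatically.
    """
    xml_path = Path(xml_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(xml_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    boundaries = compute(xml_path)
    cache_file.write_text(json.dumps(boundaries), encoding="utf-8")
    return boundaries
```

Keying on the file content rather than the file name is a design choice made here so that re-exported pages do not silently reuse stale polygons.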

PonteIneptique commented 3 years ago

  1. The answer was expected, but you actually give other ideas further down in 3. ;)
  2. I'll let you decide :)
  3. That'd be cool

lauxley commented 3 years ago

Hello, if you want polygons, be sure to click on the green button in the segmentation panel ("Segmentation is ready for mask calculation!"). The quality of the polygons is a lot better once all the lines are drawn, and we can't really guess when that is the case, which is why this is not automatic. You only need to do it once per page; after that they are recalculated automatically if need be. If you import data without polygons, you can batch it by selecting your images and choosing 'Only line masks' in the segment form. Hope it helps.

PonteIneptique commented 3 years ago

if you want polygons be sure to click on the green button in the segmentation panel ("Segmentation is ready for mask calculation!")

I might be stupid, but could you send me a screenshot of where this is? :)

There is (was actually) already a script in contrib/ called repolygonize.py that did exactly that. I've committed a broken version while rewriting it but will fix it before the end of the week.

I'll close the issue when this is out :D

dstoekl commented 3 years ago

the green thumbs up button above the segmentation panel (2)

PonteIneptique commented 3 years ago

Well, we're going to continue this conversation about segmentation on GitLab, because we don't have any such button. [image]

dstoekl commented 3 years ago

It is indeed visible only if no line has a polygon. What were your previous actions on this page?