qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Run Time and Reading Order (of Text Recognition) #60

Closed Aysoltan closed 2 years ago

Aysoltan commented 2 years ago

Hello everyone,

I recently tried eynollah to get better region recognition and noticed two things:

  1. The run time was approx. 25 minutes per page.
  2. After text recognition based on these regions, the reading order of the recognized text was not correct.

There were a few regions that eynollah had trouble with. Still, the tool recognized the regions better than tesseract. But it would hardly be usable under these conditions. Is there a way to fix these points?

Best, Aysoltan

vahidrezanezhad commented 2 years ago

Dear Aysoltan,

  1. It seems you are running Eynollah on a CPU. A GPU can accelerate the analysis.
  2. About the reading order: if you provide me with the document and the output XML, I may be able to help you further.

Regards
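As a quick way to check point 1, the following minimal sketch reports whether TensorFlow can see a GPU in the current Python environment. It assumes that the Eynollah installation in question runs its models through TensorFlow in that same environment; the helper name `visible_gpus` is made up for this illustration and is not part of Eynollah.

```python
def visible_gpus():
    """Return the names of GPUs visible to TensorFlow.

    Returns None if TensorFlow cannot be imported in this environment.
    The function name is hypothetical and only used for this sketch.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow would fall back to the CPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("No GPU visible to TensorFlow; inference would run on the CPU.")
```

If the list is empty even though the machine has a GPU, the cause is usually a driver or CUDA setup problem rather than anything in the segmentation step itself.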

Aysoltan commented 2 years ago

Hi Vahid,

If you could please give me your email address, I will send you the files. The workflow I used is:

ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
ocrd-eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default
ocrd-tesserocr-recognize -I OCR-D-SEG -O OCR-D-OCR-TESSEROCR -P model deu

Best, Aysoltan
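To narrow down a reading-order problem before sharing files, it can help to print the order stored in the output XML and compare it with the page image. The sketch below assumes the output is PAGE-XML with a single flat `OrderedGroup` of `RegionRefIndexed` entries; nested groups are not handled. The sample document, the region IDs, and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample document; real output files will contain many more elements.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="region_b"/>
        <RegionRefIndexed index="0" regionRef="region_a"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="region_a"><Coords points="0,0 10,0 10,10 0,10"/></TextRegion>
    <TextRegion id="region_b"><Coords points="0,20 10,20 10,30 0,30"/></TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def reading_order(page_xml_text):
    """Return region IDs sorted by their reading-order index.

    Assumes one flat OrderedGroup; all RegionRefIndexed elements are
    collected and sorted by their integer 'index' attribute.
    """
    root = ET.fromstring(page_xml_text)
    refs = [
        (int(element.get("index")), element.get("regionRef"))
        for element in root.iter()
        if local_name(element.tag) == "RegionRefIndexed"
    ]
    return [region_id for _, region_id in sorted(refs)]


if __name__ == "__main__":
    for position, region_id in enumerate(reading_order(SAMPLE_PAGE_XML)):
        print(position, region_id)
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to `reading_order`. Checking this on both the segmentation output and the recognition output would show whether the order is already wrong after segmentation or only changes later in the workflow.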

cneud commented 2 years ago

Hi @Aysoltan, has this been sufficiently addressed, i.e. can we close here?

Also, @vahidrezanezhad has now provided an extra "light version" (cf. https://github.com/qurator-spk/eynollah/tree/eynollah_light) which, for fairly simple images, provides a significant boost in processing speed without much loss of quality.

Edit: to use the "light version", the trained model from here must be used.

Aysoltan commented 2 years ago

Many thanks!

cneud commented 2 years ago

I take that as a yes ;-)

If problems persist, feel free to re-open or file a new issue.