mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

issue with pages without text #247

Closed: pverkind closed this issue 3 years ago

pverkind commented 3 years ago

Hi, I ran into a problem trying to OCR PDF files that contain empty pages (kraken version: 3.0.0.0b21.dev6). I extracted the page images in PNG format from the PDF using the poppler utility pdftoppm, which creates one image per page.
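
For reference, a minimal sketch of that extraction step (the input path, output prefix, and 300 dpi resolution are placeholders, not the exact values we used):

# Sketch only: shells out to poppler's pdftoppm, one PNG per page.
# Input path, output prefix and resolution are placeholders.
import subprocess

pdf_path = "book.pdf"
prefix = "book_page"   # pdftoppm writes book_page-01.png, book_page-02.png, ...
subprocess.run(["pdftoppm", "-png", "-r", "300", pdf_path, prefix], check=True)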

When I run kraken on the page images that do contain text, it works without problems.

kraken -i <img_fp> <dest_fp> binarize segment -d horizontal-rl -p 20 20 ocr -m <pth_to_model> --pagexml

However, for pages that don't contain any text, kraken gets stuck in the segmentation step:

[screenshot: kraken hanging during the segmentation step]

I have created a crude workaround for this in my Python script, but I thought I should flag this issue here...
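
For anyone hitting the same issue, here is a rough sketch of the kind of check I mean (not my exact code; the 0.1% ink-fraction threshold is an arbitrary assumption): binarize the page with kraken's nlbin and skip it if it contains almost no black pixels.

# Rough sketch of an "is this page empty?" check; the threshold is arbitrary.
from PIL import Image
from kraken import binarization

def has_text(img_path, min_ink_fraction=0.001):
    bw = binarization.nlbin(Image.open(img_path))   # kraken's built-in binarizer
    hist = bw.convert("1").histogram()              # index 0 = black pixels, 255 = white
    return hist[0] / (bw.width * bw.height) > min_ink_fraction

Only pages for which this returns True get passed to the kraken command above.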

(I have attached a PNG of such an empty page: 0677IbnMuyassar AkhbarMisr_007)

mittagessen commented 3 years ago

The segmenter should just skip empty pages, but yours aren't really empty: they contain pixel values other than 0/255, probably because the pages were dumped as JPEGs into the PDF. In fact, on the sample page you attached it is just fairly slow because it finds ~8k lines in there. The new trainable segmenter (-bl option) doesn't have this problem but requires transcription models trained for it.
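
You can check this quickly yourself: a truly empty page should contain only pure black/white values, while a JPEG-dumped "blank" page shows lots of intermediate grey levels. A sketch (the file name is just a placeholder):

# Quick check for JPEG artefacts on a supposedly blank page.
from PIL import Image

im = Image.open("page.png").convert("L")   # placeholder path
levels = [v for v, n in enumerate(im.histogram()) if n > 0]
print(len(levels), "distinct grey levels, min", min(levels), "max", max(levels))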

BTW: You can just feed PDFs directly into kraken now. No need to split into separate image files:

kraken -f pdf -I xyz.pdf ....

pverkind commented 3 years ago

Thanks for your quick reply!

The reason we don't use the PDF input is that we want to keep both the page image that was transcribed and the PageXML/ALTO transcription. The ideal solution for us would be an option in kraken to save the image files kraken extracts from the PDF under the same filename (but with a different extension) as the output XML file.

Would it be possible to add that functionality? I think it would be useful for many users who want to map the transcription back to the original images for post-correction and other purposes.
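
Until such an option exists, our workaround stays outside kraken: we extract the pages ourselves (see the pdftoppm sketch above) and call kraken once per image, so image and XML always share a basename. Roughly along these lines (the glob pattern and model path are placeholders):

# Sketch: run kraken per page image so the XML output shares the image's basename.
import glob
import subprocess

for img in sorted(glob.glob("book_page-*.png")):
    xml = img.rsplit(".", 1)[0] + ".xml"
    subprocess.run(["kraken", "-i", img, xml,
                    "binarize",
                    "segment", "-d", "horizontal-rl", "-p", "20", "20",
                    "ocr", "-m", "<pth_to_model>", "--pagexml"],
                   check=True)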

brobertson commented 3 years ago

I think this bug is related to my open bug #183, which I proposed to solve by putting a limit on the number of connected components allowed on a page. In the meantime, I use GNU parallel and have each page time out if it doesn't complete within a reasonable time.
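
The same per-page timeout idea can also be done without GNU parallel, e.g. with subprocess timeouts in Python; a sketch (the 300-second limit is arbitrary):

# Sketch: per-page timeout via subprocess instead of GNU parallel.
import subprocess

def ocr_page(img, xml, model, limit=300):
    cmd = ["kraken", "-i", img, xml, "binarize",
           "segment", "-d", "horizontal-rl", "-p", "20", "20",
           "ocr", "-m", model, "--pagexml"]
    try:
        subprocess.run(cmd, check=True, timeout=limit)
        return True
    except subprocess.TimeoutExpired:
        return False   # page skipped: segmentation did not finish in time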