Closed by timvisee 4 years ago
Based on the discussion in https://github.com/tesseract-ocr/tesseract/issues/1702, I added a new `set_source_resolution` method. Could you give that a try?
Thank you, didn't notice this option could be a solution.
I'm wondering, would this override the resolution when the image resolution is already known? I can imagine that setting the resolution to 70 for all images could cause undesirable behavior, since the resolution might be known for some of them.
Yeah, good point. I added `get_source_y_resolution` and `set_fallback_source_resolution` methods. Could you give those a try?
Using `set_fallback_source_resolution` did the trick. No warnings show up anymore.
Thanks for the rapid addition!
I just noticed an interesting edge case. It appears that some images have a DPI of 1 defined (and yes, that's incorrect). `tesseract` produces a warning for this as well:
Warning: Invalid resolution 1 dpi. Using 70 instead.
It's interesting, because this isn't covered by the `set_fallback_source_resolution` function.
Don't worry, it's not much of a problem. I'm just posting this so that others experiencing the same thing can see it isn't currently solved. I might open an issue for this on `tesseract` in the future.
In case you're wondering: I'm scanning all images, stickers, videos and such from Telegram groups (for smart spam prevention). As you can probably imagine, I'm receiving a wide spectrum of images, image types, sizes and formats. That's why I'm seeing these weird edge cases.
That's interesting. I wonder what range of DPI values tesseract considers invalid. If it can't work with 1 DPI images, then it makes sense to handle that case in the fallback method.
I didn't notice the fallback method is only part of this library, and thought it was provided by tesseract. I'll try and search for the range and update the function.
The original warning appears to come from the following section, which changes the DPI if the detected DPI is outside a specified range: https://github.com/tesseract-ocr/tesseract/blob/247cd0edc44e0a4b6cf46f1faccdb5d1557ed1f0/src/api/baseapi.cpp#L2017-L2031
The allowed DPI range is defined here: https://github.com/tesseract-ocr/tesseract/blob/247cd0edc44e0a4b6cf46f1faccdb5d1557ed1f0/src/ccstruct/publictypes.h#L33-L39
Note that it only automatically changes the DPI to the lowest value in the allowed range if the user didn't specify a DPI themselves. It does not change the DPI if the user explicitly set it to something outside the allowed range.
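To make that behavior concrete, here is a rough sketch of the decision as I understand it from the description above. It is plain Rust and not code from this crate or from tesseract. The names `effective_dpi` and `needs_fallback` are hypothetical. The lower bound of 70 comes from the warning quoted earlier; the upper bound of 2400 is my assumption and should be checked against the linked `publictypes.h`.

```rust
/// Lowest credible DPI. Matches the "Using 70 instead" warning in this thread.
const MIN_CREDIBLE_RESOLUTION: i32 = 70;
/// Highest credible DPI. ASSUMPTION for illustration; verify in publictypes.h.
const MAX_CREDIBLE_RESOLUTION: i32 = 2400;

/// Hypothetical helper: true if a detected DPI falls outside the credible
/// range, which is when a crate-side fallback would need to kick in.
fn needs_fallback(detected_dpi: i32) -> bool {
    !(MIN_CREDIBLE_RESOLUTION..=MAX_CREDIBLE_RESOLUTION).contains(&detected_dpi)
}

/// Hypothetical model of the DPI that ends up being used.
/// - A user-specified DPI is always used as-is, even outside the range.
/// - Otherwise a credible detected DPI is kept.
/// - Otherwise the lowest credible value is substituted.
fn effective_dpi(user_dpi: Option<i32>, detected_dpi: i32) -> i32 {
    match user_dpi {
        Some(dpi) => dpi,
        None if needs_fallback(detected_dpi) => MIN_CREDIBLE_RESOLUTION,
        None => detected_dpi,
    }
}
```

With this model, an image reporting a DPI of 1 and no user-specified value resolves to 70, which is the case that triggers the warning. Checking `needs_fallback` on the detected value, rather than only checking for a missing resolution, would let the fallback cover it.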
I'll look into improving this crate based on these findings now.
When using this crate, I occasionally receive a warning in `stderr` when opening/reading an image. I assume this is produced by the leptess/tesseract library. This is what it looks like:
It does not look like it is possible to disable this behavior through the current API. Are there any plans to implement a toggle for this?