houqp / leptess

Productive and safe Rust binding for leptonica and tesseract
https://houqp.github.io/leptess/leptess/index.html
MIT License
258 stars, 28 forks

Invalid resolution 0 dpi warning in stderr #6

Closed by timvisee 4 years ago

timvisee commented 4 years ago

When using this crate, I occasionally receive a warning in stderr when opening/reading an image. I assume this is produced by the leptess/tesseract library.

This is what it looks like:

Warning: Invalid resolution 0 dpi. Using 70 instead.

It does not look like it is possible to disable this behavior through the current API. Are there any plans to implement a toggle for this?

houqp commented 4 years ago

Based on the discussion in https://github.com/tesseract-ocr/tesseract/issues/1702, I added a new set_source_resolution method. Could you give that a try?

timvisee commented 4 years ago

Thank you, didn't notice this option could be a solution.

I'm wondering, would this override the resolution even when the image resolution is known? I can imagine that setting the resolution to 70 for all images would cause undesirable behavior, since the resolution might be known for some of them.

houqp commented 4 years ago

Yeah, good point. I added get_source_y_resolution and set_fallback_source_resolution methods. Could you give those a try?
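To illustrate the difference between overriding and falling back, here is a minimal sketch in plain Rust. It does not call leptess; `resolve_with_override` and `resolve_with_fallback` are hypothetical helpers, and the rule "0 dpi means unknown" is an assumption based on the warning text above, not the crate's actual implementation.

```rust
/// Unconditional override: the resolution stored in the image is always
/// replaced, which is the concern raised about set_source_resolution.
fn resolve_with_override(_detected_dpi: i32, forced_dpi: i32) -> i32 {
    forced_dpi
}

/// Fallback (hypothetical model): keep the resolution stored in the image
/// and use `fallback_dpi` only when the image reports none (assumed: 0 dpi).
fn resolve_with_fallback(detected_dpi: i32, fallback_dpi: i32) -> i32 {
    if detected_dpi == 0 {
        fallback_dpi
    } else {
        detected_dpi
    }
}
```

With these helpers, an image that reports 300 dpi keeps 300 under the fallback but would be forced to 70 under the override.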

timvisee commented 4 years ago

Using set_fallback_source_resolution did the trick. No warnings show up anymore.

Thanks for the rapid addition!

timvisee commented 4 years ago

I just noticed an interesting edge case. It appears that some images have a DPI of 1 defined (and yes, that's incorrect). Tesseract produces a warning for this as well:

Warning: Invalid resolution 1 dpi. Using 70 instead.

It's interesting, because this isn't covered by the set_fallback_source_resolution function. Don't worry, it's not much of a problem. I'm just posting this for others to see that it isn't currently solved, in case they're experiencing the same thing. I might open an issue for this on tesseract in the future.

In case you're wondering, I'm scanning all images, stickers, videos and such from Telegram groups (for smart spam prevention). As you can probably imagine, I'm receiving a wide spectrum of images, image types, sizes and formats. That's why I'm seeing these weird edge cases.

houqp commented 4 years ago

That's interesting. I wonder what range of DPI values tesseract considers invalid. If it can't work with 1 dpi images, then it makes sense to add that case to the fallback method.

timvisee commented 4 years ago

> That's interesting, I wonder what's the range of dpi that tesseract would consider invalid. If it can't work with 1 dpi images, then it makes sense to add it to the fallback method.

I didn't notice the fallback method is only part of this library, and thought it was provided by tesseract. I'll try and search for the range and update the function.

timvisee commented 4 years ago

The original warning appears to come from the following section, which changes the DPI if the detected DPI is outside a specified range: https://github.com/tesseract-ocr/tesseract/blob/247cd0edc44e0a4b6cf46f1faccdb5d1557ed1f0/src/api/baseapi.cpp#L2017-L2031

The allowed DPI range is defined here: https://github.com/tesseract-ocr/tesseract/blob/247cd0edc44e0a4b6cf46f1faccdb5d1557ed1f0/src/ccstruct/publictypes.h#L33-L39

Note that it only automatically changes the DPI to the lowest value in the allowed range if the user didn't specify a DPI. It does not change the DPI if the user explicitly set it to something outside the allowed range.
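The behavior described above can be modeled roughly as follows. This is a simplified sketch in Rust, not a translation of the linked C++ code: `effective_dpi` is a hypothetical name, the lower bound of 70 comes from the warning text, and the upper bound of 2400 is an assumption that should be checked against the linked publictypes.h.

```rust
// Lower bound, as printed in the warning ("Using 70 instead").
const MIN_CREDIBLE_RESOLUTION: i32 = 70;
// Upper bound: an assumption for illustration, verify against publictypes.h.
const MAX_CREDIBLE_RESOLUTION: i32 = 2400;

/// Hypothetical model of which DPI ends up being used.
/// `user_dpi` is `Some` when the caller explicitly set a resolution.
fn effective_dpi(user_dpi: Option<i32>, detected_dpi: i32) -> i32 {
    match user_dpi {
        // An explicit user value is used as-is, even outside the range.
        Some(dpi) => dpi,
        // No user value: keep the detected DPI if it is credible...
        None if (MIN_CREDIBLE_RESOLUTION..=MAX_CREDIBLE_RESOLUTION)
            .contains(&detected_dpi) => detected_dpi,
        // ...otherwise fall back to the lowest credible value.
        None => MIN_CREDIBLE_RESOLUTION,
    }
}
```

Under this model both 0 dpi and 1 dpi end up at 70 when no user value is given, which matches the two warnings quoted earlier in this thread.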

I'll look into improving this crate for these findings now.
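One possible shape for that improvement, sketched in plain Rust under the same assumptions (a credible range of 70 to 2400, where the upper bound is my assumption and the helper names are hypothetical): apply the fallback whenever the detected DPI is outside the credible range, not only when it is 0.

```rust
// Assumed credible range; 70 comes from the warning text, 2400 is an
// illustrative assumption.
const MIN_CREDIBLE_DPI: i32 = 70;
const MAX_CREDIBLE_DPI: i32 = 2400;

/// True when the detected DPI would trigger the "Invalid resolution" warning
/// under the assumed range.
fn needs_fallback(detected_dpi: i32) -> bool {
    !(MIN_CREDIBLE_DPI..=MAX_CREDIBLE_DPI).contains(&detected_dpi)
}

/// Hypothetical range-aware fallback: credible values are kept,
/// everything else (0, 1, absurdly large values) is replaced.
fn resolve_with_range_fallback(detected_dpi: i32, fallback_dpi: i32) -> i32 {
    if needs_fallback(detected_dpi) {
        fallback_dpi
    } else {
        detected_dpi
    }
}
```

This would cover the 1 dpi edge case while still leaving images with a known, plausible resolution untouched.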