tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
61.13k stars 9.39k forks source link

Wrong page size in text only pdf file #3738

Closed jknockaert closed 2 years ago

jknockaert commented 2 years ago

Environment

Current Behavior:

When using an input tiff file to create a text only pdf (with -c textonly_pdf=1) the pagewidth of the output pdf is calculated using the pixel width of the input tiff divided by the resolution of the tiff. However, something goes wrong when the tiff resolution is a decimal number (float) rather than an integer (int). Tesseract seems to use floor(resolution) in that case.

E.g. an input tiff with width=726 pixels and resolution 92.202 pixels/inch results in an output pdf with page size 568.17 pts (1 pt = 1/72 inch). Tesseract used 92 pixels/inch to get to that result. (I'm limiting the example to page width, but the same happens to page height.)

Expected Behavior and suggested fix:

Tesseract should use the exact resolution of the input tiff file to calculate the output pdf file page size.

stweil commented 2 years ago

I am afraid that there is no easy solution for this issue.

The Tesseract API only provides functions which use integer values for the number of pixels per inch. So does the Leptonica library which is used by Tesseract. The exact resolution of the input TIFF file is therefore not available.

TIFF files use a fraction value for the resolution, and mapping that to an integer value can always result in rounding errors.

Maybe we can get the output PDF page size without using the resolution if it is possible to get the physical dimensions of the TIFF image.

amitdo commented 2 years ago

CC: @DanBloomberg

DanBloomberg commented 2 years ago

Yes, leptonica converts the resolution to pixels/inch by casting to int32.

if (resunit == RESUNIT_CENTIMETER) {  /* convert to ppi */
        *pxres = (l_int32)(2.54 * fxres + 0.5);
        *pyres = (l_int32)(2.54 * fyres + 0.5);
    } else {
        *pxres = (l_int32)fxres;
        *pyres = (l_int32)fyres;
    }

Would it be preferable to round the pixels/inch, as it does with pixels/cm?

stweil commented 2 years ago

Rounding might be better than just casting to int32, but would not solve the issue which is reported here where the resolution was 92.202 pixels/inch. The rounded value would be 92, too.

amitdo commented 2 years ago

Maybe we can get the output PDF page size without using the resolution if it is possible to get the physical dimensions of the TIFF image.

IIRC, the the physical dimensions of the image is calculated based on DPI and number of pixels.

stweil commented 2 years ago

I did not find a Leptonica function which gets the physical dimensions. The existing functions return width and height in pixels.

stweil commented 2 years ago

Tesseract currently uses pixGetXRes for the line detection and pixGetYRes for the image resolution. That's not consistent, and it won't work for images with different x and y resolutions.

While looking at the code for the line detection, I noticed that it could use compiler optimizations like inlining. That's now realized in the main branch.

amitdo commented 2 years ago

What i meant was that I don't think the image size in inches/centimeters is stored in the image metadata, and that tools like GIMP and ImageMagick calculate it from number of pixels and DPI.

amitdo commented 2 years ago

Unless Leptonica will give us resolution in float, we can't solve the issue in our side.

DanBloomberg commented 2 years ago

For leptonica to give the resolution as a float, it would need to capture it that way. This requires a change in pix.h, but I think we can pull it off without too much disruption.

We would need to do the following: (1) store the res data in the pix as a float (2) make and use new accessors to get the float res data for the pdf files (3) change the L_COMP_DATA struct to hold float res. (4) change the functions that generate L_COMP_DATA from the pix, and the function that transfers it to the L_PDF_DATA struct.

DanBloomberg commented 2 years ago

and similar changes for PostScript generation in psio2.c

stweil commented 2 years ago

Returning float resolutions would be a possible solution, but it would also be possible to add Leptonica functions which return pixels per meter (which is used by PNG). 92.202 pixels/inch is 3630 pixels/meter, so there would be no rounding error, and in any case rounding errors for pixels/meter are much smaller than for pixels/inch.

DanBloomberg commented 2 years ago

As you can see in pdfio1.c, we have functions that generate pdf files from both image files (or sets of image files) and also pix and pixa. The functions that start with files all go through pix (because you have to read the images into rasters).

So to make this work in general, the float resolution data needs to be in the pix.

amitdo commented 2 years ago

Dan, your plan looks good to me :-)

DanBloomberg commented 2 years ago

Amit, why does your 25-bit github code look like a face :-) (besides the bilateral symmetry built in so it is really a 15-bit code)

stweil commented 2 years ago

Would you change the datatype of xres and yres from l_int32 to l_float32, or do you plan to add two additional new fields? The main problem is that I see no way how to do such changes without breaking compatibility, so a new major version would be required.

DanBloomberg commented 2 years ago

@stweil What I said about having to go through pix to generate the pdf is not true, because for jpeg, jp2k and some png, we can generate the pdf data directly from the compressed file data (i.e., without transcoding through a pix).

But the general solution to the problem does require storing res data as a float whenever a pix is used (as in tiffG4) to make the pdf file.

amitdo commented 2 years ago

Dan, :-)

Originally, GitHub automatically generated an upside down version of what you see now. I didn't like it because it looked like a sad face, so I manually rotated it.

But I didn't use Leptonica for the rotation, sorry...

DanBloomberg commented 2 years ago

@stweil When not transcoding through a pix, we get the resolution from jpeg files using fgetJpegResolution(), which returns integer values. Ditto for fgetJp2kResolution() and fgetPngResolution(). I'll need to make new functions returning floats, to avoid the risk of breaking applications.

DanBloomberg commented 2 years ago

@stweil Does it break compatibility to change them to floats? People should use the accessors, but of course we can't rely on that. So suppose someone does this: int i = pix->xres; (xres is now a float field) The data is coerced into an int. No cast is required and I don't even get a warning. Likewise, pix->xres = i; No problem either.

stweil commented 2 years ago

It's not a problem as long as the code is compiled with new headers, but old binaries (object or library files) will continue to assume an l_int32 value and get a wrong value if they did not use the accessors.

Ideally Leptonica would hide the details of the implementation and force use of accessors.

DanBloomberg commented 2 years ago

But of course it's C, so we can't enforce private data.

Is it really possible to get a skew between allheaders.h and the compiled library, which uses allheaders.h for the declarations? It seems to me that the compiled library will always be consistent with the headers and the struct definitions.

If there really is a problem, we can add two floats to the pix for resolution.

stweil commented 2 years ago

Even in C it is possible to hide the implementation details. For example the public Tesseract header files use pointers to struct BlamerBundle, but the definition is in the private header file blamer.h which is not part of the public API. Another example is FILE from stdio.h.

There will be no skew between allheaders.h and the Leptonica library from the same version.

But there would be a skew between allheaders.h and a third-party library which was compiled with an older version of allheaders.h.

The Tesseract library is an example of a library which directly uses Leptonica internals: it accesses members of L_Compressed_Data, so any change in that data structure would break the Tesseract code.

DanBloomberg commented 2 years ago

So even if we were to add two fields to the pix, instead of change the fields, would that break Tesseract?

stweil commented 2 years ago

Only changes of L_Compressed_Data would break Tesseract because no internals of other Leptonica data types are used in the Tesseract code.

So any modification of Pix would not affect Tesseract. For other software which uses Tesseract this can be different.

DanBloomberg commented 2 years ago

That's a problem, because we'd need to change the res field of L_Compressed_Data from int32 to float32.

@stweil So am I correct that if we are to fix this problem, by making any alterations in the leptonica structs, we must go from so.5.4.0 to so.6.0.0?

Bringing Jeff into this as well, as that would be his headache in the Debian world. @jbreiden Jeff, if we're going to 6.0.0, perhaps we should additionally make the autotools library naming change that is hanging out there: Issue https://github.com/DanBloomberg/leptonica/issues/598 liblept --> libleptonica Also, am I correct that PR #599 changed library naming, using the so numbers instead of the version number, or is that only with CMake?

@DanielEngberg Referenced from Issue #599 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260079#c3 Did FreeBSD switch to CMake and do they have any outstanding naming issues?

stweil commented 2 years ago

So am I correct that if we are to fix this problem, by making any alterations in the leptonica structs, we must go from so.5.4.0 to so.6.0.0?

I am afraid, yes. As this is a major change, you might consider other breaking changes, too:

diizzyy commented 2 years ago

@DanBloomberg Current pending patch symlinks both naming variants to provide compatibility https://bz-attachments.freebsd.org/attachment.cgi?id=230746 (post-install section) https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260079 for more information

DanBloomberg commented 2 years ago

It would be good to avoid the necessity for such patches in the future!

On the built-in data types, I was not aware that there were any breakages resulting from these typedefs. And even if we replaced all the l_* types with POSIX, we'd still need to keep all the typedefs in environ.h to support applications.

Still confused about data hiding in C. Can you give me an example of how I could hide the L_COMP_DATA struct in a private header file, in such a way that an application couldn't access its fields? Note that multiple C files in the library need access to its internal structure, as well as prog/pdfio1_reg.c.

amitdo commented 2 years ago

https://abi-laboratory.pro/?view=timeline&l=leptonica

This report is produced by using some open source tools: https://github.com/lvc?tab=repositories

stweil commented 2 years ago

So there were already incompatible changes (new function parameters, changed type of function parameters) in the past without an update of the major version.

If that did not cause trouble for users, changing the resolution from l_int32 to l_float32 without a new major version might work, too.

stweil commented 2 years ago

On the built-in data types, I was not aware that there were any breakages resulting from these typedefs.

The Leptonica integer data types don't cause breakages. But they are simply no longer needed, because there already exist standard data types for signed and unsigned integers of all sizes with different names but exactly the same meaning since many years now.

Removing the Leptonica integer data types would break existing code which uses them. Such code would have to be modified to use the standard POSIX data types.

DanBloomberg commented 2 years ago

Thanks all.

I would like to go ahead with the l_int32 --> l_float32 conversion in the resolution fields in pix and L_Compress_Data. However, I need guidance if it will cause issues with tesseract because of the use of L_Compress_Data. I would also like to up-version to so.6.0.0, and am waiting on @jbreiden there.

jbreiden commented 2 years ago

Incompatible changes to Leptonica have enough cost that they should not be done lightly. It includes recompiling all programs using Leptonica, and for those that need modification (like jbig2enc) getting the modification done, plus making them finicky about which versions of Leptonica they link against.

So I'd like to take a step back here and discuss the problem being solved.

The complaint that started this bug was about an image that is a bit under 8 inches wide. After Leptonica does some rounding (for pixels per inch) the PDF produced is the wrong size, but 0.017 inches. For those metrically inclined, we are talking 0.44 millimeters.

Perhaps this level of precision (or imprecision!) is acceptable? Could Jasper or someone else explain in more detail why this is not good enough?

Message ID: @.***>

DanBloomberg commented 2 years ago

Slight correction. The issue came up with a screen resolution of 92 ppi. The rounding causes a maximum error at 92 ppi of 0.5 ppi, which is a little more than plus or minus 0.5%. For a page dimension of 254 mm (10 inches), that max error is about 1.3 mm. Still very small.

On Wed, Feb 16, 2022 at 1:51 PM jbreiden @.***> wrote:

Incompatible changes to Leptonica have enough cost that they should not be done lightly. It includes recompiling all programs using Leptonica, and for those that need modification (like jbig2enc) getting the modification done, plus making them finicky about which versions of Leptonica they link against.

So I'd like to take a step back here and discuss the problem being solved.

The complaint that started this bug was about an image that is a bit under 8 inches wide. After Leptonica does some rounding (for pixels per inch) the PDF produced is the wrong size, but 0.017 inches. For those metrically inclined, we are talking 0.44 millimeters.

Perhaps this level of precision (or imprecision!) is acceptable? Could Jasper or someone else explain in more detail why this is not good enough?

Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/3738#issuecomment-1042343906, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7KMLEWHH5IESO2JMILTRLU3QL4NANCNFSM5NKDA56A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

jbreiden commented 2 years ago

While we wait for someone to speak to the benefit of the proposed change, let's look into the costs in more detail.

The list of direct dependencies in Debian are quite small. That's good. Each of these programs will have to change their builds if the Leptonica library name changes. And endure the pain of becoming finicky about Leptonica version.

I don't know if any of these programs would need a code change; it probably makes sense to look at their source and see. I am surprised that Debian is not shipping jbig2enc.

So yes, I think the cost are surmountable. But not negligible, and it still makes sense to be thoughtful about whether the incompatible change is worth it.

liblept5 Reverse Depends: ccextractor (0.93+ds2-1) Reverse Depends: k2pdfopt (2.53+ds-1+build1) Reverse Depends: kylin-scanner (1.0.0-1.1+build2) Reverse Depends: leptonica-progs (1.82.0-3) Reverse Depends: liblept5-dbgsym (= 1.82.0-3) Reverse Depends: libleptonica-dev (= 1.82.0-3) Reverse Depends: libtesseract4 (>= 4.1.1-2.1) Reverse Depends: python3-tesserocr (2.5.2-1) Reverse Depends: tesseract-ocr (>= 4.1.1-2.1)

Message ID: @.***>

DanBloomberg commented 2 years ago

Thank you, Jeff.

Following your input, let's do this: (1) I'll fix the input resolution in the tiff reader to round instead of truncate. Note that for png, jpeg and bmp, resolution is saved as an int, and we are properly rounding whenever the units are changed (such as pixels/meter to pixels/inch for png, or pixels/cm to pixels/inch in jpeg). (2) We'll not mess with saving resolution in the pix as a float. At 100 ppi, the maximum error from rounding is plus/minus 0.5%. For a 10 inch image, the results in a maximum page error size of 0.05 inch, slightly more than 1 mm. So this is the best we will do on this issue.

Let's defer the library name change from liblept to libleptonica in autotools (and Debian). I still plan to do this, perhaps in a month, and we'll see what chaos results when I put it out as 1.83.0. Question: in src/Makefile.am, will we still need the install-data-hook and uninstall-hook targets?

Changing the .so names to the canonical format and the .so number to 6.0.0 could happen later, as 1.84.0.

jbreiden commented 2 years ago

Question: in src/Makefile.am, will we still need the install-data-hook and uninstall-hook targets?

I don't know. I've never heard of these targets before this question. I'm pretty sure they aren't explicitly required by Debian, but I have no idea what useful things they might be doing under the hood.

chewi commented 2 years ago

Dan passed this question to me. I hadn't noticed these targets before as they were added after I was last closely involved. They just install (and uninstall) a symlink from libleptonica.so to liblept.so (or similar) though. If you're renaming the library, do you want a symlink going back the other way for compatibility? If not, just delete them.

As the code stands, the install-data-hook target is broken in any case. If you're doing a parallel build (typical in Gentoo) then this target will usually be executed before the real library is installed. It checks for the library's presence with test -f so no symlink is created in this case.

amitdo commented 2 years ago

@stweil, can we close this issue?