Closed jknockaert closed 2 years ago
I am afraid that there is no easy solution for this issue.
The Tesseract API only provides functions which use integer values for the number of pixels per inch. So does the Leptonica library which is used by Tesseract. The exact resolution of the input TIFF file is therefore not available.
TIFF files use a fraction value for the resolution, and mapping that to an integer value can always result in rounding errors.
Maybe we can get the output PDF page size without using the resolution if it is possible to get the physical dimensions of the TIFF image.
CC: @DanBloomberg
Yes, leptonica converts the resolution to pixels/inch by casting to int32.
if (resunit == RESUNIT_CENTIMETER) { /* convert to ppi */
*pxres = (l_int32)(2.54 * fxres + 0.5);
*pyres = (l_int32)(2.54 * fyres + 0.5);
} else {
*pxres = (l_int32)fxres;
*pyres = (l_int32)fyres;
}
Would it be preferable to round the pixels/inch, as it does with pixels/cm?
Rounding might be better than just casting to int32
, but would not solve the issue which is reported here where the resolution was 92.202 pixels/inch. The rounded value would be 92, too.
Maybe we can get the output PDF page size without using the resolution if it is possible to get the physical dimensions of the TIFF image.
IIRC, the the physical dimensions of the image is calculated based on DPI and number of pixels.
I did not find a Leptonica function which gets the physical dimensions. The existing functions return width and height in pixels.
Tesseract currently uses pixGetXRes
for the line detection and pixGetYRes
for the image resolution. That's not consistent, and it won't work for images with different x and y resolutions.
While looking at the code for the line detection, I noticed that it could use compiler optimizations like inlining. That's now realized in the main branch.
What i meant was that I don't think the image size in inches/centimeters is stored in the image metadata, and that tools like GIMP and ImageMagick calculate it from number of pixels and DPI.
Unless Leptonica will give us resolution in float, we can't solve the issue in our side.
For leptonica to give the resolution as a float, it would need to capture it that way. This requires a change in pix.h, but I think we can pull it off without too much disruption.
We would need to do the following: (1) store the res data in the pix as a float (2) make and use new accessors to get the float res data for the pdf files (3) change the L_COMP_DATA struct to hold float res. (4) change the functions that generate L_COMP_DATA from the pix, and the function that transfers it to the L_PDF_DATA struct.
and similar changes for PostScript generation in psio2.c
Returning float
resolutions would be a possible solution, but it would also be possible to add Leptonica functions which return pixels per meter (which is used by PNG). 92.202 pixels/inch is 3630 pixels/meter, so there would be no rounding error, and in any case rounding errors for pixels/meter are much smaller than for pixels/inch.
As you can see in pdfio1.c, we have functions that generate pdf files from both image files (or sets of image files) and also pix and pixa. The functions that start with files all go through pix (because you have to read the images into rasters).
So to make this work in general, the float resolution data needs to be in the pix.
Dan, your plan looks good to me :-)
Amit, why does your 25-bit github code look like a face :-) (besides the bilateral symmetry built in so it is really a 15-bit code)
Would you change the datatype of xres
and yres
from l_int32
to l_float32
, or do you plan to add two additional new fields? The main problem is that I see no way how to do such changes without breaking compatibility, so a new major version would be required.
@stweil What I said about having to go through pix to generate the pdf is not true, because for jpeg, jp2k and some png, we can generate the pdf data directly from the compressed file data (i.e., without transcoding through a pix).
But the general solution to the problem does require storing res data as a float whenever a pix is used (as in tiffG4) to make the pdf file.
Dan, :-)
Originally, GitHub automatically generated an upside down version of what you see now. I didn't like it because it looked like a sad face, so I manually rotated it.
But I didn't use Leptonica for the rotation, sorry...
@stweil When not transcoding through a pix, we get the resolution from jpeg files using fgetJpegResolution(), which returns integer values. Ditto for fgetJp2kResolution() and fgetPngResolution(). I'll need to make new functions returning floats, to avoid the risk of breaking applications.
@stweil Does it break compatibility to change them to floats? People should use the accessors, but of course we can't rely on that. So suppose someone does this: int i = pix->xres; (xres is now a float field) The data is coerced into an int. No cast is required and I don't even get a warning. Likewise, pix->xres = i; No problem either.
It's not a problem as long as the code is compiled with new headers, but old binaries (object or library files) will continue to assume an l_int32
value and get a wrong value if they did not use the accessors.
Ideally Leptonica would hide the details of the implementation and force use of accessors.
But of course it's C, so we can't enforce private data.
Is it really possible to get a skew between allheaders.h and the compiled library, which uses allheaders.h for the declarations? It seems to me that the compiled library will always be consistent with the headers and the struct definitions.
If there really is a problem, we can add two floats to the pix for resolution.
Even in C it is possible to hide the implementation details. For example the public Tesseract header files use pointers to struct BlamerBundle
, but the definition is in the private header file blamer.h
which is not part of the public API. Another example is FILE
from stdio.h
.
There will be no skew between allheaders.h
and the Leptonica library from the same version.
But there would be a skew between allheaders.h
and a third-party library which was compiled with an older version of allheaders.h
.
The Tesseract library is an example of a library which directly uses Leptonica internals: it accesses members of L_Compressed_Data
, so any change in that data structure would break the Tesseract code.
So even if we were to add two fields to the pix, instead of change the fields, would that break Tesseract?
Only changes of L_Compressed_Data
would break Tesseract because no internals of other Leptonica data types are used in the Tesseract code.
So any modification of Pix
would not affect Tesseract. For other software which uses Tesseract this can be different.
That's a problem, because we'd need to change the res field of L_Compressed_Data from int32 to float32.
@stweil So am I correct that if we are to fix this problem, by making any alterations in the leptonica structs, we must go from so.5.4.0 to so.6.0.0?
Bringing Jeff into this as well, as that would be his headache in the Debian world. @jbreiden Jeff, if we're going to 6.0.0, perhaps we should additionally make the autotools library naming change that is hanging out there: Issue https://github.com/DanBloomberg/leptonica/issues/598 liblept --> libleptonica Also, am I correct that PR #599 changed library naming, using the so numbers instead of the version number, or is that only with CMake?
@DanielEngberg Referenced from Issue #599 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260079#c3 Did FreeBSD switch to CMake and do they have any outstanding naming issues?
So am I correct that if we are to fix this problem, by making any alterations in the leptonica structs, we must go from so.5.4.0 to so.6.0.0?
I am afraid, yes. As this is a major change, you might consider other breaking changes, too:
l_int32
, ...) by POSIX data types (int32_t
). To ease migration, the old data types could be provided optionally.@DanBloomberg Current pending patch symlinks both naming variants to provide compatibility https://bz-attachments.freebsd.org/attachment.cgi?id=230746 (post-install section) https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260079 for more information
It would be good to avoid the necessity for such patches in the future!
On the built-in data types, I was not aware that there were any breakages resulting from these typedefs. And even if we replaced all the l_* types with POSIX, we'd still need to keep all the typedefs in environ.h to support applications.
Still confused about data hiding in C. Can you give me an example of how I could hide the L_COMP_DATA struct in a private header file, in such a way that an application couldn't access its fields? Note that multiple C files in the library need access to its internal structure, as well as prog/pdfio1_reg.c.
https://abi-laboratory.pro/?view=timeline&l=leptonica
This report is produced by using some open source tools: https://github.com/lvc?tab=repositories
So there were already incompatible changes (new function parameters, changed type of function parameters) in the past without an update of the major version.
If that did not cause trouble for users, changing the resolution from l_int32
to l_float32
without a new major version might work, too.
On the built-in data types, I was not aware that there were any breakages resulting from these typedefs.
The Leptonica integer data types don't cause breakages. But they are simply no longer needed, because there already exist standard data types for signed and unsigned integers of all sizes with different names but exactly the same meaning since many years now.
Removing the Leptonica integer data types would break existing code which uses them. Such code would have to be modified to use the standard POSIX data types.
Thanks all.
I would like to go ahead with the l_int32 --> l_float32 conversion in the resolution fields in pix and L_Compress_Data. However, I need guidance if it will cause issues with tesseract because of the use of L_Compress_Data. I would also like to up-version to so.6.0.0, and am waiting on @jbreiden there.
Incompatible changes to Leptonica have enough cost that they should not be done lightly. It includes recompiling all programs using Leptonica, and for those that need modification (like jbig2enc) getting the modification done, plus making them finicky about which versions of Leptonica they link against.
So I'd like to take a step back here and discuss the problem being solved.
The complaint that started this bug was about an image that is a bit under 8 inches wide. After Leptonica does some rounding (for pixels per inch) the PDF produced is the wrong size, but 0.017 inches. For those metrically inclined, we are talking 0.44 millimeters.
Perhaps this level of precision (or imprecision!) is acceptable? Could Jasper or someone else explain in more detail why this is not good enough?
Message ID: @.***>
Slight correction. The issue came up with a screen resolution of 92 ppi. The rounding causes a maximum error at 92 ppi of 0.5 ppi, which is a little more than plus or minus 0.5%. For a page dimension of 254 mm (10 inches), that max error is about 1.3 mm. Still very small.
On Wed, Feb 16, 2022 at 1:51 PM jbreiden @.***> wrote:
Incompatible changes to Leptonica have enough cost that they should not be done lightly. It includes recompiling all programs using Leptonica, and for those that need modification (like jbig2enc) getting the modification done, plus making them finicky about which versions of Leptonica they link against.
So I'd like to take a step back here and discuss the problem being solved.
The complaint that started this bug was about an image that is a bit under 8 inches wide. After Leptonica does some rounding (for pixels per inch) the PDF produced is the wrong size, but 0.017 inches. For those metrically inclined, we are talking 0.44 millimeters.
Perhaps this level of precision (or imprecision!) is acceptable? Could Jasper or someone else explain in more detail why this is not good enough?
Message ID: @.***>
— Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/3738#issuecomment-1042343906, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD7KMLEWHH5IESO2JMILTRLU3QL4NANCNFSM5NKDA56A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
While we wait for someone to speak to the benefit of the proposed change, let's look into the costs in more detail.
The list of direct dependencies in Debian are quite small. That's good. Each of these programs will have to change their builds if the Leptonica library name changes. And endure the pain of becoming finicky about Leptonica version.
I don't know if any of these programs would need a code change; it probably makes sense to look at their source and see. I am surprised that Debian is not shipping jbig2enc.
So yes, I think the cost are surmountable. But not negligible, and it still makes sense to be thoughtful about whether the incompatible change is worth it.
liblept5 Reverse Depends: ccextractor (0.93+ds2-1) Reverse Depends: k2pdfopt (2.53+ds-1+build1) Reverse Depends: kylin-scanner (1.0.0-1.1+build2) Reverse Depends: leptonica-progs (1.82.0-3) Reverse Depends: liblept5-dbgsym (= 1.82.0-3) Reverse Depends: libleptonica-dev (= 1.82.0-3) Reverse Depends: libtesseract4 (>= 4.1.1-2.1) Reverse Depends: python3-tesserocr (2.5.2-1) Reverse Depends: tesseract-ocr (>= 4.1.1-2.1)
Message ID: @.***>
Thank you, Jeff.
Following your input, let's do this: (1) I'll fix the input resolution in the tiff reader to round instead of truncate. Note that for png, jpeg and bmp, resolution is saved as an int, and we are properly rounding whenever the units are changed (such as pixels/meter to pixels/inch for png, or pixels/cm to pixels/inch in jpeg). (2) We'll not mess with saving resolution in the pix as a float. At 100 ppi, the maximum error from rounding is plus/minus 0.5%. For a 10 inch image, the results in a maximum page error size of 0.05 inch, slightly more than 1 mm. So this is the best we will do on this issue.
Let's defer the library name change from liblept to libleptonica in autotools (and Debian). I still plan to do this, perhaps in a month, and we'll see what chaos results when I put it out as 1.83.0. Question: in src/Makefile.am, will we still need the install-data-hook and uninstall-hook targets?
Changing the .so names to the canonical format and the .so number to 6.0.0 could happen later, as 1.84.0.
Question: in src/Makefile.am, will we still need the install-data-hook and uninstall-hook targets?
I don't know. I've never heard of these targets before this question. I'm pretty sure they aren't explicitly required by Debian, but I have no idea what useful things they might be doing under the hood.
Dan passed this question to me. I hadn't noticed these targets before as they were added after I was last closely involved. They just install (and uninstall) a symlink from libleptonica.so to liblept.so (or similar) though. If you're renaming the library, do you want a symlink going back the other way for compatibility? If not, just delete them.
As the code stands, the install-data-hook
target is broken in any case. If you're doing a parallel build (typical in Gentoo) then this target will usually be executed before the real library is installed. It checks for the library's presence with test -f
so no symlink is created in this case.
@stweil, can we close this issue?
Environment
Current Behavior:
When using an input tiff file to create a text only pdf (with
-c textonly_pdf=1
) the pagewidth of the output pdf is calculated using the pixel width of the input tiff divided by the resolution of the tiff. However, something goes wrong when the tiff resolution is a decimal number (float) rather than an integer (int). Tesseract seems to use floor(resolution) in that case.E.g. an input tiff with width=726 pixels and resolution 92.202 pixels/inch results in an output pdf with page size 568.17 pts (1 pt = 1/72 inch). Tesseract used 92 pixels/inch to get to that result. (I'm limiting the example to page width, but the same happens to page height.)
Expected Behavior and suggested fix:
Tesseract should use the exact resolution of the input tiff file to calculate the output pdf file page size.