sungaila / PDFtoImage

A .NET library to render PDF files into images.
https://www.sungaila.de/PDFtoImage/
MIT License

When using bounds, the output gets scaled to the same size as the input document #82

Closed osjoberg closed 3 months ago

osjoberg commented 3 months ago

PDFtoImage version

4.0.1

OS

Windows

OS version

Windows 11

Architecture

x64

Framework

.NET (Core)

App framework

No response

Detailed bug report

I am extracting regions from PDF files using bounds. However, the output seems to be scaled to the size of the input document rather than to the size of the bounds, which is what I believe would be the correct behaviour.

Bounds.pdf is a 2:1 aspect ratio (2000pt x 1000pt) document containing a red square with a black border of size 500pt at (750, 250).

Sample repro:

using var input = File.OpenRead("Bounds.pdf"); 

var bounds = new RectangleF(750, 250, 500, 500); 

var dpi = 72; // For clarifying the problem.

using var image = PDFtoImage.Conversion.ToImage(input, true, null, 0, new RenderOptions { Dpi = dpi, Bounds = bounds });

// BoundsCrop.png is cropped correctly but scaled to the size of the input document.
using var output = File.OpenWrite("BoundsCrop.png"); 
image.Encode(output, SKEncodedImageFormat.Png, 100);

// Expected: Width: 500, Height: 500
// Actual: Width: 2000, Height: 1000
Console.WriteLine($"Width: {image.Width}, Height: {image.Height}");

Bounds.pdf BoundsCrop.png
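
The expected and actual sizes in the repro follow from the usual points-to-pixels conversion (1 pt = 1/72 inch). The snippet below is only a minimal sketch of that arithmetic; `PointsToPixels` is a hypothetical helper for illustration and not part of PDFtoImage.

```csharp
using System;

// Hypothetical helper (not part of PDFtoImage): converts a length in PDF points
// (1/72 inch) to whole pixels at the given DPI, truncating any fraction.
static int PointsToPixels(double points, int dpi) => (int)(points * dpi / 72.0);

const int dpi = 72;

// Full page of Bounds.pdf (2000pt x 1000pt): this is the size actually returned.
Console.WriteLine($"Page:   {PointsToPixels(2000, dpi)} x {PointsToPixels(1000, dpi)}");

// Requested bounds (500pt x 500pt): this is the size the reporter expected.
Console.WriteLine($"Bounds: {PointsToPixels(500, dpi)} x {PointsToPixels(500, dpi)}");
```

At 72 DPI one point maps to one pixel, which is why the repro uses that value to make the mismatch obvious.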

sungaila commented 3 months ago

Hi @osjoberg,

thanks for your feedback! You are correct about the DPI always relating to the original document and I agree that it is highly confusing.

You could work around this issue by setting width and height to 500 and recalculating the bounds yourself. But even that is not so easy, because the bounds width and height relate to the document while the X and Y coordinates relate to the image output.

Most likely I'll introduce a new boolean to switch between the current and expected behavior. Will post in this issue again once there is something to test.

osjoberg commented 3 months ago

I am looking forward to testing it once you have added the flag for the expected behaviour.

Thank you for the reply, @sungaila!

osjoberg commented 3 months ago

I tried to work around this issue, but it is difficult. The bounds can have floating-point coordinates, while the output image size required for the workaround must be an integer, so you cannot simply round the output image size.

If you round the output image size, the output aspect ratio will be slightly skewed, causing artefacts. I tried to take the rounding of the output image size into consideration and recalculated the bounds to compensate for it, but I have not been able to make it perfect so far.
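
To make the skew concrete, here is a minimal sketch with illustrative fractional bounds (the same values appear in a repro further down this thread). It only shows the arithmetic of rounding both dimensions independently; it is not how PDFtoImage computes anything.

```csharp
using System;

// Illustrative fractional bounds size in points (1pt = 1px at 72 DPI).
double width = 612.4, height = 402.2;

// Integer output size obtained by rounding each dimension on its own.
int pixelWidth = (int)Math.Round(width);
int pixelHeight = (int)Math.Round(height);

double exactRatio = width / height;
double roundedRatio = (double)pixelWidth / pixelHeight;

// Relative difference between the rounded and the exact aspect ratio, in percent.
double skewPercent = (roundedRatio / exactRatio - 1.0) * 100.0;

Console.WriteLine($"Exact ratio:   {exactRatio:F5}");
Console.WriteLine($"Rounded ratio: {roundedRatio:F5}");
Console.WriteLine($"Skew:          {skewPercent:F4} %");
```

The skew is small, but since width and height are rounded by different fractions, the content is stretched unevenly unless the bounds are adjusted to match the rounded size.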

My gut feeling is that the current behaviour is so broken that very few are actually using this feature. Fixing this without introducing a compatibility flag might even be ok.

Please let me know if I can help you in any way. Testing or maybe a pull request?

sungaila commented 3 months ago

My gut feeling is that the current behaviour is so broken that very few are actually using this feature.

The current implementation allows the output to be stretched (keeping the original canvas while adding borders or rendering a subset of the PDF). Your use case is about rendering a subset of the PDF independent of the original canvas.

Please let me know if I can help you in any way. Testing or maybe a pull request?

I haven't had the time to dive into this subject yet. There are a ton of unit tests and side effects (width, height, dpi, withAspectRatio, rotation) that I have to wrap my mind around to avoid breaking existing projects and expected behavior.

osjoberg commented 3 months ago

Sorry, I did not realize the canvas border aspect of the feature!

sungaila commented 3 months ago

@osjoberg Could you please give the following test binaries a try: PDFtoImage.4.0.1-debug.zip

You can activate the new behavior by setting the DpiRelativeToBounds option:

using var image = PDFtoImage.Conversion.ToImage(input, true, null, 0, new RenderOptions { Dpi = dpi, Bounds = bounds, DpiRelativeToBounds = true });

I also deployed this version to the WebConverter so it can be tested there as well (might need to refresh/clear the cache first): https://www.sungaila.de/PDFtoImage/

osjoberg commented 3 months ago

Thank you! I did a quick test just to assess if it works with the repro above, looking good!

I will re-run our tests with this release tomorrow with the ~60 test PDF-files that I clip into and report back to you.

Cheers!

sungaila commented 3 months ago

@osjoberg Glad to hear it! I will wait for your test results before releasing v4.0.2.

In the meantime I have to add a few unit tests for this new option.

osjoberg commented 3 months ago

Happy to report that I have tested it today and everything is working excellently. Thank you!

osjoberg commented 3 months ago

Actually found one issue with the bounds getting rounded down, getting back to you later with a repro.

osjoberg commented 3 months ago

using var input = File.OpenRead("05.pdf");

var bounds = new RectangleF(822.5f, 1296.6f, 612.4f, 402.2f);

// Expecting to get a ~1 pixel wide background outside of the blue border with the line below uncommented.
// bounds = new RectangleF(bounds.Left - 1, bounds.Top - 1, bounds.Width + 2, bounds.Height + 2);

var dpi = 72; // For clarifying the problem.

using var image = PDFtoImage.Conversion.ToImage(input, false, null, 0, new RenderOptions { Dpi = dpi, Bounds = bounds, DpiRelativeToBounds = true });

// BoundsCrop.png is cropped, but the borders are unevenly thick; it seems like at least one pixel is lost to the right and bottom of the rectangle.
using var output = File.OpenWrite("BoundsCrop.png"); 
image.Encode(output, SKEncodedImageFormat.Png, 100);
output.Close();

// Expected: Width: 613, Height: 403 (bounds has width: 612.4, height: 402.2)
// Actual: Width: 612, Height: 402
Console.WriteLine($"Width: {image.Width}, Height: {image.Height}");

05.pdf

sungaila commented 3 months ago

Unfortunately that is something I cannot fix. Pdfium is using float for measurement (not pixels) and once you render into a bitmap everything has to be rounded to int (now it's pixels). There is no possibility to avoid rounding here.

Your example 05.pdf has a rectangle that does not snap to pixels at 72 DPI. With anti-aliasing the blue outline is 41px thick (blurry because the border does not snap), without anti-aliasing it's 40px (bottom) and 41px (rest).

Try disabling the anti-aliasing:

using var input = File.OpenRead("05.pdf"); 

var dpi = 72; // For clarifying the problem.

using var image = PDFtoImage.Conversion.ToImage(input, false, null, 0, new RenderOptions { Dpi = dpi, DpiRelativeToBounds = true, AntiAliasing = PdfAntiAliasing.None });

You will receive this: DocumentFull

Now keep the anti-aliasing disabled and use the correct bounds:

using var input = File.OpenRead("05.pdf"); 

var bounds = new RectangleF(822f, 1296f, 614f, 403f);

var dpi = 72; // For clarifying the problem.

using var image = PDFtoImage.Conversion.ToImage(input, false, null, 0, new RenderOptions { Dpi = dpi, Bounds = bounds, DpiRelativeToBounds = true, AntiAliasing = PdfAntiAliasing.None });

You get this: DocumentCut

And at last add a 1px border:

using var input = File.OpenRead("05.pdf"); 

var bounds = new RectangleF(822f, 1296f, 614f, 403f);

// Expecting to get a ~1 pixel wide background outside of the blue border with the line below uncommented.
bounds = new RectangleF(bounds.Left - 1, bounds.Top - 1, bounds.Width + 2, bounds.Height + 2);

var dpi = 72; // For clarifying the problem.

using var image = PDFtoImage.Conversion.ToImage(input, false, null, 0, new RenderOptions { Dpi = dpi, Bounds = bounds, DpiRelativeToBounds = true, AntiAliasing = PdfAntiAliasing.None });

You get this: DocumentCutBorder

osjoberg commented 3 months ago

I think I understand the aliasing aspect, which is not really a problem for me. If the bounds have the size 612.4f x 402.2f, could you not round the resulting image size up to 613 x 403 instead of truncating it (which loses pixel data regardless of whether you are using anti-aliasing or not)?

(assuming 72 dpi for clarity)
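
The difference between truncating and rounding up can be sketched as follows. This is plain arithmetic on the numbers from the repro, not PDFtoImage code.

```csharp
using System;

double dpi = 72;
double boundsWidth = 612.4, boundsHeight = 402.2; // bounds size in points

// Pixels per point at the chosen DPI.
double scale = dpi / 72.0;

// Truncating drops the fractional part, so the last partial pixel is lost.
int truncatedWidth = (int)Math.Floor(boundsWidth * scale);
int truncatedHeight = (int)Math.Floor(boundsHeight * scale);

// Rounding up keeps a pixel for the fractional part.
int ceiledWidth = (int)Math.Ceiling(boundsWidth * scale);
int ceiledHeight = (int)Math.Ceiling(boundsHeight * scale);

Console.WriteLine($"Truncated:  {truncatedWidth} x {truncatedHeight}");
Console.WriteLine($"Rounded up: {ceiledWidth} x {ceiledHeight}");
```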

sungaila commented 3 months ago

X and Y are rounded down while Width and Height are rounded up. The width and height of the output bitmap are always rounded down. That's something that I do not wish to change as it breaks all existing assumptions (esp. needed for tiled rendering).

But since DpiRelativeToBounds is an entirely new option, I could change it just for this one use case.

Besides, your calculated bounds are about 1px too small in width and height (with and without anti-aliasing).

sungaila commented 3 months ago

I've created new binaries to test. In this one the bitmap width and height will always be rounded up if DpiRelativeToBounds is activated. Please give it a try: PDFtoImage.4.0.2-debug.zip

osjoberg commented 3 months ago

Thank you, I will give it a try!

osjoberg commented 3 months ago

Besides, your calculated bounds are about 1px too small in width and height (with and without anti-aliasing).

How can you see that? I got the coordinates by selecting the rectangle in Illustrator, which I assumed (perhaps erroneously) were 1:1 with the exact coordinates stored in the file.

sungaila commented 3 months ago

Besides, your calculated bounds are about 1px too small in width and height (with and without anti-aliasing).

How can you see that? My source for the coordinates are from selecting the rectangle in Illustrator which I assumed (perhaps erroneously) was 1:1 to the exact coordinates in the stored file.

I'm sorry that I didn't justify this clearly. I mean if you render your PDF with pdfium or Ghostscript, the resulting rectangle will be bigger than your given bounds (with and without anti-aliasing).

Using Illustrator coordinates 1:1 might not work here because we are at the mercy of pdfium on how the rectangle will be rendered. The rectangle could be 612.4x402.2 in the document but pdfium might decide to round its size up.

If you add up all the rounding done to get pixel values, all bets are off when using coordinates with a fractional part.

osjoberg commented 3 months ago

I find it odd that Pdfium/Ghostscript could not get this right; it could also be that Illustrator is not presenting the floats correctly. I tried reading the PDF in a text editor to find the coordinates, but I was not able to see the actual rectangle coordinates.

You have probably already thought of this, but one source of error could be that when you round down/truncate X and Y, the corresponding amounts need to be added back to the width and height before they are rounded up. If you miss that, you would probably get this effect as well.
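
The suggestion above can be sketched like this. Both helpers are hypothetical and exist only for illustration; they are not taken from PDFtoImage. The second set of bounds is made up to show a case where forgetting the remainder actually cuts content off.

```csharp
using System;

// Hypothetical helper: rounds X/Y down and adds the cut-off fraction back to
// width/height before rounding those up, so the result covers the whole input.
static (int X, int Y, int Width, int Height) SnapOutward(double x, double y, double width, double height)
{
    int left = (int)Math.Floor(x);
    int top = (int)Math.Floor(y);
    int w = (int)Math.Ceiling(width + (x - left));
    int h = (int)Math.Ceiling(height + (y - top));
    return (left, top, w, h);
}

// Hypothetical helper: same rounding, but the remainder from X/Y is ignored.
static (int X, int Y, int Width, int Height) SnapWithoutRemainder(double x, double y, double width, double height)
    => ((int)Math.Floor(x), (int)Math.Floor(y), (int)Math.Ceiling(width), (int)Math.Ceiling(height));

// Bounds from the 05.pdf repro: both variants happen to agree here.
Console.WriteLine(SnapOutward(822.5, 1296.6, 612.4, 402.2));
Console.WriteLine(SnapWithoutRemainder(822.5, 1296.6, 612.4, 402.2));

// Made-up bounds whose right edge is at 15.3: without the remainder the
// snapped rectangle ends at 15 and misses the last 0.3 units.
Console.WriteLine(SnapOutward(10.7, 0, 4.6, 1));
Console.WriteLine(SnapWithoutRemainder(10.7, 0, 4.6, 1));
```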

sungaila commented 3 months ago

It looks like the exact coordinates are 822.5 1296.602 612.398 402.199 but that shouldn't make a difference.

q 822 569 613 403 rectclip
1 0 0 -1 0 2268 cm q
1 g
822.5 1367.484 161.727 113.145 re f
822.5 1296.602 612.398 402.199 re f
Q q
822.5 1296.602 612.398 402.199 re W n
q
822 1296 613 403 re W n

You have probably already thought of this, but one source of error could be that when you round down/truncate X and Y, the corresponding amounts need to be added back to the width and height before they are rounded up. If you miss that, you would probably get this effect as well.

You are correct: I forgot to add back the remainders when rounding up and down. I've created a new build with the corrected behavior: PDFtoImage.4.0.2-debug.zip

osjoberg commented 3 months ago

Thank you for all of your work! However, when I run the 4.0.2-debug version, the aspect ratio seems a bit off.

Repro:

using var input = File.OpenRead("01.pdf"); 

var bounds = new RectangleF(41.7255f, 186f, 517.2745f, 74f);

var dpi = 720; // High DPI to show the problem.

using var image = PDFtoImage.Conversion.ToImage(input, false, null, 0, new RenderOptions { Dpi = dpi, Bounds = bounds, DpiRelativeToBounds = true });

using var output = File.OpenWrite("01bounds.png"); 
image.Encode(output, SKEncodedImageFormat.Png, 100);
output.Close();

Console.WriteLine($"Width: {image.Width}, Height: {image.Height}");

01.pdf 01bounds.4.0.1.png 01bounds.4.0,2.png

If I compare the two outputs visually, it seems like the output is somewhat compressed along the Y-axis in the 4.0.2-debug version, however it is not clear-cut.

If I zoom in and draw a rectangle around the first E in a drawing program, I get the following sizes (±2 pixels, because I cannot capture the anti-aliased pixels accurately):

4.0.1-debug: 402 x 689 pixels
4.0.2-debug: 403 x 676 pixels <--- minor change in width but larger change in height

If I do a full page save with new RenderOptions { Dpi = dpi } and find the edges of the letter E, I get:

4.0.2-debug: 400 x 689 pixels

I am not expecting it to be perfect to the pixel, but it seems like the 4.0.2-debug version does some uneven scaling that I cannot account for being related to rounding of the bounds or anti-aliasing.

sungaila commented 3 months ago

Please make sure you have the correct binary referenced as NuGet will cache the old 4.0.2 since it is the same version. If you reference the dll directly you don't have to worry about this.

At 72 DPI there is no difference between 4.0.1 and 4.0.2 except for one additional pixel in width (517 vs 518).

At 720 DPI there seems to be a little bit of stretching applied if you diff both images. image

However, when comparing the bounds between new RectangleF(41.7255f, 186f, 517.2745f, 74f) and new RectangleF(0f, 0f, 612f, 792f), you will find that the output is identical (ignoring the extra whitespace).

So I'd argue that there is nothing unexpected happening here. Since the last test build, activating DpiRelativeToBounds will round the output up and add a few pixels. This will change the aspect ratio. Deactivating DpiRelativeToBounds will revert to the old behavior where one pixel will be lost.

I am not expecting it to be perfect to the pixel

Not gonna happen in this library anyway. ;-) I really don't want to deal with floating-point arithmetic and its consequences here. Heck, pdfium isn't even using double for better precision. The output always aims for "close enough".

osjoberg commented 3 months ago

I am certain that I never installed the old 4.0.2 version - you are so fast so I could not test it until the second version was ready. :)

OK, so there is no way to render within a bound without affecting the aspect-ratio of the rendered content?

The thing that I do not understand is how adding at most two pixels to the resulting image (considering the rounding down and up of x/y/width/height), which is really small, can make the height of the E change by 10 pixels in your comparison and 13 pixels when I measured it. In my thinking, the difference at higher DPI should be smaller than at low DPI.
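
The reasoning behind that expectation can be sketched with the numbers from this thread. The two extra pixels are the assumed worst case mentioned above, and the stretch is assumed to be distributed uniformly over the image height.

```csharp
using System;

// Bounds height from the 01.pdf repro, rendered at 720 DPI.
double boundsHeightPt = 74;
double dpi = 720;
double imageHeightPx = boundsHeightPt * dpi / 72.0;

// Assumed worst case: two extra pixels from rounding down and up.
double extraPixels = 2;

// Measured height of the letter E in the full-page render.
double letterHeightPx = 689;

// If the image is stretched uniformly by the extra pixels, the letter
// changes by the same proportion.
double expectedChangePx = letterHeightPx * extraPixels / imageHeightPx;

// Difference measured between the two debug builds (689 vs 676 pixels).
double observedChangePx = 689 - 676;

Console.WriteLine($"Image height:    {imageHeightPx} px");
Console.WriteLine($"Expected change: {expectedChangePx:F2} px");
Console.WriteLine($"Observed change: {observedChangePx} px");
```

Under these assumptions the expected change stays below two pixels, well short of the measured difference, which is why rounding of the bitmap size alone does not seem to explain it.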

If it is due to floating point arithmetic precision, maybe the calculations could be reordered to avoid really low or really high numbers to preserve the accuracy?

I am sorry to bother you with this; I understand if you don't want to spend more time on this edge case. Maybe I will have a window to check the code on Friday so I can get a deeper understanding of how everything works, and if I find something I can open a new issue.

sungaila commented 3 months ago

The thing that I do not understand is how the addition of at most two pixels to the resulting image

To support your use case with the blue rectangle + 1px border (at 72 DPI), I started rounding the bitmap width and height up. Before that, the width and height were truncated/rounded down. I compared the results with Gimp (uses Ghostscript) and truncating seems to be the way to go.

However, by changing the bitmap dimensions pdfium will stretch the content by 1px when activating DpiRelativeToBounds.

In my thinking the difference in higher DPI should be smaller than in low DPI.

Yes, the higher the DPI, the smaller the rounding errors. At least it should be.

If it is due to floating point arithmetic precision, maybe the calculations could be reordered to avoid really low or really high numbers to preserve the accuracy?

Good call, because I reordered the calculation of the dimensions. Now the DPI is applied first, then the remainder is added second. Otherwise the error introduced by the remainder will be multiplied with DPI which was not intended.
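
One way to read the reordering described above is sketched below. This is an assumption about the arithmetic, not the actual code in PdfDocument.Render, and the unit of the remainder in the library may differ.

```csharp
using System;

// X and width of the bounds from the 01.pdf repro, in points.
double x = 41.7255, width = 517.2745;

// 720 DPI corresponds to 10 pixels per point.
double scale = 720 / 72.0;

// Fraction cut off when X is rounded down.
double remainder = x - Math.Floor(x);

// Remainder added before scaling: its contribution is multiplied by the scale.
double remainderFirst = (width + remainder) * scale;

// Scale applied first, remainder added second: it contributes less than one unit.
double scaleFirst = width * scale + remainder;

Console.WriteLine($"Remainder first: {remainderFirst:F4}");
Console.WriteLine($"Scale first:     {scaleFirst:F4}");
Console.WriteLine($"Difference:      {remainderFirst - scaleFirst:F4}");
```

In this sketch the two orderings differ by the remainder times (scale - 1), so the gap grows with the DPI when the remainder is added before scaling.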

The difference between a full PDF render and using the bounds at higher DPI looks better now: image

PDFtoImage.100.0.0-test.zip

This is my final attempt and I would like to merge the pull request if you can accept the current imprecisions in 100.0.0-test. Feel free to take a look at PdfDocument.Render for the calculation and fix any mistakes.

osjoberg commented 3 months ago

I have gone through all my tests now, everything is looking good from my point of view. Thank you!

sungaila commented 3 months ago

@osjoberg Thank you so much for troubleshooting this issue and for your patience.

PDFtoImage 4.0.2 has been released. Please reopen this issue if something isn't working as expected.