mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0
48.48k stars 9.98k forks source link

White lines though images on certain pdfs #3351

Open TheLastRar opened 11 years ago

TheLastRar commented 11 years ago

On pdfs where the creation software splits the image into smaller chunks, pdf.js will produce a white line between them while adobe reader renders them correctly.

Example pdf: https://www.dropbox.com/s/gt1b2y9bgf06pam/Doc2.pdf

Edit: may be related to issue #3331 except with images instead of filled rectangles.

timvandermeij commented 10 years ago

They seem related, yes. I can reproduce this using Windows 7 x64, Firefox Nightly 30.0a1 (HWA on, 64 bit) and the latest PDF.js development version. I heard from @yurydelendik that it looks better on OS X for some reason, but the main problem seems to be that the math for adding the image chunks together is a bit off.

TheLastRar commented 10 years ago

This is one of the more common issues I experience with this viewer. I can probable add more example pdfs if you need.

timvandermeij commented 10 years ago

Thank you. We also have #3331, but adding more example PDFs here is always welcome.

TheLastRar commented 10 years ago

Chrome (using pdf.js web viewer) displays some of these better

https://www.dropbox.com/s/pwznkfix0vkwlz2/SI%20Lecture.pdf

https://www.dropbox.com/s/l2up0sc4jepo7ay/Debriefing%20and%20presentation%20on%20section%20drawing.pdf (Effecting both text and images)

https://www.dropbox.com/s/91knasncff7imhv/Lec2.pdf (Page 3 - Earth)

https://www.dropbox.com/s/trnaaknkx93vilx/Lec23.pdf (Most images)

https://www.dropbox.com/s/efqkus46fhtvl93/Lec1-3%20-%20Introduction.pdf (Effects text, see page 3)

https://www.dropbox.com/s/rdh5ph4gl7xu6bq/Lec14%2C16%20-%20Pollution.pdf (text and images)

https://www.dropbox.com/s/exu3hi6moumkj8g/Lab2%20-%20page2.pdf (Image on Page 2, even chrome has issues with this)

https://www.dropbox.com/s/m3h40cibxqcw6k0/Lec2.pdf (most images from page 7 onwards)

https://www.dropbox.com/s/49064yy91egq7j0/Lec3.1.pdf (images on page 3-4)

http://www.edexcel.com/migrationdocuments/GCE%20New%20GCE/N38210A-GCE-Mathematical-Formulae-Statistical-Tables.pdf (Link dead, see wayback link below) https://web.archive.org/web/20120917085705/http://www.edexcel.com/migrationdocuments/GCE%20New%20GCE/N38210A-GCE-Mathematical-Formulae-Statistical-Tables.pdf (blue banner on page 1)

Edit: also https://www.dropbox.com/s/rxk1n87pmumrixl/Doc4%20-%20Lec7%20-%20Cermics.pdf (all slides) the background for the above pdf is black when it should be white. I see that there is a similar issue open for black backgrounds, do you want me to open a separate bug for this pdf anyway?

timvandermeij commented 10 years ago

Wow, thanks for the good test cases! I haven't checked all of them, but I could reproduce the ones I did check.

Regarding the black background: this is the first of those issues that I can actually reproduce on Windows. It might be best to just add that file with your configuration details (OS name and version, browser name and version, PDF.js version) at the existing issue.

TheLastRar commented 10 years ago

Updated Dropbox links

niknah commented 9 years ago

Here's another example. But it includes a solution.

To fix it up, I got the original image, resized it and paste it into the document that I made the pdf with.

I was using MSword, just changing the size in msword didn't help. But resizing the original image and then pasting it again fixed it. Below is an example of two different images resized...

http://www.pdf-archive.com/2015/03/07/test2200-2/test2200.pdf

I made the pdf with msword, pdfcreator v2.0.2

This problem doesn't happen in Google Chrome. But it happens in Firefox, and the evince app in linux.

brianholle commented 6 years ago

Has there been any progress made regarding this issue? Could someone point me in the right direction in regards to where PDF.JS begins to attach each object of the pdf onto the canvas?

I assume the issue is caused by incorrectly stitching multiple images together. In the first sample PDF provided, the main image is actually composed of 7 smaller/sliced images.

timvandermeij commented 6 years ago

I think the easiest way to debug this is to use gulp server locally and prepend #pdfBug=Stepper to the URL, for example http://localhost:8888/web/viewer.html#pdfBug=Stepper. This will open the PDF debugging utilities and allow you to step through the drawing operations so you can see which operation causes the defect. More information can be found at https://github.com/mozilla/pdf.js/wiki/Debugging-pdf.js#stepper.

brianholle commented 6 years ago

I have looked into this and I think I have found the problem, although, I am not 100% sure how drawing works when width is more accurate than pixels can account for, but i have an idea.

It appears that when calling .drawImage on our canvas object, we do so given the exact height and width of the object that we wish to draw (image, rectangle, etc). By doing this, without taking pixel width into account, it seems that we are leaving some pixels partially colored. When stitching multiple parallel images together, this leaves a visible gap between them.

Ex. Given a pixel width of 0.5 and an image with width 1.02, we are coloring 2 full pixels and then leaving the 3rd pixel partially transparent. The next image to be stitched then starts at the 4th pixel, leaving the gap that the 3rd pixel has generated.

Using this sample file (CompWiz_Page1.pdf), which draws many rectangles on the canvas to create a gradient effect, I have came up with the following solution.

In Canvas.constructPath(), we are creating the rectangle's that we wish to attach to our canvas. In order to account for Pixel Size, I made the following modifications (see comments):

BEFORE

    constructPath: function CanvasGraphics_constructPath(ops, args) {
      var ctx = this.ctx;
      var current = this.current;
      var x = current.x, y = current.y;
      for (var i = 0, j = 0, ii = ops.length; i < ii; i++) {
        switch (ops[i] | 0) {
          case OPS.rectangle:
            x = args[j++];
            y = args[j++];
            var width = args[j++];
            var height = args[j++];
            if (width === 0) {
              width = this.getSinglePixelWidth();
            }
            if (height === 0) {
              height = this.getSinglePixelWidth();
            }
            var xw = x + width;
            var yh = y + height;

AFTER

    constructPath: function CanvasGraphics_constructPath(ops, args) {
      var ctx = this.ctx;
      var current = this.current;
      var x = current.x, y = current.y;
      var origX, origY, pixelSize = this.getSinglePixelWidth();
      for (var i = 0, j = 0, ii = ops.length; i < ii; i++) {
        switch (ops[i] | 0) {
          case OPS.rectangle:
            origX = args[0];
            origY = args[1];
            //Round down to the nearest multiple of our pixel width
            x = Math.floor(args[j++]/pixelSize)*pixelSize; 
            y = Math.floor(args[j++]/pixelSize)*pixelSize;
            var width = args[j++];
            var height = args[j++];
            if (width === 0) {
              width = this.getSinglePixelWidth();
            }
            if (height === 0) {
              height = this.getSinglePixelWidth();
            }
            //Use the non rounded original x/y values to round up to nearest pixel width multiple.
            var xw = Math.ceil((origX + width)/pixelSize)*pixelSize; 
            var yh = Math.ceil((origY + height)/pixelSize)*pixelSize;

Doing this ensures that each pixel is fully colored and there is no transparency between stitched images.

I was able to fix the line issue that Doc2 (first example file attached in OP comment) has with his image using the same logic and modifying Canvas.paintJpegXObject() and Canvas.paintInlineImageXObject(). The following lines were added before writing the images to the canvas:

var pixelSize = this.getSinglePixelWidth();
h = Math.ceil(h/pixelSize)*pixelSize;
w = Math.ceil(w/pixelSize)*pixelSize;

If this is a viable solution, I think that it would need to be implemented before each drawing interaction with our canvas. If that is the case, I'd be happy to work on its implementation.

Snuffleupagus commented 6 years ago

this.getSinglePixelWidth();

If this is a viable solution, I think that it would need to be implemented before each drawing interaction with our canvas. If that is the case, I'd be happy to work on its implementation.

Please note that that particular method may be "expensive" (performance wise), so using it everywhere would probably require very careful benchmarking; please refer to e.g. the discussion starting in https://github.com/mozilla/pdf.js/pull/4615#discussion_r11567342 and the results in https://github.com/mozilla/pdf.js/pull/4615#issuecomment-40874837.

timvandermeij commented 6 years ago

If you're interested in benchmarking this, you can use our benchmarking tool as described on https://github.com/mozilla/pdf.js/wiki/Benchmarking-your-changes.

brianholle commented 6 years ago

Interesting, I just got caught up on the topic, I was unaware that this solution has already been thought about.

Is it possible to see at what number of calls to .getSinglePixelWidth() we start to see a regression in performance? I am not familiar with how the #pdfbug=Stepper works, but if pdf.js has access to the load operations and can count the number of calls to Restore/Transform/ShowText (the methods resetting our cache), we can determine the amount of times we would need to call .getPixelWidth().

If that value falls under our performance regression threshold, we can proceed to load the document with accurate pixel measurements.

Something like this would allow for higher quality document display in standard cases while avoiding the performance regression seen in extreme cases.

timvandermeij commented 6 years ago

I think the stepper should provide you with that. It shows you when and in which order it performs a drawing operation, for example restore/transform since they are native PDF commands.

Snuffleupagus commented 6 years ago

Is it possible to see at what number of calls to .getSinglePixelWidth() we start to see a regression in performance?

That sounds really difficult, since it seems like it'd be highly dependant on e.g. the complexity and the ordering of rendering OPS of particular PDF files; not to mention dependent on general computer/device performance.

but if pdf.js has access to the load operations and can count the number of calls to Restore/Transform/ShowText (the methods resetting our cache),

There would be overhead/complexity related to counting the operations as well, especially considering that only some ShowText commands are resetting the cachedGetSinglePixelWidth cache.

Something like this would allow for higher quality document display in standard cases while avoiding the performance regression seen in extreme cases.

That seems immediately problematic, since it would introduce inconsistent rendering behaviour for different kinds of PDF files. (Some files rendering correctly, and others not, doesn't seem like great UX.)


It might happen that it's deemed OK to make use of getSinglePixelWidth() more in canvas.js, even if it would result in a slight regression in edge-cases (such as e.g. PDF files containing complex maps).

In the past, smaller regressions have been accepted in return for significant rendering improvements (in multiple files); but without actual benchmarking results[1], especially with larger and more complex PDF files, it's mostly speculation at this point :-)

I didn't mean to deter you from looking into this with https://github.com/mozilla/pdf.js/issues/3351#issuecomment-410348027; sorry if it came off that way!

Finally, one piece of advice (based on experience): Before you put a lot of time and effort into doing multiple benchmark runs, make sure that you run the entire test-suite first to check that there aren't any obvious regressions from your changes.[2]


[1] One small advice regarding https://github.com/mozilla/pdf.js/wiki/Benchmarking-your-changes. The example contains "rounds": 20, however in practice you often need to use at least "rounds": 50, to get statistically significant results.

[2] There's nothing worse than having to re-do entire benchmark runs just because of some simple error in code you wrote yourself. Don't ask me how I know this ;-)

brianholle commented 6 years ago

Thank you for the through reply. I have started to mess around with the benchmark testing. Are there specific files which are noteworthy? Or is it best to run the benchmark on the whole list of test PDF's?

timvandermeij commented 6 years ago

I think the PDF file in #2813 may be a good one for your benchmark since it's a complicated PDF with a lot of drawing operations. Other than the PDF files in this issue, I think taking a subset of let's say 20 random PDF files from the existing suite should be sufficient. Perhaps @Snuffleupagus has more ideas though.

brianholle commented 6 years ago

Here are the results when testing 50 rounds on the above document:

-- Grouped By browser, stat --
browser | stat         | Count | Baseline(ms) | Current(ms) | +/- |    %  | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall      |    50 |        16139 |       16974 | 834 |  5.17 |              
Firefox | Page Request |    50 |           10 |          10 |   0 | -2.53 |              
Firefox | Rendering    |    50 |        16129 |       16963 | 834 |  5.17 |              

What exactly is it that I am looking for from these results? Is there a rule of thumb as to what range of values from the baseline are acceptable?

brendandahl commented 6 years ago

What exactly is it that I am looking for from these results? Is there a rule of thumb as to what range of values from the baseline are acceptable?

Usually we look at the t-test column Result(P<.05). In this case it looks like the change isn't statistically significant since it's not labeled 'slower' or 'faster'. It's is kind of interesting that the mean of rendering is 5% worse and it's still not significant. Can you attach the json file that the benchmark generates?

brianholle commented 6 years ago
browser | stat         | Count | Baseline(ms) | Current(ms) |  +/- |     %  | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | ---- | ------ | -------------
Firefox | Overall      |    50 |        17427 |       16954 | -474 |  -2.72 |              
Firefox | Page Request |    50 |           11 |           9 |   -2 | -14.34 |              
Firefox | Rendering    |    50 |        17416 |       16944 | -472 |  -2.71 |   
browser | stat         | Count | Baseline(ms) | Current(ms) |   +/- |     %  | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | ----- | ------ | -------------
Firefox | Overall      |    50 |        18640 |       15158 | -3481 | -18.68 |        faster
Firefox | Page Request |    50 |           11 |          10 |    -1 |  -5.59 |              
Firefox | Rendering    |    50 |        18629 |       15148 | -3481 | -18.69 |        faster

I ran the benchmark 2 more times and above are the results. It seems strange to me that they differ so much from one another. Is this normal?

Here is the resulting json from the last current run (Renamed to .txt because .json uploading is not supported). current.txt

brendandahl commented 6 years ago

Is this normal?

Yes, benchmarking browser's can be very fickle. One issue with the current benchmark setup is that it basically just re-runs rendering a page many times in the same browser page, which is not a realistic scenario in the real world (usually we'll just render a page once, so first time paint is the most important). What we really should probably do is restart the browser on each round or run reach round under a different domain so the caches are clean. This is all beyond the scope of this bug, but if you're interested in working on it, I'd be happy to help.

Just to be clear on your above results, did you run it so baseline = no changes and current = with your changes? It seems odd that now it is showing up as faster vs your previous results.

brianholle commented 6 years ago

That is a great point about the testing. After this is solved, I may dive into that to see what can be done.

For testing: Baseline = Original code w/ no changes Current = code w/ changes

For each test, I regenerated both the baseline and current results.

brianholle commented 6 years ago

Are there any other tests that I should be running to check performance results?

brendandahl commented 6 years ago

Doesn't look like it's causing issues, so feel free to open up a PR.

calixteman commented 2 years ago

There are different issues here:

The underlying problems are slightly different.

About images, we must draw them at integer coordinates and with integer dimensions, according to MDN, it should help to improve performance and in the pdf context it should remove the thin lines we have between two tiles. Anyway it should help to fix pdfs like: Doc2.pdf

About paths, the situation is slightly different, after apply current transformation we must round and add 0.5. The 0.5 will guarantee that 1px lines are really 1px lines: there is an explanation on MDN to explain how lines are drawing in a canvas: image

The extra transparency added around lines can lead to some wrong color rendering with very close thin lines see bug 1657838. So we probably need to round correctly coordinates ourselves which should induce a perf penalty but it could be more or less offset by a perf improvement when drawing, so for sure, we need to measure that.

yi-jy commented 10 months ago

Any progress ?

FuriousProgrammer commented 8 months ago

Another way to reproduce this bug, but with rectangles again instead of images;

  1. Write some code in Sublime Text, using a non-white background
  2. Edit->Copy as HTML
  3. Paste into a Word doc
  4. Export as PDF

The Word doc renders the background of the pasted text as a solid color, and the resulting PDF looks correct in Adobe and Chrome, but Firefox is bleeding through some of the background color in between each line of the code.

image issue_3351_sample.pdf