Open TheLastRar opened 11 years ago
They seem related, yes. I can reproduce this using Windows 7 x64, Firefox Nightly 30.0a1 (HWA on, 64 bit) and the latest PDF.js development version. I heard from @yurydelendik that it looks better on OS X for some reason, but the main problem seems to be that the math for adding the image chunks together is a bit off.
This is one of the more common issues I experience with this viewer. I can probably add more example PDFs if you need.
Thank you. We also have #3331, but adding more example PDFs here is always welcome.
Chrome (using pdf.js web viewer) displays some of these better
https://www.dropbox.com/s/pwznkfix0vkwlz2/SI%20Lecture.pdf
https://www.dropbox.com/s/l2up0sc4jepo7ay/Debriefing%20and%20presentation%20on%20section%20drawing.pdf (Affects both text and images)
https://www.dropbox.com/s/91knasncff7imhv/Lec2.pdf (Page 3 - Earth)
https://www.dropbox.com/s/trnaaknkx93vilx/Lec23.pdf (Most images)
https://www.dropbox.com/s/efqkus46fhtvl93/Lec1-3%20-%20Introduction.pdf (Affects text, see page 3)
https://www.dropbox.com/s/rdh5ph4gl7xu6bq/Lec14%2C16%20-%20Pollution.pdf (text and images)
https://www.dropbox.com/s/exu3hi6moumkj8g/Lab2%20-%20page2.pdf (Image on Page 2, even chrome has issues with this)
https://www.dropbox.com/s/m3h40cibxqcw6k0/Lec2.pdf (most images from page 7 onwards)
https://www.dropbox.com/s/49064yy91egq7j0/Lec3.1.pdf (images on page 3-4)
http://www.edexcel.com/migrationdocuments/GCE%20New%20GCE/N38210A-GCE-Mathematical-Formulae-Statistical-Tables.pdf (Link dead, see wayback link below)
https://web.archive.org/web/20120917085705/http://www.edexcel.com/migrationdocuments/GCE%20New%20GCE/N38210A-GCE-Mathematical-Formulae-Statistical-Tables.pdf (blue banner on page 1)
Edit: also https://www.dropbox.com/s/rxk1n87pmumrixl/Doc4%20-%20Lec7%20-%20Cermics.pdf (all slides). The background for the above PDF is black when it should be white. I see that there is a similar issue open for black backgrounds; do you want me to open a separate bug for this PDF anyway?
Wow, thanks for the good test cases! I haven't checked all of them, but I could reproduce the ones I did check.
Regarding the black background: this is the first of those issues that I can actually reproduce on Windows. It might be best to just add that file with your configuration details (OS name and version, browser name and version, PDF.js version) at the existing issue.
Updated Dropbox links
Here's another example, but it includes a solution.
To fix it, I took the original image, resized it, and pasted it into the document that I made the PDF with.
I was using MS Word; just changing the size in MS Word didn't help, but resizing the original image and then pasting it again fixed it. Below is an example of two different images resized:
http://www.pdf-archive.com/2015/03/07/test2200-2/test2200.pdf
I made the PDF with MS Word and PDFCreator v2.0.2.
This problem doesn't happen in Google Chrome, but it does happen in Firefox and in the Evince app on Linux.
Has there been any progress made regarding this issue? Could someone point me in the right direction in regards to where PDF.JS begins to attach each object of the pdf onto the canvas?
I assume the issue is caused by incorrectly stitching multiple images together. In the first sample PDF provided, the main image is actually composed of 7 smaller/sliced images.
I think the easiest way to debug this is to run `gulp server` locally and append `#pdfBug=Stepper` to the URL, for example http://localhost:8888/web/viewer.html#pdfBug=Stepper. This will open the PDF debugging utilities and allow you to step through the drawing operations so you can see which operation causes the defect. More information can be found at https://github.com/mozilla/pdf.js/wiki/Debugging-pdf.js#stepper.
I have looked into this and I think I have found the problem. I am not 100% sure how drawing works when widths are more precise than whole pixels can account for, but I have an idea.
It appears that when calling .drawImage on our canvas object, we do so given the exact height and width of the object that we wish to draw (image, rectangle, etc). By doing this, without taking pixel width into account, it seems that we are leaving some pixels partially colored. When stitching multiple parallel images together, this leaves a visible gap between them.
Ex. Given a pixel width of 0.5 and an image with width 1.02, we are coloring 2 full pixels and then leaving the 3rd pixel partially transparent. The next image to be stitched then starts at the 4th pixel, leaving the gap that the 3rd pixel has generated.
Using this sample file (CompWiz_Page1.pdf), which draws many rectangles on the canvas to create a gradient effect, I have come up with the following solution.
In Canvas.constructPath(), we are creating the rectangles that we wish to attach to our canvas. In order to account for pixel size, I made the following modifications (see comments):
BEFORE

```js
constructPath: function CanvasGraphics_constructPath(ops, args) {
  var ctx = this.ctx;
  var current = this.current;
  var x = current.x, y = current.y;
  for (var i = 0, j = 0, ii = ops.length; i < ii; i++) {
    switch (ops[i] | 0) {
      case OPS.rectangle:
        x = args[j++];
        y = args[j++];
        var width = args[j++];
        var height = args[j++];
        if (width === 0) {
          width = this.getSinglePixelWidth();
        }
        if (height === 0) {
          height = this.getSinglePixelWidth();
        }
        var xw = x + width;
        var yh = y + height;
```
AFTER

```js
constructPath: function CanvasGraphics_constructPath(ops, args) {
  var ctx = this.ctx;
  var current = this.current;
  var x = current.x, y = current.y;
  var origX, origY, pixelSize = this.getSinglePixelWidth();
  for (var i = 0, j = 0, ii = ops.length; i < ii; i++) {
    switch (ops[i] | 0) {
      case OPS.rectangle:
        // Keep the original coordinates before consuming the arguments.
        origX = args[j];
        origY = args[j + 1];
        // Round down to the nearest multiple of our pixel width.
        x = Math.floor(args[j++] / pixelSize) * pixelSize;
        y = Math.floor(args[j++] / pixelSize) * pixelSize;
        var width = args[j++];
        var height = args[j++];
        if (width === 0) {
          width = this.getSinglePixelWidth();
        }
        if (height === 0) {
          height = this.getSinglePixelWidth();
        }
        // Use the non-rounded original x/y values to round up to the
        // nearest pixel-width multiple.
        var xw = Math.ceil((origX + width) / pixelSize) * pixelSize;
        var yh = Math.ceil((origY + height) / pixelSize) * pixelSize;
```
Doing this ensures that each pixel is fully colored and there is no transparency between stitched images.
I was able to fix the line issue that the first example file attached in the OP comment (Doc2.pdf) has with its image using the same logic, modifying Canvas.paintJpegXObject() and Canvas.paintInlineImageXObject(). The following lines were added before writing the images to the canvas:
```js
var pixelSize = this.getSinglePixelWidth();
h = Math.ceil(h / pixelSize) * pixelSize;
w = Math.ceil(w / pixelSize) * pixelSize;
```
If this is a viable solution, I think that it would need to be implemented before each drawing interaction with our canvas. If that is the case, I'd be happy to work on its implementation.
> `this.getSinglePixelWidth();`
>
> If this is a viable solution, I think that it would need to be implemented before each drawing interaction with our canvas. If that is the case, I'd be happy to work on its implementation.
Please note that that particular method may be "expensive" (performance wise), so using it everywhere would probably require very careful benchmarking; please refer to e.g. the discussion starting in https://github.com/mozilla/pdf.js/pull/4615#discussion_r11567342 and the results in https://github.com/mozilla/pdf.js/pull/4615#issuecomment-40874837.
If you're interested in benchmarking this, you can use our benchmarking tool as described on https://github.com/mozilla/pdf.js/wiki/Benchmarking-your-changes.
Interesting, I just got caught up on the topic; I was unaware that this solution has already been considered.
Is it possible to see at what number of calls to `.getSinglePixelWidth()` we start to see a regression in performance? I am not familiar with how `#pdfBug=Stepper` works, but if PDF.js has access to the load operations and can count the number of calls to Restore/Transform/ShowText (the methods resetting our cache), we can determine the number of times we would need to call `.getSinglePixelWidth()`.
If that value falls under our performance regression threshold, we can proceed to load the document with accurate pixel measurements.
Something like this would allow for higher quality document display in standard cases while avoiding the performance regression seen in extreme cases.
I think the stepper should provide you with that. It shows you when and in which order it performs a drawing operation, for example restore/transform since they are native PDF commands.
> Is it possible to see at what number of calls to `.getSinglePixelWidth()` we start to see a regression in performance?
That sounds really difficult, since it seems like it'd be highly dependent on e.g. the complexity and the ordering of rendering OPS of particular PDF files; not to mention dependent on general computer/device performance.
> but if pdf.js has access to the load operations and can count the number of calls to Restore/Transform/ShowText (the methods resetting our cache),

There would be overhead/complexity related to counting the operations as well, especially considering that only some ShowText commands reset the `cachedGetSinglePixelWidth` cache.
> Something like this would allow for higher quality document display in standard cases while avoiding the performance regression seen in extreme cases.
That seems immediately problematic, since it would introduce inconsistent rendering behaviour for different kinds of PDF files. (Some files rendering correctly, and others not, doesn't seem like great UX.)
It might happen that it's deemed OK to make use of `getSinglePixelWidth()` more in `canvas.js`, even if it would result in a slight regression in edge cases (such as e.g. PDF files containing complex maps).
In the past, smaller regressions have been accepted in return for significant rendering improvements (in multiple files); but without actual benchmarking results[1], especially with larger and more complex PDF files, it's mostly speculation at this point :-)
I didn't mean to deter you from looking into this with https://github.com/mozilla/pdf.js/issues/3351#issuecomment-410348027; sorry if it came off that way!
Finally, one piece of advice (based on experience): Before you put a lot of time and effort into doing multiple benchmark runs, make sure that you run the entire test-suite first to check that there aren't any obvious regressions from your changes.[2]
[1] One small piece of advice regarding https://github.com/mozilla/pdf.js/wiki/Benchmarking-your-changes: the example contains `"rounds": 20`, however in practice you often need to use at least `"rounds": 50` to get statistically significant results.
[2] There's nothing worse than having to re-do entire benchmark runs just because of some simple error in code you wrote yourself. Don't ask me how I know this ;-)
Thank you for the thorough reply. I have started to experiment with the benchmark testing. Are there specific files which are noteworthy? Or is it best to run the benchmark on the whole list of test PDFs?
I think the PDF file in #2813 may be a good one for your benchmark since it's a complicated PDF with a lot of drawing operations. Other than the PDF files in this issue, I think taking a subset of let's say 20 random PDF files from the existing suite should be sufficient. Perhaps @Snuffleupagus has more ideas though.
Here are the results when testing 50 rounds on the above document:
-- Grouped By browser, stat --
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
Firefox | Overall | 50 | 16139 | 16974 | 834 | 5.17 |
Firefox | Page Request | 50 | 10 | 10 | 0 | -2.53 |
Firefox | Rendering | 50 | 16129 | 16963 | 834 | 5.17 |
What exactly is it that I am looking for from these results? Is there a rule of thumb as to what range of values from the baseline are acceptable?
> What exactly is it that I am looking for from these results? Is there a rule of thumb as to what range of values from the baseline are acceptable?
Usually we look at the t-test column `Result(P<.05)`. In this case it looks like the change isn't statistically significant since it's not labeled 'slower' or 'faster'. It is kind of interesting that the mean of rendering is 5% worse and it's still not significant. Can you attach the JSON file that the benchmark generates?
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | ---- | ------ | -------------
Firefox | Overall | 50 | 17427 | 16954 | -474 | -2.72 |
Firefox | Page Request | 50 | 11 | 9 | -2 | -14.34 |
Firefox | Rendering | 50 | 17416 | 16944 | -472 | -2.71 |
browser | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | ----- | ------ | -------------
Firefox | Overall | 50 | 18640 | 15158 | -3481 | -18.68 | faster
Firefox | Page Request | 50 | 11 | 10 | -1 | -5.59 |
Firefox | Rendering | 50 | 18629 | 15148 | -3481 | -18.69 | faster
I ran the benchmark 2 more times and above are the results. It seems strange to me that they differ so much from one another. Is this normal?
Here is the resulting json from the last current run (Renamed to .txt because .json uploading is not supported). current.txt
> Is this normal?
Yes, benchmarking browsers can be very fickle. One issue with the current benchmark setup is that it basically just re-runs rendering a page many times in the same browser page, which is not a realistic scenario in the real world (usually we'll just render a page once, so first-time paint is the most important). What we really should probably do is restart the browser on each round, or run each round under a different domain so the caches are clean. This is all beyond the scope of this bug, but if you're interested in working on it, I'd be happy to help.
Just to be clear on your above results, did you run it so baseline = no changes and current = with your changes? It seems odd that it is now showing up as faster vs. your previous results.
That is a great point about the testing. After this is solved, I may dive into that to see what can be done.
For testing: Baseline = Original code w/ no changes Current = code w/ changes
For each test, I regenerated both the baseline and current results.
Are there any other tests that I should be running to check performance results?
Doesn't look like it's causing issues, so feel free to open up a PR.
There are different issues here:
The underlying problems are slightly different.
About images: we must draw them at integer coordinates and with integer dimensions. According to MDN, this should improve performance, and in the PDF context it should remove the thin lines we have between two tiles. In any case, it should help to fix PDFs like Doc2.pdf.
About paths, the situation is slightly different: after applying the current transformation we must round and then add 0.5. The 0.5 guarantees that 1px lines are really 1px lines; there is an explanation on MDN of how lines are drawn in a canvas.
The extra transparency added around lines can lead to some wrong color rendering with very close thin lines; see bug 1657838. So we probably need to round coordinates correctly ourselves, which should induce a perf penalty, but it could be more or less offset by a perf improvement when drawing, so we definitely need to measure that.
Any progress?
Another way to reproduce this bug, but with rectangles again instead of images;
The Word doc renders the background of the pasted text as a solid color, and the resulting PDF looks correct in Adobe and Chrome, but Firefox bleeds some of the background color through between each line of the code.
On PDFs where the creation software splits an image into smaller chunks, PDF.js will produce a white line between them, while Adobe Reader renders them correctly.
Example pdf: https://www.dropbox.com/s/gt1b2y9bgf06pam/Doc2.pdf
Edit: may be related to issue #3331 except with images instead of filled rectangles.