mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

PDF coordinates from backend to pdfjs #10535

Closed jbham closed 5 years ago

jbham commented 5 years ago

Attach (recommended) or Link to PDF file here:

Configuration:

Steps to reproduce the problem: None. I am logging this as a question. No issue.

What is the expected behavior? (add screenshot) N/A

What went wrong? (add screenshot) N/A

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension): N/A

My question is as follows:

What have I done so far:

var tx = pdfjsLib.Util.transform( pdfjsLib.Util.transform( viewport.transform, textItems[i].transform ), [1, 0, 0, -1, 0, 0] );

However, this does not give me the expected numbers either.
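For readers following along, here is a minimal sketch of what the nested call above computes, using the numbers quoted in this issue. The `transform` helper is a local stand-in written for this example so it runs without the library; it follows the usual concatenation rule for `[a, b, c, d, e, f]` affine matrices. The `viewportTransform` value is an assumption (scale 1, no rotation, page view `[0, 0, 594.72, 775.08]`), not something taken from the actual viewer.

```javascript
// Local stand-in for pdfjsLib.Util.transform (written for this sketch):
// concatenates two 2D affine matrices given as [a, b, c, d, e, f].
function transform(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Text item transform reported by PDF.js in this thread.
const itemTransform = [8.630225, 0, 0, 9.333, 482.145, 664.62];

// ASSUMPTION: viewport at scale 1 with no rotation for a page whose
// view is [0, 0, 594.72, 775.08]; this flips the y-axis about the page height.
const viewportTransform = [1, 0, 0, -1, 0, 775.08];

// Same composition as in the snippet above.
const tx = transform(
  transform(viewportTransform, itemTransform),
  [1, 0, 0, -1, 0, 0]
);

// tx[4] is the x position, tx[5] the y position measured from the top of the page.
console.log(tx[4], tx[5]);
```

Under these assumptions `tx[5]` comes out at roughly 110.46, that is, the page height minus `transform[5]`. Whether that value should match a rectangle edge from another tool depends on which edge or baseline that tool reports.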

If I could get some help on this then I would be very appreciative! In the meantime, I'll continue to step through the code to see if I can locate something. I have logged this request after an exhaustive investigation over a week. Maybe I am just not looking in the right place or don't understand things :(

jbham commented 5 years ago

After further investigation, I found a common number that shows a difference between transform[5] coordinate from pdfjs and rectangle calculation from the backend solution:

Backend solution rectangle coordinates: (482.144, 103.423, 533.926, 112.757)

PDFJS transform: [8.630225, 0, 0, 9.333, 482.145, 664.62]

My calculation:

662.323 (775.08 - 112.757) from the backend solution, versus 664.62 from PDFJS. I subtract from the page height (775.08) because the y-axis is flipped between my backend solution and pdfjs (canvas).

Difference is 2.297

This number holds true for all of the items PDFJS finds.
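The arithmetic above can be written out as a short sketch. The variable names and the `flipY` helper are made up for this example, and it assumes the backend measures y downward from the top edge of the page, which is why the value is subtracted from the page height.

```javascript
// Values quoted in this thread.
const pageView = [0, 0, 594.72, 775.08];                      // [x1, y1, x2, y2]
const backendRect = [482.144, 103.423, 533.926, 112.757];     // backend rectangle
const pdfjsTransform = [8.630225, 0, 0, 9.333, 482.145, 664.62];

// Hypothetical helper: convert a y value measured from the top of the page
// (ASSUMED backend convention) into PDF user space, where y grows upward.
function flipY(yFromTop, view) {
  return view[3] - yFromTop;
}

const flippedY = flipY(backendRect[3], pageView);   // 775.08 - 112.757
const difference = pdfjsTransform[5] - flippedY;    // 664.62 - 662.323

console.log(flippedY.toFixed(3), difference.toFixed(3));
```

This only reproduces the subtraction; it does not say where the offset comes from or whether it is safe to treat it as a constant for other files.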

You guys are the experts and I would greatly appreciate if you could shed some light on how this 2.297 is being derived. Can I use this as a constant for all of the items?

Appreciate in advance you taking the time to look into this for me.

Thanks!

timvandermeij commented 5 years ago

This looks like #10508 which was merged just a few days ago. It fixed #8276 which caused transform arrays to be calculated incorrectly. You could verify this by checking out the code from the master branch.

jbham commented 5 years ago

Thanks @timvandermeij! I tried the following and still got the same number:

And still got the transform[5] as 664.62.

Is there anything else you would recommend checking, assuming all of the above steps were followed correctly?

This may be too much to ask, but if it is possible: using just plain addition/subtraction, would you know whether pdfjs is calculating transform[5] correctly?

Thanks again!!

jbham commented 5 years ago

@timvandermeij - I have a hunch about that mysterious difference and wanted to see if I could run it by you:

I used "http://brendandahl.github.io/pdf.js.utils/browser/" to review my PDF document's internals. The rectangle coordinates and other data points that I mentioned in the comments above pertain to the string below (07/14/2016):

Tm /TT1 1 Tf (ate: ) Tj ET Q q 1 0 0 1 0 -0.06 cm BT 8.630225 0 0 9.333 482.145 664.68
Tm /TT1 1 Tf (07/14/2016) Tj ET Q q 1 0 0 1 0 -0.06 cm BT 14.62766 0 0 10.667 57.164 604.2

When I stepped through the code, I saw this matrix ([1 0 0 1 0 -0.06]) used in matrix multiplication in multiple different places. There is a "cm" after this matrix which I think stands for centimeter. This is a total guess as I am not familiar with PDF's internal workings, but I definitely learned quite a bit in the last week lol...feel free to shoot it down if it is incorrect.

I converted the -0.06 from cm to px and got "-2.267716535". In my backend solution, I transformed the rectangle coordinates I was getting with that matrix converted from cm to px, [1 0 0 1 0 -2.267716535], and I got values which are fairly close to what PDFJS is giving me in transform[5].

Converting that -0.06 to computer points didn't give me the mysterious difference.

Would you say I am on the right track?

I can't share the actual PDF because it has sensitive data. Sorry.

Thanks!

Snuffleupagus commented 5 years ago

Attach (recommended) or Link to PDF file here:

Without access to a test-case, this issue is unfortunately not actionable.

  • For a specific string on a PDF, I am getting a rectangle from backend solution (482.144, 103.423, 533.926, 112.757) and width of 51.78 and height of the font is 9.233.

  • Page's view is [0, 0, 594.72, 775.08].

  • When I load the same pdf in pdfjs, I get a transform array as [8.630225, 0, 0, 9.333, 482.145, 664.62] for the same specific string.

This entire issue seems to be based on the assumption that the result obtained from the backend is correct and that PDF.js is wrong. However, what if it's actually the other way around? (Please note that I'm not saying there isn't a problem in the PDF.js library, but it seems reasonable to "challenge" what seems to be the basis for this issue.)

Furthermore, without access to either the PDF file or the complete implementation it's unfortunately not possible for anyone to provide assistance; please see https://github.com/mozilla/pdf.js/blob/master/.github/CONTRIBUTING.md (emphasis mine):

If you are developing a custom solution, first check the examples at https://github.com/mozilla/pdf.js#learning and search existing issues. If this does not help, please prepare a short well-documented example that demonstrates the problem and make it accessible online on your website, JS Bin, GitHub, etc. before opening a new issue or contacting us on the IRC channel -- keep in mind that just code snippets won't help us troubleshoot the problem.


There is a "cm" after this matrix which I think stands for centimeter.

cm is simply the transform operator; it relates to the "user space units" in the PDF specification.
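As a rough illustration of what that means for the numbers in this thread, the sketch below concatenates the `cm` matrix from the quoted content stream with the `Tm` text matrix that follows it. The `concat` helper is written for this example and applies the standard rule for `[a, b, c, d, e, f]` affine matrices; it assumes the `cm` matrix is the only transformation in effect at that point, which cannot be confirmed without the file.

```javascript
// Helper written for this sketch: concatenate two 2D affine matrices
// given as [a, b, c, d, e, f], with m2 applied inside the space defined by m1.
function concat(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// "1 0 0 1 0 -0.06 cm" from the quoted content stream: a pure translation
// of -0.06 user space units along y (no scaling, no unit conversion).
const cmMatrix = [1, 0, 0, 1, 0, -0.06];

// "8.630225 0 0 9.333 482.145 664.68 Tm" from the same stream.
const textMatrix = [8.630225, 0, 0, 9.333, 482.145, 664.68];

// ASSUMPTION: no other transformation is active besides this cm matrix.
const combined = concat(cmMatrix, textMatrix);

console.log(combined[5].toFixed(2));
```

Under that assumption the combined y translation is 664.68 - 0.06, which lines up with the 664.62 reported as transform[5] earlier in the thread. No centimeter-to-pixel conversion is involved in this step.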

jbham commented 5 years ago

Thanks @Snuffleupagus for taking the time and replying! I was afraid this question would come up and I would not have an answer for it, which is why I waited for a week to do as much due diligence as possible. Let me see if I can use a generic pdf and recreate an example. However, I would like to point out that I don't think there is an issue on either side at this time because I don't know how transform[5] value is calculated. Hence I logged this request as a question and said:

Steps to reproduce the problem: None. I am logging this as a question. No issue.

If I could get some knowledge on how transform[5] value is calculated then that should be sufficient for me to go back and figure out whether the issue is in pdfjs or backend solution. Going through pdfjs issues in github, I wasn't able to find an answer to this. If the answer to this question is no, then feel free to close this.

Thanks!

Snuffleupagus commented 5 years ago

If I could get some knowledge on how transform[5] value is calculated then that should be sufficient for me to go back and figure out whether the issue is in pdfjs or backend solution.

I'm not sure how helpful this is, but the relevant part of the PDF specification can be found here: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1910927 and the current PDF.js implementation can be found here (a possible starting point could be this code): https://github.com/mozilla/pdf.js/blob/dfe7d9bc26548cb8ad6ff09eb66efa85ce248bb9/src/core/evaluator.js#L1278-L1866

jbham commented 5 years ago

Thanks @Snuffleupagus. I was already going through the loop but I wasn't able to determine the exact pattern. But let me see if I can create a PDF with just that one value and its associated rect coordinates. That should make it easier to go through that loop.

If possible, can we keep this ticket open for a few days please?

Snuffleupagus commented 5 years ago

@timvandermeij Should we close this now, since there's unfortunately not enough information to make it actionable from a PDF.js point of view?