hypothesis / h

Annotate with anyone, anywhere.
https://hypothes.is/
BSD 2-Clause "Simplified" License
2.93k stars 427 forks source link

PDF: Improve native <-> pdf.js transition #1416

Closed csillag closed 10 years ago

csillag commented 10 years ago

In our initial solution for #1120, when the user clicks on the page action icon to switch from the native PDF viewer to our combined PDF.js + Hypothes.is viewer, we reload the PDF file as part of the transition.

This works, but maybe we could do better; maybe we could omit making a repeated HTTP request, and activate the new viewer on the already loaded data. I am not sure this is feasible; what we do know is that this is what PDF.js does, when it's hijacking the PDF file loading process by default. (Currently, this logic only kicks in if Hypothes.is is already deployed on the tab while we load the PDF file.)

It might be possible to use the same trick on already loaded documents. If it is, then we can spare the second HTTP request.

This could be important for example for POST requests, when it might not be harmless to send second request.

csillag commented 10 years ago

Same goes to the reverse direction, too.

csillag commented 10 years ago

One more thing that would be nice to have is preserving the scroll position, when going from the native viewer to our viewer.

The problem is that the native viewer is only accessible to the HTML space as a binary embed object; I am not sure if we can extract any valuable information from there.

If we can, scroll position and maybe selection would be nice.

csillag commented 10 years ago

I started a discussion about this here; if it's possible, the members of this group should be able to tell how.

tilgovi commented 10 years ago

PDF.js for Chromium intercepts the load request and redirects to the viewer with the query string so that the viewer can fetch the PDF over XHR.

The webRequest API does not seem to provide the response body content in any of the callbacks. This makes sense, as Chromium may reasonably with to stream the body content and dispose of it after incremental parsing. For instance, I believe the show source might make a separate request and the inspector is obviously generated from the live DOM.

Also, we'll see if anyone on that mailing list can tell us otherwise but I almost guarantee there's no access to the PDFium APIs from the limited JS context we might have in the embed object PDF viewer frame.

This is all chasing down stuff that's not important enough to justify use of your time right now, I think. If we can trigger Hypothes.is after the PDF is loaded and replace to the tab with a viewer that embeds our sidebar, that's sufficient. I don't think it's important to switch back if the sidebar is disabled. I don't think it's important to preserve scroll position. I don't think it's import to worry about POST requests. PDF.js doesn't even support them. A comment in the source points to http://crbug.com/104058

Close this and move on.

csillag commented 10 years ago

PDF.js for Chromium intercepts the load request and redirects to the viewer with the query string so that the viewer can fetch the PDF over XHR.

Actually, they support several methods to intercept the loading process, based on various browser capabilities and permission. What you described is one loading method, but there are others, too.

The webRequest API does not seem to provide the response body content in any of the callbacks.

That's true, but the streamsPrivate API does, and PDF.js is ready to latch on that API, if it's enabled by Chrome.

Also, we'll see if anyone on that mailing list can tell us otherwise but [...]

Yeah, that's why I asked.

This is all chasing down stuff that's not important enough to justify use of your time right now, I think. If we can trigger Hypothes.is after the PDF is loaded and replace to the tab with a viewer that embeds our sidebar, that's sufficient.

That's implemented now. (See PR #1417, experimental build available here.)

I don't think it's important to switch back if the sidebar is disabled.

That's implemented now, too.

I don't think it's important to preserve scroll position.

Yeah, I agree. It's only a "nice to have".

I don't think it's import to worry about POST requests. PDF.js doesn't even support them.

It kind of does, since https://github.com/mozilla/pdf.js/commit/cdadb0db4d5eaaa2d7bd62a5551d191eeb598ef4.

Close this and move on.

Can I wait for answers on the mailing list first?

tilgovi commented 10 years ago

Can I wait for answers on the mailing list first?

If you insist. But I think I just did the work for you in a couple minutes:

It seems the CJS_Document is stored using JS_GetPrivate and JS_SetPrivate, meaning it's private data associated with the lifecycle of the JS runtime but not actually exported as a JS object.

Nothing in ./fpdfsdk/src/javascript/PublicMethods.cpp exports anything but some helpers, mostly for dates and parameter validation.

Poke around a bit more, and it seems that ./fpdfsdk/src/javascript is actually the JS API for PDF files. It defines all the builtins you might expect, like setTimeout, but also things that are obviously for PDF specifically, like addAnnot (annotations!), which is a PDF runtime function for JS inside the PDFs.

So, I don't see where else in this source tree there might be any code which exports stuff into the "document". All the initialization of a "Document" object appears to be for the Document context within the PDF. None of this stuff is exported to window.document we have access to. Nothing is exported to the window or the document. The embed object itself has no extra properties or methods.

tilgovi commented 10 years ago

Did you do any of this background research before posting your question?

csillag commented 10 years ago

On 2014-08-20 00:09, Randall Leeds wrote:

So, I don't see where else in this source tree there might be any code which /exports/ stuff into the "document". All the initialization of a "Document" object appears to be for the Document context /within/ the PDF. None of this stuff is exported to |window.document| we have access to.

For all I care, it can also be accessible via a chrome. API, too.

csillag commented 10 years ago

On 2014-08-20 00:11, Randall Leeds wrote:

Did you do any of this background research before posting your question?

I checked out the HTML5 embed specification http://www.w3.org/TR/html5/embedded-content-0.html#the-embed-element, and found that

The |HTMLEmbedElement
<http://www.w3.org/TR/html5/embedded-content-0.html#htmlembedelement>|
object representing the element must expose the scriptable interface
of the plugin
<http://www.w3.org/TR/html5/infrastructure.html#plugin> instantiated
for the |embed
<http://www.w3.org/TR/html5/embedded-content-0.html#the-embed-element>|
element, if any.

I also checked the documentation https://code.google.com/p/pdfium/wiki/Documentation of PDFium, which recommended this API doc http://cdn01.foxitsoftware.com/pub/foxit/manual/enu/FoxitPDF_SDK20_Guide.pdf although I suspect it's one layer lower than what we want.

I did not check whether or not Chromium's PDF plugin provides bridges to connect the two sides; that's what I asked on the list.

I am not too familiar with this specific layer of web technologies (JS api implementations inside Chromium), so I thought that it's more efficient to shoot a short question towards those who are than to learn about it right now.

tilgovi commented 10 years ago

Well, having written C++ extension modules for nodejs and worked a good bit on the couchjs map/reduce implementation for CouchDB that uses SpiderMonkey, I'm not unfamiliar with the patterns used by C/C++ code to interact with these runtimes and as far as I can tell nothing gets exported by PDFium.

It's possible the Chrome team has additional code not included in PDFium, and indeed I can't even see where or how the embed object is defined, but then it's undocumented and we shouldn't use it anyway.

If the embed element exported any scripting API, it would be visible by a simple inspection of the properties of the element, but there is none.

It's true that it's more efficient to ask a question sometimes, but don't discount the knowledge on your own team.

tilgovi commented 10 years ago

The streamsPrivate API you mention that PDF.js has implemented it also not a public, documented, exposed feature of Chrome/Chromium right now.

I'm closing this issue. If by some chance you get a response on the mailing list you can re-open it.

csillag commented 10 years ago

It's possible the Chrome team has additional code not included in PDFium, and indeed I can't even see where or how the embed object is defined,

From what I see, it's more like this: the Chromium team has built the PDF plugin on top of PDFium. They added a whole new layer. See their code here. I assume they defined the embed object, etc. (And they might also exported a bunch of things.)

but then it's undocumented and we shouldn't use it anyway.

I am not even sure if there is a designated official documentation for the PDF plugin, so I think the meaning of "undocumented" is a bit vague here.

csillag commented 10 years ago

The streamsPrivate API you mention that PDF.js has implemented it also not a public, documented, exposed feature of Chrome/Chromium right now.

I would say let that be the problem of PDF.js. If they have built support for it, we might as well use it. It's an optional feature for them, with fallback in place for when it fails; it could be the same for us.


Anyway, as I wrote in the OP, these are all "nice to have" features, not essential.

tilgovi commented 10 years ago

I would say let that be the problem of PDF.js. If they have built support for it, we might as well use it. It's an optional feature for them, with fallback in place for when it fails; it could be the same for us.

The request to whitelist PDF.js was marked as won't fix by the person responsible for the streamsPrivate API. https://code.google.com/p/chromium/issues/detail?id=334831

Rob even seemed to agree with this, and opened a new bug, but there has been no work on it yet. https://code.google.com/p/chromium/issues/detail?id=104058

But more to the point, in general hack on whatever you like in your own time and submit a PR. I don't want to see this tracker filled up with feature requests, let alone ones that rely on uncertain speculation about future APIs in other projects.

https://gettingreal.37signals.com/ch05_Start_With_No.php

And just to close the loop on APIs for the embedded object, nothing in the code you linked suggests that the plugin itself has any public API. There is a private API for Chromium to use, in ./src/ppapi/cpp/private/pdf.cc but there is no public API as far as I can tell.

csillag commented 10 years ago

But more to the point, in general hack on whatever you like in your own time and submit a PR. I don't want to see this tracker filled up with feature requests, let alone ones that rely on uncertain speculation about future APIs in other projects.

Can't we have placeholder issues about things that would be nice to have?

(I have no intention to hack on this; my goal with opening the ticket was to signal that there is some room for improvement here.)

tilgovi commented 10 years ago

Can't we have placeholder issues about things that would be nice to have?

Not without prior discussion. Bring it up in a meeting or on the mailing list and if we agree we want something we will make a ticket for it.

This issue tracker is full of stuff like this and it makes it a full time job to triage actual bugs. We need to stop this habit.

csillag commented 10 years ago

On 2014-08-20 02:14, Randall Leeds wrote:

Can't we have placeholder issues about things that would be nice
to have?

Not without prior discussion. Bring it up in a meeting or on the mailing list and if we agree we want something we will make a ticket for it.

How about creating a separate issue tracker, where we can open issues like this without prior discussion? I find it useful to make notes like this right away when I actually encounter these problems / opportunities.

We could discuss them at meetings or on the mailing list afterward.

This issue tracker is full of stuff like this and it makes it a full time job to triage actual bugs. We need to stop this habit.

Well, we can always search for the "bug" label ... but anyway, having a separate repo would avoid littering the main tracker with tickets like this.

tilgovi commented 10 years ago

How about creating a separate issue tracker, where we can open issues like this without prior discussion?

I already tried that at hypothesis/vision. However, that seems to have evolved into only high level stuff. I think that makes sense.

Is it so much to ask that you send an email to our mailing list instead of opening an issue?

tilgovi commented 10 years ago

Actually, looking at what's happening here, I'm going to stop responding because this thread is way off topic.

If you want to discuss this process further, go start the conversation on the mailing list. Otherwise, send emails or bring it up in meetings if you have a feature request.

I'm going to close feature requests with prejudice from here out until I see a discussion that generates consensus to do otherwise.