mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

PDFDocumentLoadingTask.destroy() causes other tasks running on the same worker to never complete #16777

Closed vanaigr closed 1 year ago

vanaigr commented 1 year ago

"Solution": awaiting on destroy() or not using workerPort

PDF file: the issue happens with any PDF file; one PDF is embedded in the example below.

Configuration:

Steps to reproduce the problem:

Here is an HTML page that, when opened, reproduces the issue:

<html>
    <head>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.9.179/pdf.js"></script>
        <script>
//pdfjsLib.GlobalWorkerOptions.workerSrc = 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.9.179/pdf.worker.js';
const workerCodeURL = URL.createObjectURL(new Blob(
    ['importScripts("https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.9.179/pdf.worker.js")'],
    { type: 'text/javascript' }
)) // wrap the CDN script in a same-origin blob URL, since Worker URLs must satisfy the same-origin policy
const p = pdfjsLib.GlobalWorkerOptions.workerPort = new Worker(workerCodeURL);
URL.revokeObjectURL(workerCodeURL)

const pdf64 = 'JVBERi0xLjcKJYGBgYEKCjYgMCBvYmoKPDwKL0ZpbHRlciAvRmxhdGVEZWNvZGUKL0xlbmd0aCAxMjgKPj4Kc3RyZWFtCnic7cxBCoJQFIXh+VnFHQviM6T3BHEgBA2cBHcDIa8ocmCI6w9rCw1/Dvyjw7docAXb976rOufXltfHdC1jaFOTQkytHRrzm/aOqr/XX31Wd6x786e80Ml10QIHBwcHBwcHBwcHBwcHBwcHBwcHBwcHBwcHBwcHB/dv7gOFSNz/CmVuZHN0cmVhbQplbmRvYmoKCjcgMCBvYmoKPDwKL0ZpbHRlciAvRmxhdGVEZWNvZGUKL1R5cGUgL09ialN0bQovTiA1Ci9GaXJzdCAyNgovTGVuZ3RoIDM1OQo+PgpzdHJlYW0KeJzVUt9LwzAQfs9fcY/6IEnTrj9kDLa1VZChbIKi+JC1YVRGIm0q87/3Lu0cexCfhX4kd/dd8qX3BSBAQhRBCEkKEUxCCRNIRAbTKeOPXx8a+IPa6Y7xu6bu4BU5AtbwxvjS9sZBwGYzduIulVN7u2NDEwREPjIeWlv3lW5hWhZlKUQihIgjRCyEzHFdIjKExBhrMsU9IolGYC4JhQjnWCsHxMnQQ3XPnYz9Ba7IjYmTD9woHeKfe+muYjhD/qUnmzG+snWunIaL/FoKGZKYIA5jmb1c4u9otXL2/z7O62+s+fWFZ3Om8dKQW00e8FPma93Zvq1w7MQrLVZoc6v3n9o1lbpCW6WoM0kz9NhoDP58v33XladSWBzczcaRhiFBuZWuG7WwB3SfIL+G9KED58ZYR570bjQOtVAUjw49E0xyGN/0W+dDSgaML1SnvdCTSpRgKls3Zgf8qTFz0zXHBJ34DdCWxY4KZW5kc3RyZWFtCmVuZG9iagoKOCAwIG9iago8PAovU2l6ZSA5Ci9Sb290IDIgMCBSCi9JbmZvIDMgMCBSCi9GaWx0ZXIgL0ZsYXRlRGVjb2RlCi9UeXBlIC9YUmVmCi9MZW5ndGggNDIKL1cgWyAxIDIgMiBdCi9JbmRleCBbIDAgOSBdCj4+CnN0cmVhbQp4nGNgYPj/n4mBnYEBRDCCCCYQwQwiWBgZBBgYGBluAgmmZQwMAGJcA8oKZW5kc3RyZWFtCmVuZG9iagoKc3RhcnR4cmVmCjY3OAolJUVPRg=='
const pdf = Uint8Array.from(atob(pdf64), c => c.charCodeAt(0))

function copy() {
    let data = new Uint8Array(pdf.byteLength);
    data.set(pdf);
    return data
}

(async() => {
    const p1 = { data: copy() }, p2 = { data: copy() }

    console.log('task1 started')
    const task1 = pdfjsLib.getDocument(p1);
    await task1.promise
    console.log('task1 complete')
    /*no await*/ task1.destroy().then(() => console.log('task1 destroyed') )

    console.log('task2 started')
    const task2 = pdfjsLib.getDocument(p2);
    await task2.promise
    console.log('task2 complete')
})()
        </script>
    </head>
    <body></body>
</html>

Console output would almost always be this:

task1 started
task1 complete
task2 started
task1 destroyed

This means that the destruction of task1 interferes with the creation of task2, causing task2's promise to never resolve.

Debugging the code, it looks like this is essentially what's happening:

  1. destroy() on task1 is called; the task's WorkerTransport (with MessageHandler 'd0') sends a termination message to the corresponding WebWorker's MessageHandler 'd0_worker' and waits for 'd0' to receive the response before continuing destruction (api.js:2516)
  2. getDocument() is called; the new task reuses the previous task's PDFWorker (api.js:2283, api.js:363)
  3. The new task sends the document 'd1' to the WebWorker's 'worker' handler and returns a promise that waits for the previous task's 'main' MessageHandler (the one inside PDFWorker) to receive the name of the new WebWorker MessageHandler (which would be 'd1_worker') (api.js:410, api.js:456)
  4. The WebWorker's 'd0_worker' receives the termination message from step 1, terminates, and sends back the response (worker.js:791). Then the WebWorker's 'worker' receives the message from step 3, creates MessageHandler 'd1_worker', and sends its name back (worker.js:90)
  5. 'd0' receives the response (termination successful) from 'd0_worker' and removes its callback from the WebWorker; then the PDFWorker is destroyed, which in turn removes its 'main' callback from the WebWorker. The destruction completes (api.js:633, api.js:2270)
  6. The second message from step 4 (carrying the name of the new WebWorker MessageHandler) is never received, because all of the callbacks were removed from the WebWorker, so the promise from step 3 is never resolved

In conclusion: the PDFWorker callbacks responsible for resolving the new task's promise are removed before they can do so.

I also want to note that the 'main' MessageHandler is destroyed with nonempty callbackCapabilities, which probably doesn't happen under normal circumstances, so this might be a place where a check that warns about such behavior could be added (shared/message_handler.js:531).
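
For illustration, assuming callbackCapabilities is a plain object keyed by callback id, such a check might look roughly like this (a sketch only, not the actual shared/message_handler.js code):

// Sketch of a possible check inside MessageHandler: warn if the handler is
// torn down while callbacks are still waiting for replies from the other side.
destroy() {
    const pending = Object.keys(this.callbackCapabilities).length;
    if (pending > 0) {
        console.warn('MessageHandler "' + this.sourceName + '" destroyed with ' +
            pending + ' pending callback(s); their promises will never settle.');
    }
    this.comObj.removeEventListener('message', this._onComObjOnMessage);
}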

Adding my own event listener to the WebWorker

p.addEventListener('message', it => console.log(it.data.sourceName, '->', it.data.targetName + ':', it.data.data))

results in the following:

task1 started
worker -> main: null             # WebWorker is ready event
worker -> main: d0_worker        # 'd0_worker' is created
d0_worker -> d0: {length: 883}
d0_worker -> d0: {pdfInfo: {…}}
task1 complete
task2 started
d0_worker -> d0: undefined       # 'd0_worker' finished terminating
task1 destroyed                  # PDFWorker listener is removed
worker -> main: d1_worker        # 'd1_worker' event is received on the application side but not processed

Sidenote: the only thing I could not figure out is where the message carrying the new WebWorker MessageHandler's name goes by default. If I add the listener after task1 has finished destroying, it is not called, meaning the message is effectively lost. It looks like WebWorker messages are not queued: if no listeners are present, the messages really are lost, because this:

const blob = new Blob([`postMessage('lost'); setTimeout(() => postMessage('received'), 1000)`], { type: "text/javascript" })
const worker = new Worker(window.URL.createObjectURL(blob));
setTimeout(() => worker.addEventListener('message', e => console.log(e.data)), 500)

produces just 'received', and not 'lost' followed by 'received'.
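
As an aside (standard platform behavior, not pdf.js-specific): a dedicated Worker's implicit message port is started automatically, so events dispatched while no listener is attached are dropped. An explicit MessageChannel port, by contrast, buffers messages until onmessage is assigned (or start() is called), so the same late-listener pattern does deliver the message:

const { port1, port2 } = new MessageChannel();
port2.postMessage('buffered');
// assigning onmessage starts the port, so the queued message is delivered
setTimeout(() => { port1.onmessage = e => console.log(e.data); }, 500); // logs 'buffered'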

Going back to the main issue, it can occur only if these conditions are met:

  1. Multiple PDFDocumentLoadingTasks exist at the same time
  2. Multiple PDFDocumentLoadingTasks use the same WebWorker

I looked through the examples, the docs, and the api file, but couldn't find any information that either allows or forbids having multiple tasks at the same time.

As far as I tested, code meeting only one of these conditions works fine: using the same WebWorker for multiple tasks (with workerPort) works as long as only one task is active at a time, and creating multiple tasks with different workers also works without any warnings or errors:

const task1 = pdfjsLib.getDocument(p1)
const task2 = pdfjsLib.getDocument(p2)
const pdf1 = task1.promise.then(it => it.getPage(1)).then(it => it.getTextContent())
const pdf2 = task2.promise.then(it => it.getPage(1)).then(it => it.getTextContent())
console.log(await pdf1, await pdf2)

Also, looking around the internet (mostly Stack Overflow), it seems that nobody awaits destroy(), which could lead to condition 1.

The only hint I found that condition 1 is not allowed is that the more advanced applications (app.js and the mobile viewer) always wait for task destruction before creating the next task.
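
For reference, a minimal sketch of that pattern (an assumption on my part, not the viewers' actual code): keep at most one loading task per shared worker and await the previous task's destroy() before starting the next one.

let currentTask = null;

async function openDocument(params) {
    if (currentTask) {
        // wait until the previous document is fully torn down on the worker
        await currentTask.destroy();
    }
    currentTask = pdfjsLib.getDocument(params);
    return currentTask.promise;
}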

I hope I got the line numbers right and didn't miss some obvious place in the text that says that 1 and/or 2 are not allowed to happen.

Snuffleupagus commented 1 year ago

awaiting on destroy() or not using workerPort

Either of those seems like the "correct" solution here, depending on your exact situation. Please note that the method is documented as being asynchronous, hence having to await its completion may perhaps be somewhat expected? https://github.com/mozilla/pdf.js/blob/0725b6299f122632a7d5b97f82fe584ae1584c5d/src/display/api.js#L622-L636
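
For example, a rough sketch of the second option could give each loading task its own PDFWorker via getDocument's documented worker parameter (workerCodeURL below stands for a same-origin worker script URL such as the blob URL created in the repro; this is only an illustration, not code from this issue):

// Sketch: one PDFWorker, and thus one Web Worker, per loading task, so the
// tasks never share a MessageHandler with each other.
const makeWorker = () => new pdfjsLib.PDFWorker({ port: new Worker(workerCodeURL) });

const task1 = pdfjsLib.getDocument({ data: copy(), worker: makeWorker() });
const task2 = pdfjsLib.getDocument({ data: copy(), worker: makeWorker() });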

Please also note that for performance reasons it's not really recommended to parse more than one document at a time in each worker-thread, since otherwise multiple PDF documents would unnecessarily "compete" for parsing resources which will slow things down.

Having (quickly) read through https://github.com/mozilla/pdf.js/issues/16777#issue-1829956884 it's unfortunately not immediately clear (at least to me) if there's a simple solution to fix the current situation. To ensure maintainability of the relevant code, we want to avoid introducing too much complexity in this part of the API.

Please note that GlobalWorkerOptions.workerPort is quite old functionality, and this is the first time that this particular issue has been reported[1]. While we could probably let getDocument throw if the GlobalWorkerOptions.workerPort is being used by another (non-destroyed) document, my worry is that we'd end up breaking things for more users in that way.


[1] Possibly because most users don't utilize GlobalWorkerOptions.workerPort, or at least not in parallel like this.
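
As a rough illustration of that idea only (an application-side guard with made-up names, not a proposed patch):

// Refuse to reuse the shared workerPort while a previous,
// non-destroyed document still occupies it.
const busyPorts = new Set();

function guardedGetDocument(params) {
    const port = pdfjsLib.GlobalWorkerOptions.workerPort;
    if (port && busyPorts.has(port)) {
        throw new Error('GlobalWorkerOptions.workerPort is already used by a non-destroyed document.');
    }
    const task = pdfjsLib.getDocument(params);
    if (port) {
        busyPorts.add(port);
        const originalDestroy = task.destroy.bind(task);
        // release the port only once destruction has actually completed
        task.destroy = () => originalDestroy().finally(() => busyPorts.delete(port));
    }
    return task;
}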

Snuffleupagus commented 1 year ago

Possibly we could delay destruction of the global workerPort until all its associated loadingTasks are actually done, e.g. something along these lines, however:
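
The linked change itself isn't reproduced here; purely as a generic sketch of the "delay destruction until all associated tasks are done" idea, with invented helper names:

// Count how many loading tasks still use a shared worker and only tear the
// worker down once the last of them has finished.
const useCount = new WeakMap();

function acquire(worker) {
    useCount.set(worker, (useCount.get(worker) ?? 0) + 1);
}

function release(worker) {
    const remaining = (useCount.get(worker) ?? 1) - 1;
    useCount.set(worker, remaining);
    if (remaining === 0) {
        worker.destroy(); // safe now: no other task is mid-setup on this port
    }
}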

vanaigr commented 1 year ago

I accidentally pressed the wrong button and closed the issue, sorry.

After debugging the code, I assumed there probably wouldn't be any easy fix for this, so I tried to spot a place for a warning, and a MessageHandler being destroyed while it still holds callbackCapabilities from the new task looked like a good candidate, because under normal circumstances I presume they should be empty. But I am not very familiar with the code, so if that's not the case, or if adding a warning here wouldn't be worth it, then we can close this issue.

Also, in my defense for not awaiting destroy(): when I first saw this function called, I looked into the documentation to see if it even existed, noticed that it returns a Promise, and, yes, assumed that I should await its completion. But after quickly looking around the internet for small examples of its usage, I didn't find anyone awaiting it and assumed that I didn't need to.

Snuffleupagus commented 1 year ago

After debugging the code, I presumed that there probably wouldn't be any easy fix for this,

Well, I suppose that it depends on your definition of "easy fix"; please note that https://github.com/mozilla/pdf.js/issues/16777#issuecomment-1664223730 contains a link to the smallest patch that I could imagine here: https://github.com/mozilla/pdf.js/compare/master...Snuffleupagus:pdf.js:issue-16777

so I tried to spot any place for a warning, and MessageHandler destroying with callbackCapabilities from the new task looked like a good candidate,

First of all, based on experience it's very difficult to write warning messages that are clear and concise enough that users will immediately understand what's wrong and, more importantly, how to fix it. Secondly, given the "special" situation described in this issue, I unfortunately don't believe that the code (or even the file) you're referencing would be an appropriate spot for a warning message. (Also, note that there are streamSinks to consider as well.)

Snuffleupagus commented 1 year ago

@timvandermeij How do you feel about the patch I linked in https://github.com/mozilla/pdf.js/issues/16777#issuecomment-1664318010, since as mentioned it does add some (possibly unwanted) complexity?

timvandermeij commented 1 year ago

I think the patch looks reasonable. However, destroy is marked as asynchronous, so completion guarantees can only be given if it's awaited, and if it is awaited I don't think there is actually an issue here. I would say that awaiting the promise is the expected usage; I can't fully rule out edge cases if one doesn't await, which makes me prefer simply awaiting (and possibly e.g. extending the relevant documentation if possible).

But quickly looking on the internet for small examples of its usage, I didn't find anyone awaiting and assumed that I don't need to.

I think those examples work "accidentally" then, and this problem most likely won't be visible if no second document is loaded at the same time, which most users likely won't do.