unjs / unpdf

📄 Utilities to work with PDFs in Node.js, browser and workers
MIT License

Strange behavior of `getDocumentProxy`'s buffer when extracting text AND rendering page as image (only for some pdf) #17

Open ndrbrt opened 1 month ago

ndrbrt commented 1 month ago

Environment

Node.js v20.11.1, unpdf v0.11.0

Reproduction

I first hit the error in a server route of a Nuxt 3 project. The original app also performed other operations besides text/metadata extraction and image rendering.

To isolate the error, I prepared a minimal Nitro project for this issue. You can find the repo here: https://github.com/ndrbrt/unpdf-issue

Describe the bug

First of all, the issue only occurs with some PDFs. The affected ones I found all contain images, but I don't know whether this is related to #4, or whether only PDFs with images are affected.

Error A

The original code was similar to that in server/api/error-a.ts.

If you run the dev server and open, e.g.:

You get the following error:

[nitro] [request error] [unhandled] Cannot read properties of undefined (reading 'createCanvas')
  at i.constructor._createCanvas (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1552904)
  at i.constructor.create (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1399305)
  at CachedCanvases.getCanvas (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1474861)
  at CanvasGraphics.beginGroup (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1502437)
  at CanvasGraphics.executeOperatorList (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1482511)
  at InternalRenderTask._next (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1591245)
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

However, as I said, if you pass some other pdfs, everything's fine, e.g.:

Working version

The only workaround I found is the one in server/api/working.ts: I copy the original buffer before passing it to getDocumentProxy, and then pass the copy to renderPageAsImage. With this change, both requests succeed.
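One possible reason the copy helps, which this issue does not confirm, is that the first consumer takes ownership of the bytes it receives (PDF.js documents that typed arrays passed as data are generally transferred to the worker thread). The sketch below reproduces only that general mechanism with the standard structuredClone API. It does not call unpdf, and consumeByTransfer and copyBytes are hypothetical names introduced for illustration.

```typescript
// Stand-in for an API that takes ownership of its input by transferring
// the underlying ArrayBuffer. This is NOT unpdf code, only an illustration.
function consumeByTransfer(data: Uint8Array): number {
  const moved = structuredClone(data, {
    transfer: [data.buffer as ArrayBuffer],
  });
  return moved.byteLength;
}

// Hypothetical helper: returns an independent copy with its own ArrayBuffer.
function copyBytes(data: Uint8Array): Uint8Array {
  return data.slice();
}

const original = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
const spare = copyBytes(original); // copy taken BEFORE the first consumer runs

const consumedLength = consumeByTransfer(original);

// After the transfer, `original` is detached and reports a length of 0,
// while `spare` still holds the full data for a second consumer.
console.log(consumedLength, original.byteLength, spare.byteLength);
```

If something like this is happening here, the order matters: the copy has to be made before the first call, because a detached buffer can no longer be read.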

Error B

I also tried another approach in server/api/error-b.ts, passing a new Uint8Array(buffer) directly to renderPageAsImage. This way, if you open:

You get this error:

[nitro] [request error] [unhandled] Unable to deserialize cloned data.
  at LoopbackPort.postMessage (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1573782)
  at MessageHandler.sendWithPromise (./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1514035)
  at ./node_modules/.pnpm/unpdf@0.11.0/node_modules/unpdf/dist/pdfjs.mjs:1:1561726
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Interestingly, in this case, repeating the same request with text extraction disabled (via the query param) works.
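A detail that may be relevant to Error B: whether new Uint8Array(buffer) produces a copy depends on what buffer is. Given an ArrayBuffer, the constructor creates a view over the same memory. Given another typed array, it copies the elements. The sketch below shows the difference using standard JavaScript only. Which case applies in server/api/error-b.ts is an assumption to check against the repo, and the variable names are illustrative.

```typescript
const source = new ArrayBuffer(4);

// ArrayBuffer argument: a view that shares memory with `source`.
const view = new Uint8Array(source);

// Typed-array argument: a new array with its own buffer (elements copied).
const copyFromView = new Uint8Array(view);

// Explicit copy of the underlying buffer.
const copyFromSlice = new Uint8Array(source.slice(0));

// Writing through the view changes `source`, but neither copy.
view[0] = 0x25;

const sharesMemory = view.buffer === source;
const copiesAreIndependent =
  copyFromView.buffer !== source &&
  copyFromSlice.buffer !== source &&
  copyFromView[0] === 0 &&
  copyFromSlice[0] === 0;

console.log({ sharesMemory, copiesAreIndependent });
```

So if buffer is an ArrayBuffer, wrapping it in new Uint8Array(...) a second time does not protect it from whatever the first consumer did to the underlying memory.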

Additional context

I did not use the official PDF.js build because I couldn't get it to work. Instead, I used the default build bundled with unpdf, and everything worked fine until I noticed the problem described above.

Logs

No response