opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
195 stars 39 forks source link

Set pdf worker using GlobalWokerOptions.workerSrc in pdf.js #65

Closed marcelpanse closed 1 year ago

marcelpanse commented 1 year ago

Problem

Error: Setting up fake worker failed: \"No \"GlobalWorkerOptions.workerSrc\" specified.\"."

Closes #64

Solution

Setting the workerSrc in lib/util/pdf.js

LoneRifle commented 1 year ago

A quick googling shows that there are multiple possible options to set workerSrc to.

I'll close this for now, but I look forward to another PR that tries some of the approaches specified in mozilla/pdf.js#10478

marcelpanse commented 1 year ago

This was that approach suggested in that issue: https://github.com/mozilla/pdf.js/issues/10478#issuecomment-518673665 So I'm not sure what other approaches you are referring to?

LoneRifle commented 1 year ago

Could you consider https://github.com/mozilla/pdf.js/issues/10478#issuecomment-1624832454 instead?

marcelpanse commented 1 year ago

No that seems incorrect. The code he references returns the required worker, but doesn't assign it, so just importing it won't do anything, you have to assign it to the GlobalWorkerOptions.workerSrc otherwise it won't do anything.

The first comment on the issue also explains that: https://github.com/mozilla/pdf.js/issues/10478#issuecomment-456061921

LoneRifle commented 1 year ago

No that seems incorrect. The code he references returns the required worker, but doesn't assign it, so just importing it won't do anything, you have to assign it to the GlobalWorkerOptions.workerSrc otherwise it won't do anything.

The first comment on the issue also explains that: mozilla/pdf.js#10478 (comment)

See https://github.com/mozilla/pdf.js/issues/10478#issuecomment-1560704162 . tldr, the import will establish the fallback mechanism that pdf.js uses in the event that GlobalWorkerOptions.workerSrc is missing.

marcelpanse commented 1 year ago

Ah yes, if you have a window object. I'm running in Node, there is no window object, that was why it wasn't working in the first place :-)

LoneRifle commented 1 year ago

That's odd, since I'm also running this on Node, and don't observe the problems you have. Do you have a reproducible test case? Which version of node are you using?

marcelpanse commented 1 year ago

I'm using node 18 on OSX (arm64). I can't reproduce it in a standalone node app, I think it has something to do with the fact we compile with esbuild. Esbuild might remove some implicit dependencies. I'll look into it more later when I have some more time.

marcelpanse commented 1 year ago

Yeah so the problem is esbuild. I guess the default fallback mechanism that dynamically requires some file doesn't work because esbuild tree-shakes everything away.

Here is the example with esbuild that doesn’t work: https://codesandbox.io/p/sandbox/proud-silence-yn6mhx?file=%2Fpackage.json%3A9%2C5

This is the same code but with my PR that fixes it: https://codesandbox.io/p/sandbox/pdf2md-fork-zqq6j5?file=%2Findex.js%3A10%2C45

Disclaimer: I'm literally using pdf.js for 1 day, so I'm far from an expert or capable of judging if this is the best approach ;-)