opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
216 stars 41 forks source link

Support Webpack, don't use `require.resolve()` #76

Open quintesse opened 1 year ago

quintesse commented 1 year ago

Is your feature request related to a problem? Please describe.

Because of the use of the function require.resolve() in this code: https://github.com/opengovsg/pdf2md/blob/master/lib/util/pdf.js#L19 the use of pdf2md in my project results in errors in production builds at runtime. The project uses WebPack to bundle the code and that particular function cannot be used.

An explanation of the problem can be found in this Webpack issue: https://github.com/webpack/webpack/issues/13931

(It's basically because once a project has been bundled, asking a module for it's filename just doesn't make sense anymore)

Describe the solution you'd like

I'd like the code to either not use that method by using an alternative or provide some kind of option/flag that avoids its use. Also, given the fact that this project is about converting a PDF file to Markdown, which is text-only, is providing font directories even useful? Would it perhaps be possible to do without?

Describe alternatives you've considered

The work-around mentioned in the Webpack issue fortunately works. It's a completely incidental fix that reduces runtime code efficiency, so I'd prefer not to use it.

LoneRifle commented 10 months ago

If you are using Next.js, another workaround suggested in https://github.com/opengovsg/pdf2md/issues/69#issuecomment-1884786365 also works: in brief, specify @opendocsg/pdf2md and pdfjs-dist under experimental.serverComponentsExternalPackages in your Next config.

LoneRifle commented 10 months ago

is providing font directories even useful? Would it perhaps be possible to do without?

I might be wrong, but pdfjs may be using the fonts to better recognise text within the PDF documents. I would be happy for you (or someone else) to see if removing the font directory would have any impact on the conversion process! If it doesn't, I'll gladly accept a PR.

quintesse commented 10 months ago

If you are using Next.js, another workaround suggested in #69 (comment) also works: in brief, specify @opendocsg/pdf2md and pdfjs-dist under experimental.serverComponentsExternalPackages in your Next config.

Can confirm that this solution seems to work as well.

LoneRifle commented 1 week ago

Please try this with v0.2.0 of this package, which removes the dependency on node-specific libs, including path.

quintesse commented 1 week ago

@LoneRifle I haven't been able to take a good look at it, but this is the error I get by simply updating to the latest version and removing the work-around:

Failed to compile.

./node_modules/unpdf/dist/index.cjs
Module not found: Package path ./pdfjs is not exported from package C:\Projects\test\node_modules\unpdf (see exports field in C:\Projects\test\node_modules\unpdf\package.json)

https://nextjs.org/docs/messages/module-not-found

Import trace for requested module:
./node_modules/@opendocsg/pdf2md/lib/util/pdf.js
./node_modules/@opendocsg/pdf2md/lib/pdf2md.js
...

> Build failed because of webpack errors

If I re-add the work-around (the one where you add serverComponentsExternalPackages to the next.config.js file) it at least compiles. I haven't had time yet to test if it still works.