vercel / next.js

The React Framework
https://nextjs.org
MIT License

Import pdfjs-dist not working correctly #58313

Closed Luluno01 closed 7 months ago

Luluno01 commented 9 months ago

Link to the code that reproduces this issue

https://github.com/Luluno01/pdfjs-dist-import-reproducer

To Reproduce

  1. Start the application in development mode (next dev)
  2. Open home page (/)
  3. The dev server console shows the error "Attempted import error: 'getDocument' is not exported from 'pdfjs-dist' (imported as 'pdfjs')." and getDocument is undefined.

Current vs. Expected behavior

The ESM package pdfjs-dist should be imported correctly. The actual outcome, however, is that nothing is imported: all exported objects are undefined, including the default export.

Verify canary release

Provide environment information

Operating System:
  Platform: win32
  Arch: x64
  Version: Windows 11 Pro
Binaries:
  Node: 21.1.0
  npm: N/A
  Yarn: N/A
  pnpm: N/A
Relevant Packages:
  next: 14.0.3-canary.1
  eslint-config-next: N/A
  react: 18.2.0
  react-dom: 18.2.0
  typescript: 5.1.3
Next.js Config:
  output: N/A

Which area(s) are affected? (Select all that apply)

App Router, TypeScript (plugin, built-in types)

Additional context

Same problem with version "13.5.4" and version "13.0.0".

Luluno01 commented 9 months ago

Note that importing pdfjs-dist directly in a plain JS file without any bundler works without a problem.

https://github.com/Luluno01/pdfjs-dist-import-reproducer/blob/main/expected.js

Luluno01 commented 9 months ago

Looks like it has something to do with webpack. I found a temporary workaround, which is using dynamic import() with /*webpackIgnore: true*/. Not sure if this is a good practice but at least it works locally.

AChangXD commented 9 months ago

@Luluno01 I ran into this issue just now, and yeah the imports shows up as undefined for me as well (I'm on version 14)

AChangXD commented 9 months ago

@Luluno01 I tried using import() and am running into the same error. Can you post a snippet on how you are importing/calling the function? (Note I'm using it in an API request)

My code:

const pdfJs = await import('pdfjs-dist');

export async function POST(req: Request, res: Response) {
  console.log(typeof pdfJs);
  console.log(typeof pdfJs.getDocument);
  // ...
}

AChangXD commented 9 months ago

@Luluno01 Even though typeof shows that it's an object, printing Object.keys(pdfJs) gives an empty array.
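For anyone reproducing this, a small check along these lines can make the symptom explicit. It is only a sketch: describeModule and ModuleReport are made-up names, and the empty object below merely stands in for the broken module namespace described above.

```typescript
// Hypothetical diagnostic helper; `ns` stands in for the imported module namespace.
interface ModuleReport {
  typeOf: string;
  exportNames: string[];
  looksEmpty: boolean;
}

function describeModule(ns: object): ModuleReport {
  const exportNames = Object.keys(ns);
  return { typeOf: typeof ns, exportNames, looksEmpty: exportNames.length === 0 };
}

// An empty object mimics the reported symptom: typeof is "object", yet there are no exports.
const broken = describeModule({});
// A namespace with a getDocument export is what a working import would look like.
const healthy = describeModule({ getDocument: () => undefined });

console.log(broken, healthy);
```

In a route handler, the same helper could be called with the result of await import('pdfjs-dist') to log whether any exports arrived.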

AChangXD commented 9 months ago

I even added https://github.com/mozilla/pdfjs-dist manually into my project, same error, something to do with the imports for sure

Luluno01 commented 9 months ago

@Luluno01 I tried using import() and am running into the same error. Can you post a snippet on how you are importing/calling the function? (Note I'm using it in an API request)

My code:

const pdfJs = await import('pdfjs-dist');

export async function POST(req: Request, res: Response) {
  console.log(typeof pdfJs);
  console.log(typeof pdfJs.getDocument);
  // ...
}

I did the same as you and got the same result. Then I added the magic comment /* webpackIgnore: true */ inside the import statement to prevent webpack from recursing into it and bundling nothing. It turns out, however, that forcing webpack to ignore the dynamic import does not work after deploying to Vercel, because the deployment doesn't ship node_modules at all.
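A minimal sketch of the magic-comment approach described above, for reference. importUnbundled is a hypothetical helper name, not something from the reproducer; the magic comment only matters to webpack, and other runtimes treat it as an ordinary comment.

```typescript
// Hypothetical helper: load a module at runtime without letting webpack bundle it.
// Because webpack skips this import, the module must exist on disk where the code runs.
async function importUnbundled(specifier: string): Promise<Record<string, unknown>> {
  return import(/* webpackIgnore: true */ specifier);
}

// Possible usage inside an async route handler (assumes pdfjs-dist is installed at runtime):
// const pdfjs = await importUnbundled('pdfjs-dist');
```

This matches the limitation noted above: it can work locally, where node_modules is present, and fail in a deployment that does not ship it.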

Luluno01 commented 9 months ago

I even added https://github.com/mozilla/pdfjs-dist manually into my project, same error, something to do with the imports for sure

Found this. Not a big fan of webpack but I tried to follow the settings in the example provided. Still no luck.

AChangXD commented 9 months ago

I even added https://github.com/mozilla/pdfjs-dist manually into my project, same error, something to do with the imports for sure

Found this. Not a big fan of webpack but I tried to follow the settings in the example provided. Still no luck.

I created a brand new node project and everything works, so this is an issue with how next.js/webpack bundle the different modules.

AChangXD commented 9 months ago

Interesting thing is I think everything works in the pages router

Luluno01 commented 9 months ago

I even added https://github.com/mozilla/pdfjs-dist manually into my project, same error, something to do with the imports for sure

Found this. Not a big fan of webpack but I tried to follow the settings in the example provided. Still no luck.

I created a brand new node project and everything works, so this is an issue with how next.js/webpack bundle the different modules.

Interesting thing is I think everything works in the pages router

You mean so far it ONLY works in pages router?

malikiz commented 9 months ago

Try importing like this:

import * as PDFJS from 'pdfjs-dist/build/pdf.min.mjs'

Luluno01 commented 9 months ago

Try importing like this:

import * as PDFJS from 'pdfjs-dist/build/pdf.min.mjs'

Interesting, it does make some difference, but results in another error. The result is the same as installing and importing the CommonJS version directly from the repo. While it no longer imports nothing, the library complains:

Error: Setting up fake worker failed: "Cannot find module './pdf.worker.mjs'".

According to the official example, we should add pdf.worker as an entry so that webpack splits it into a separate chunk. Unfortunately, I ran into a webpack error "Error: Entry pdf.worker depends on main, but this entry was not found" after adding the entry pdf.worker. Not sure why it depends on "main" and what "main" is supposed to be. Would you mind sharing a minimal working example of next.config.js?

AChangXD commented 9 months ago

I got the worker error as well. I think import { getDocument } from 'pdfjs-dist' is the officially recommended way? Re webpack splitting, I haven't the slightest clue lol, never really messed around with it before. I'd really hate to split this PDF processing into its own microservice lol

Luluno01 commented 9 months ago

Interesting thing is I think everything works in the pages router

Interesting. I'm pretty sure it has everything to do with webpack. But I'm not familiar with webpack stuff... Still struggling to figure out how to configure webpack to make it work with app router.

Luluno01 commented 9 months ago

I got the worker error as well. I think import { getDocument } from 'pdfjs-dist' is the officially recommended way? Re webpack splitting, I haven't the slightest clue lol, never really messed around with it before. I'd really hate to split this PDF processing into its own microservice lol

Yeah, me too. I ended up reimplementing the PDF processing API endpoint with Cloud Functions, which doesn't use a bundler but directly runs the compiled TypeScript output (or your JS code as-is). Really ugly workaround.

Luluno01 commented 9 months ago

I got the worker error as well. I think import { getDocument } from 'pdfjs-dist' is the officially recommended way? Re webpack splitting, I haven't the slightest clue lol, never really messed around with it before. I'd really hate to split this PDF processing into its own microservice lol

If I still remember my experiments correctly, import { getDocument } from '...' results in undefined no matter if you import it from 'pdfjs-dist' or 'pdfjs-dist/build/pdf.min.mjs'. Only import * as pdfjs from '...' gets a chance to work.

AChangXD commented 9 months ago

also tried raw-loader as suggested by some,

I got the worker error as well. I think import { getDocument } from 'pdfjs-dist' is the officially recommended way? Re webpack splitting, I haven't the slightest clue lol, never really messed around with it before. I'd really hate to split this PDF processing into its own microservice lol

Yeah, me too. I ended up reimplementing the PDF processing API endpoint with Cloud Functions, which doesn't use a bundler but directly runs the compiled TypeScript output (or your JS code as-is). Really ugly workaround.

Going to have to do the same thing. I think the team at Vercel should also look at other libraries with pdfjs-dist as a dependency; I was using pdf-to-png-converter. I did see something about using raw-loader, but it didn't seem to do anything. Here's my webpack config:

/** @type {import('next').NextConfig} */
const nextConfig = {
  experimental: {
    esmExternals: true,
  },
  webpack: (config) => {
    config.module.rules.push({
      test: /\.node/,
      use: 'raw-loader',
    });
    config.resolve.alias.canvas = false;
    config.resolve.alias.encoding = false;
    return config;
  },
};

export default nextConfig;

Also could you link the doc where pdf.worker needs to be split into its own chunk?

AChangXD commented 9 months ago

I think it's also important to clarify that pdfjs-dist could be used in BOTH React and any API routes, not sure if that causes any difference in behavior.

Luluno01 commented 9 months ago

Also could you link the doc where pdf.worker needs to be split into its own chunk?

I just found that pdf.worker actually doesn't need to be split into a separate chunk. I looked into the webpack.config.js of the official example, which declares an entry that points to the worker source file. That's why I thought it would pass the file path to a real Worker constructor. Since Worker expects a path to a real file, the worker source file should be bundled as a separate chunk.

Later I inspected the source code: https://github.com/mozilla/pdfjs-dist/blob/master/build/pdf.js#L2031. It is const worker = await import(/* webpackIgnore: true */ this.workerSrc); in the npm distributed version, and await import(this.workerSrc) (without the magic comment) in the minified pdf.min.mjs. So it seems that in a Node.js environment, the worker is imported into the main thread with a dynamic import instead of being started as a worker thread.
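To compare the two builds without reading them by eye, a quick check like the following could be run over the file contents. This is a sketch under assumptions: hasIgnoredDynamicImport is a hypothetical helper, and the two sample strings are just the snippets quoted above, not whole files.

```typescript
// Hypothetical check: does a source string contain a dynamic import marked with webpackIgnore?
const IGNORED_IMPORT = /import\(\s*\/\*\s*webpackIgnore:\s*true\s*\*\/[^)]*\)/;

function hasIgnoredDynamicImport(source: string): boolean {
  return IGNORED_IMPORT.test(source);
}

// The two forms quoted in the comment above.
const distributed = 'const worker = await import(/* webpackIgnore: true */ this.workerSrc);';
const minified = 'await import(this.workerSrc)';

console.log(hasIgnoredDynamicImport(distributed)); // true
console.log(hasIgnoredDynamicImport(minified)); // false
```

When the comment is present, webpack leaves the import alone and the worker file has to be resolvable at runtime; when it is absent, webpack gets a chance to process the import.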

Luluno01 commented 9 months ago

I think it's also important to clarify that pdfjs-dist could be used in BOTH React and any API routes, not sure if that causes any difference in behavior.

Yes, you are right. And my use case is server-side PDF file processing.

Luluno01 commented 9 months ago

Okay, I managed to get it to work by adding an ugly hint for webpack: await import('pdfjs-dist/build/pdf.worker.mjs') after importing with import * as pdfjs from 'pdfjs-dist/build/pdf.min.mjs'. Confirmed to work by deploying on Vercel. I'm adding a new branch to the reproducer...

@AChangXD

AChangXD commented 9 months ago

Okay, I managed to get it to work by adding an ugly hint for webpack: await import('pdfjs-dist/build/pdf.worker.mjs') after importing with import * as pdfjs from 'pdfjs-dist/build/pdf.min.mjs'. Confirmed to work by deploying on Vercel. I'm adding a new branch to the reproducer...

@AChangXD

Interesting, I can't get 'pdfjs-dist/build/pdf.min.mjs' to import without TS complaining. With //@ts-ignore, I get Attempted import error: 'getDocument' is not exported from 'pdfjs-dist/build/pdf.mjs' (imported as 'pdfjs').

AChangXD commented 9 months ago

Okay, I managed to get it to work by adding an ugly hint for webpack: await import('pdfjs-dist/build/pdf.worker.mjs') after importing with import * as pdfjs from 'pdfjs-dist/build/pdf.min.mjs'. Confirmed to work by deploying on Vercel. I'm adding a new branch to the reproducer...

@AChangXD

I'll try your workaround when you add the new branch, in the meantime I'm going to see if it works in create-t3-app and trpc

Luluno01 commented 9 months ago

Interesting, I can't get 'pdfjs-dist/build/pdf.min.mjs' to import without TS complaining. With //@ts-ignore, I get Attempted import error: 'getDocument' is not exported from 'pdfjs-dist/build/pdf.mjs' (imported as 'pdfjs').

Just add declare module 'pdfjs-dist/build/pdf.min.mjs' { export * from 'pdfjs-dist' } to get TypeScript working again.

Luluno01 commented 9 months ago

I'll try your workaround when you add the new branch, in the meantime I'm going to see if it works in create-t3-app and trpc

Here you are: https://github.com/Luluno01/pdfjs-dist-import-reproducer/commit/82c44393cd1edac7b58264bc56d26b85020d82c5

AChangXD commented 9 months ago

Here you are: Luluno01/pdfjs-dist-import-reproducer@82c4439

OMG you are a genius!! I added an API endpoint and it also works:

import { NextResponse } from 'next/server';
import * as pdfjs from 'pdfjs-dist/build/pdf.min.mjs';
await import('pdfjs-dist/build/pdf.worker.min.mjs');

export async function POST(req: Request, res: Response) {
  const pdf = await pdfjs.getDocument(
    'https://www.africau.edu/images/default/sample.pdf'
  ).promise;
  const page = await pdf.getPage(1);
  const textContent = await page.getTextContent();
  return NextResponse.json({ message: textContent }, { status: 200 });
}

On my end it does give me a warning about a font issue. Not sure if it's an import-related issue, but I'm getting my results! Warning: fetchStandardFontData: failed to fetch file "LiberationSans-Regular.ttf" with "UnknownErrorException: The standard font "baseUrl" parameter must be specified, ensure that the "standardFontDataUrl" API parameter is provided.".

Thanks a lot!
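The warning above names the standardFontDataUrl parameter, so one thing to try is passing it along with the document source. The sketch below only builds the parameter object; withStandardFonts and DocumentParams are hypothetical names, and both the font directory path and the trailing slash are assumptions that should be checked against the installed pdfjs-dist version.

```typescript
// Hypothetical shape covering only the two fields used here.
interface DocumentParams {
  url: string;
  standardFontDataUrl: string;
}

// Build getDocument parameters; a trailing slash is appended on the assumption
// that the library joins this value directly with the font file name.
function withStandardFonts(url: string, fontDir: string): DocumentParams {
  const standardFontDataUrl = fontDir.endsWith('/') ? fontDir : `${fontDir}/`;
  return { url, standardFontDataUrl };
}

const params = withStandardFonts(
  'https://www.africau.edu/images/default/sample.pdf',
  'node_modules/pdfjs-dist/standard_fonts' // assumed location of the bundled fonts
);

console.log(params);
```

The resulting object would then be passed as pdfjs.getDocument(params) in place of the bare URL string.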

AChangXD commented 9 months ago

Also for future folks who may stumble on this error message when using another package that depends on pdfjs-dist: { message: 'The API version "3.11.174" does not match the Worker version "4.0.189".', name: 'UnknownErrorException', details: 'Error: The API version "3.11.174" does not match the Worker version "4.0.189".' } - You'd have to uninstall pdfjs-dist and install the correct version (3.11.174) in this case.
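The mismatch above boils down to two copies of pdfjs-dist at different versions ending up in one project. The snippet below is a rough illustration of that kind of check, not the library's actual code; versionMismatch is a hypothetical helper that reuses the wording of the error message quoted above.

```typescript
// Hypothetical helper: return an error message when the two versions differ, otherwise null.
function versionMismatch(apiVersion: string, workerVersion: string): string | null {
  if (apiVersion === workerVersion) {
    return null;
  }
  return `The API version "${apiVersion}" does not match the Worker version "${workerVersion}".`;
}

console.log(versionMismatch('3.11.174', '4.0.189'));
console.log(versionMismatch('3.11.174', '3.11.174'));
```

Running npm ls pdfjs-dist shows which versions are installed and which package pulled each one in.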

Luluno01 commented 9 months ago


On my end it does give me a warning about a font issue. Not sure if it's an import-related issue, but I'm getting my results! Warning: fetchStandardFontData: failed to fetch file "LiberationSans-Regular.ttf" with "UnknownErrorException: The standard font "baseUrl" parameter must be specified, ensure that the "standardFontDataUrl" API parameter is provided.".

Thanks a lot!

Yeah, I'm also getting some bizarre warnings. I guess this workaround is unstable and not recommended. While it works in my other project after deploying, it fails in the deployment of the exact workaround branch. And the error is even more bizarre: it is a segmentation fault that happens only in the deployment, with no stack trace.

AChangXD commented 9 months ago


Yeah, I'm also getting some bizarre warnings. I guess this workaround is unstable and not recommended. While it works in my other project after deploying, it fails in the deployment of the exact workaround branch. And the error is even more bizarre: it is a segmentation fault that happens only in the deployment, with no stack trace.

Running a build on Vercel right now, will see if it fails on my end too. Do you happen to also use tesseract.js for OCR? That import is giving me hell as well :(

AChangXD commented 9 months ago

@Luluno01 So building locally works perfectly, building on Vercel gives me this:


> Build error occurred
--
13:14:03.947 | Error: Collecting page data for undefined is still timing out after 2 attempts. See more info here https://nextjs.org/docs/messages/page-data-collection-timeout
13:14:03.954 | at onRestart (/vercel/path0/node_modules/next/dist/build/index.js:762:39)
13:14:03.954 | at Worker.isPageStatic (/vercel/path0/node_modules/next/dist/lib/worker.js:95:40)
13:14:03.954 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
13:14:03.954 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:140:20)
13:14:03.954 | at async /vercel/path0/node_modules/next/dist/build/index.js:959:56
13:14:03.954 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:140:20)
13:14:03.954 | at async Promise.all (index 4)
13:14:03.955 | at async /vercel/path0/node_modules/next/dist/build/index.js:892:17
13:14:03.955 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:140:20)
13:14:03.955 | at async /vercel/path0/node_modules/next/dist/build/index.js:829:124
13:14:04.001 | Error: Command "npm run build" exited with 1

Luluno01 commented 9 months ago

@Luluno01 So building locally works perfectly, building on Vercel gives me this:


> Build error occurred
--
13:14:03.947 | Error: Collecting page data for undefined is still timing out after 2 attempts. See more info here https://nextjs.org/docs/messages/page-data-collection-timeout
13:14:03.954 | at onRestart (/vercel/path0/node_modules/next/dist/build/index.js:762:39)
13:14:03.954 | at Worker.isPageStatic (/vercel/path0/node_modules/next/dist/lib/worker.js:95:40)
13:14:03.954 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
13:14:03.954 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:140:20)
13:14:03.954 | at async /vercel/path0/node_modules/next/dist/build/index.js:959:56
13:14:03.954 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:140:20)
13:14:03.954 | at async Promise.all (index 4)
13:14:03.955 | at async /vercel/path0/node_modules/next/dist/build/index.js:892:17
13:14:03.955 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:140:20)
13:14:03.955 | at async /vercel/path0/node_modules/next/dist/build/index.js:829:124
13:14:04.001 | Error: Command "npm run build" exited with 1

Same with my reproducer. That's why I moved it to /api/.... The static page generation somehow times out.

Luluno01 commented 9 months ago

I installed the exact versions of next, pdfjs-dist and canvas in the reproducer as used by the other project of mine that magically works. It doesn't help rule out the segmentation fault, though. Also, the bundle size of the endpoint that uses pdfjs-dist has already grown to 50 MB, which is close to the size limit imposed by Vercel. Since I will very likely add more functionality to the endpoint, I guess I had better stay with the good old solution: turning to Google Cloud Functions.

Luluno01 commented 9 months ago


Running a build on Vercel right now, will see if it fails on my end too. Do you happen to also use tesseract.js for OCR? That import is giving me hell as well :(

Not yet. But I might have to use it soon LOL

AChangXD commented 9 months ago

I installed the exact versions of next, pdfjs-dist and canvas in the reproducer as used by the other project of mine that magically works. It doesn't help rule out the segmentation fault, though. Also, the bundle size of the endpoint that uses pdfjs-dist has already grown to 50 MB, which is close to the size limit imposed by Vercel. Since I will very likely add more functionality to the endpoint, I guess I had better stay with the good old solution: turning to Google Cloud Functions.

Yeah, I'll have to as well, or at least host a Node.js backend on Vercel. There was a change that pdfjs-dist introduced that ballooned the bundle size; I read about it somewhere yesterday but can't remember where.

AChangXD commented 9 months ago

I installed the exact versions of next, pdfjs-dist and canvas in the reproducer as used by the other project of mine that magically works. It doesn't help rule out the segmentation fault, though. Also, the bundle size of the endpoint that uses pdfjs-dist has already grown to 50 MB, which is close to the size limit imposed by Vercel. Since I will very likely add more functionality to the endpoint, I guess I had better stay with the good old solution: turning to Google Cloud Functions.

Also keep in mind that cloud functions has a 100MB limit as well, sadly. why can't there be a semi-decent pdf parsing library out there... So frustrating

Luluno01 commented 9 months ago

Yeah, I'll have to as well, or at least host a Node.js backend on Vercel. There was a change that pdfjs-dist introduced that ballooned the bundle size; I read about it somewhere yesterday but can't remember where.

I guess it might have something to do with the transitive dependency canvas. Although it is an optional dependency of pdfjs-dist, webpack decides it needs that package and might be bundling the huge binaries of canvas.

Luluno01 commented 9 months ago

Also keep in mind that cloud functions has a 100MB limit as well, sadly. why can't there be a semi-decent pdf parsing library out there... So frustrating

Cloud Functions has much more relaxed restrictions, as claimed here:

Max deployment size (1st gen): 100MB (compressed) for sources; 500MB (uncompressed) for sources plus modules.
Max deployment size (2nd gen): N/A.

AChangXD commented 9 months ago

@Luluno01 I downgraded next to 13.5.6 and at least langchain's PDFLoader is working? I'm guessing they bundle the PDFLoader in a specific way that the other libraries don't?

Luluno01 commented 9 months ago

@Luluno01 I downgraded next to 13.5.6 and at least langchain's PDFLoader is working? I'm guessing they bundle the PDFLoader in a specific way that the other libraries don't?

I was testing with getDocument and none of canary, 13.5.6 or 13.5.4 works in the reproducer. My other project which runs next.js 13.5.4, however, works magically. I don't think it's a good idea to use that library in an unstable hacky way.

AChangXD commented 9 months ago

@Luluno01 So deployed my Node/Express backend on Vercel and got this as well: Unhandled Promise Rejection {"errorType":"Runtime.UnhandledPromiseRejection","errorMessage":"Error: Setting up fake worker failed: \"Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs\".","reason":{"errorType":"Error","errorMessage":"Setting up fake worker failed: \"Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs\".","stack":["Error: Setting up fake worker failed: \"Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs\"."," at file:///var/task/node_modules/pdfjs-dist/build/pdf.mjs:3720:36"," at processTicksAndRejections (node:internal/process/task_queues:95:5)"]},"promise":{},"stack":["Runtime.UnhandledPromiseRejection: Error: Setting up fake worker failed: \"Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs\"."," at process. (file:///var/runtime/index.mjs:1276:17)"," at process.emit (node:events:526:35)"," at process.emit (/var/task/_vc/launcher/__sourcemap_support.js:602:21)"," at emit (node:internal/process/promises:150:20)"," at processPromiseRejections (node:internal/process/promises:284:27)"," at processTicksAndRejections (node:internal/process/task_queues:96:32)"]} Unknown application error occurred Runtime.Unknown

Works fine and dandy on localhost, think this one is related to ESM though

Luluno01 commented 9 months ago

@Luluno01 So deployed my Node/Express backend on Vercel and got this as well: Unhandled Promise Rejection {"errorType":"Runtime.UnhandledPromiseRejection","errorMessage":"Error: Setting up fake worker failed: "Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs".","reason":{"errorType":"Error","errorMessage":"Setting up fake worker failed: "Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs".","stack":["Error: Setting up fake worker failed: "Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs"."," at file:///var/task/node_modules/pdfjs-dist/build/pdf.mjs:3720:36"," at processTicksAndRejections (node:internal/process/task_queues:95:5)"]},"promise":{},"stack":["Runtime.UnhandledPromiseRejection: Error: Setting up fake worker failed: "Cannot find module '/var/task/node_modules/pdfjs-dist/build/pdf.worker.mjs' imported from /var/task/node_modules/pdfjs-dist/build/pdf.mjs"."," at process. (file:///var/runtime/index.mjs:1276:17)"," at process.emit (node:events:526:35)"," at process.emit (/var/task/_vc/launcher/__sourcemap_support.js:602:21)"," at emit (node:internal/process/promises:150:20)"," at processPromiseRejections (node:internal/process/promises:284:27)"," at processTicksAndRejections (node:internal/process/task_queues:96:32)"]} Unknown application error occurred Runtime.Unknown

Works fine and dandy on localhost, think this one is related to ESM though

I think you might have to import the minified version, as pdf.mjs uses await import(/* webpackIgnore: true */ this.workerSrc) to import the worker module dynamically, which requires manual setup to ensure the worker module is bundled separately. The minified version, in contrast, has the magic comment /* webpackIgnore: true */ stripped but still keeps the dynamic import, allowing this dynamic import to be intercepted by Next.js's webpack. As far as I can tell, that's very likely why my hacky workaround tricks webpack into bundling and registering an import path for pdf.worker.mjs.

AChangXD commented 9 months ago


Yep, you are right, that worked for me! It seems like Vercel also has issues finding .wasm files:

Aborted(Error: ENOENT: no such file or directory, open '/var/task/node_modules/tesseract.js-core/tesseract-core-simd.wasm') Uncaught Exception {"errorType":"RuntimeError","errorMessage":"Aborted(Error: ENOENT: no such file or directory, open '/var/task/node_modules/tesseract.js-core/tesseract-core-simd.wasm'). Build with -sASSERTIONS for more info.","stack":["RuntimeError: Aborted(Error: ENOENT: no such file or directory, open '/var/task/node_modules/tesseract.js-core/tesseract-core-simd.wasm'). Build with -sASSERTIONS for more info."," at n (/var/task/node_modules/tesseract.js-core/tesseract-core-simd.js:13:225)"," at Ma (/var/task/node_modules/tesseract.js-core/tesseract-core-simd.js:14:143)"," at /var/task/node_modules/tesseract.js-core/tesseract-core-simd.js:14:491"]} Unknown application error occurred Runtime.Unknown

This might be Webpack not bundling the .wasm file either? I never thought it would be this much of a headache to get two packages running on Vercel...

Luluno01 commented 9 months ago


Very likely. If you have to use tesseract.js on Vercel, another workaround is to bypass Next.js and register a separate folder as your function implementation (you will need to do your own vendoring/bundling/tree-shaking). See vercel.json for more details.

malikiz commented 7 months ago

I decided to follow a simple path: I downloaded the stable version from the official website and put all the files in the public folder. Then I added this tag to my component:

<script src="/pdfjs/pdf.mjs" type="module" />

Then I added this code in useEffect:

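  // The <script> tag above is what makes window.pdfjsLib available here
  const pdfjs = window.pdfjsLib as typeof import('pdfjs-dist/types/src/pdf')
  const pdfjsWorker = await import('pdfjs-dist/build/pdf.worker.min.mjs');
  pdfjs.GlobalWorkerOptions.workerSrc = pdfjsWorker;

  // getDocument returns a loading task; await its promise to get the document itself
  const pdfDocument = await pdfjs.getDocument('http://localhost:3000/pdf-files/myFile.pdf').promise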

  console.log('pdfDocument', pdfDocument);

huozhi commented 7 months ago

Hi, some bundling fixes have landed on canary (14.0.5-canary.45). I tested against the latest canary and it works well now: getDocument is a valid function. Another thing to note is that you don't need to remove .default to get the full module imports: await import('pdfjs-dist')

Luluno01 commented 7 months ago


Good to hear that! Could you please elaborate a bit on what the fix is and how it fixes the issue? Will that fix land on 13.x, or how can we cherry-pick that fix to 13.x? Thanks a lot.

huozhi commented 7 months ago

There are a few module-resolution-related bundling fixes applied after 14.0.4, now on canary. Unfortunately, we're not going to apply them back to 13.x.

Luluno01 commented 7 months ago


Okayyyy... Thank you for your reply. Sounds like I have to upgrade to 14.0.5+ later to be able to use pdfjs with fewer workarounds.

dhallX commented 7 months ago

is there an updated solution for this? facing the same issues: import trace for request module/Release/canvas.node

next version 14.0.5