oven-sh / bun

Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
https://bun.sh

Error with `pdf-text-reader` and NODE_ENV=production #9637

Open randompixel opened 8 months ago

randompixel commented 8 months ago

What version of Bun is running?

1.0.35+940448d6b

What platform is your computer?

Darwin 23.2.0 arm64 arm

What steps can reproduce the bug?

If I attempt to use pdf-text-reader (which itself uses Mozilla's pdf.js, and that is where the error happens), it fails whenever I'm not in dev mode.

So I switched the oven/bun Docker build I have to call bun run dev directly instead of building it, except that also failed when I went into production.

After a lot more trial and error, it appears that the build also sets NODE_ENV=production, but then require can't be resolved when it does.

Minimal reproduction:

import { readPdfText } from "pdf-text-reader";

/** Define the return type for a FileParser class' parse function */
type ParsedFile = {
    fileName: string;
    fileSize: number;
    fileType: string;
    body: string;
    footers?: string;
    headers?: string;
};

/** Define the signature for a FileParser class */
type ParseFunction = (file: File) => Promise<ParsedFile>;

/** Define the type of classes that the factory can return */
interface FileParser {
    parse: ParseFunction;
}

class PdfParser implements FileParser {
    public async parse(file: File): Promise<ParsedFile> {
        const blob = file;
        const stream = await blob.arrayBuffer();
        const readText = await readPdfText({ data: stream, worker: null });

        return {
            fileName: blob.name,
            fileSize: blob.size,
            fileType: blob.type,
            body: readText,
        };
    }
}

Bun.serve({
    port: 4000,
    async fetch(req) {
        const url = new URL(req.url);

        // parse form data posted to /parse
        if (url.pathname === "/parse") {
            const formdata = await req.formData();
            const file = formdata.get("file");
            console.log(file);
            // formdata.get() can return null or a string, so check before parsing
            if (!(file instanceof File)) {
                return new Response("Expected a file upload", { status: 400 });
            }
            const parser = new PdfParser();
            const body = await parser.parse(file);
            return new Response(body.body);
        }

        return new Response("Not Found", { status: 404 });
    },
});

What is the expected behavior?

POST a file through form-data and it parses the text out of the PDF when running in NODE_ENV=production
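
For reference, the upload step can be sketched as a small request builder. This is a minimal sketch, assuming the server from the reproduction is listening locally on port 4000; `buildParseRequest` is a hypothetical helper name, while the `/parse` path and the `file` field name come from the reproduction above.

```typescript
// Hypothetical helper: builds the multipart POST that the reproduction's
// /parse route expects. The base URL is an assumption (local server, port 4000).
function buildParseRequest(
    pdf: ArrayBuffer,
    fileName: string,
    baseUrl: string = "http://localhost:4000",
): Request {
    const form = new FormData();
    // The reproduction reads the upload with formdata.get("file").
    form.append("file", new Blob([pdf], { type: "application/pdf" }), fileName);
    return new Request(new URL("/parse", baseUrl).href, {
        method: "POST",
        body: form,
    });
}
```

Passing the resulting request to `fetch` while the server runs with NODE_ENV=production should exercise the code path that fails in this report.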

What do you see instead?

Setting up fake worker failed: "Can't find variable: require"

Additional information

No response

pfgithub commented 8 months ago

I don't see a problem running with env NODE_ENV=production bun a.js, but this is a minimal reproduction for the issue with bun build --target bun:

// a.js
if(typeof require === "function") {
  const mymodule = eval("require")("./b.js");
  mymodule.main();
}

// b.js
module.exports.main = function() {
  console.log("hello from b.js");
};

bun build a.js --outdir ./out --target bun
bun ./out/a.js
# should log "hello from b.js", instead errors

pdfjs-dist seems to be hiding the fake worker import behind eval("require"), maybe so that it doesn't get imported when bundled for the browser? That said, the browser path seems designed to run with no bundler at all, because it embeds a <script> element to load the fake worker.
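
To illustrate why a name hidden inside an eval string is fragile, here is a minimal sketch of a run-time lookup by name. It is not pdf.js or Bun code: `lookupByName` and `loaderRemovedByBundler` are hypothetical names, and the sketch calls eval through an alias (an indirect eval, which only searches the global scope) so that it behaves the same in any runtime, whereas the thread's `eval("require")` is a direct eval.

```typescript
// Resolve an identifier from a string at run time. A bundler only sees the
// string, so it cannot rewrite or preserve whatever the name refers to.
function lookupByName(name: string): unknown {
    // Calling eval through an alias makes this an indirect eval,
    // which resolves names in the global scope only.
    const indirectEval = eval;
    try {
        return indirectEval(name);
    } catch (err) {
        // A name that does not exist at run time raises a ReferenceError.
        if (err instanceof ReferenceError) return undefined;
        throw err;
    }
}

// Hypothetical name standing in for a loader that exists in the source
// but has no binding in the bundled output.
const missingLoader = lookupByName("loaderRemovedByBundler");
console.log(missingLoader === undefined);

// A name that really is global still resolves.
console.log(lookupByName("Math") === Math);
```

If the bundled output has no binding called `require`, the same kind of lookup fails, which would be consistent with the "Can't find variable: require" message reported above.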


While trying to make a smaller reproduction, I got a different error (on Darwin 23.2.0 arm64 arm)

// a.js
import { readPdfText } from "pdf-text-reader";

const file = await Bun.file("dummy.pdf").arrayBuffer();
const readText = await readPdfText({ data: file, worker: null });
console.log(readText);

$> bun build a.js --outdir ./out --target bun --sourcemap=external
fish: Job 1, 'bun build a.js --outdir ./out -…' terminated by signal SIGBUS (Misaligned address error)
Exited with code [SIGBUS]

With --sourcemap=external removed, the error doesn't show up.

This seems to be caused by strings.wtf8ByteSequenceLengthWithInvalid(remaining[0]) returning a number larger than remaining.len in sourcemap.zig:
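
The failure mode described here, a decoded sequence length that runs past the end of the buffer, can be illustrated outside of Bun. The following is a TypeScript sketch of that class of bug and of a clamp that avoids it. It is not Bun's Zig code, it handles plain UTF-8 lead bytes only, and `sequenceLength` and `countSequences` are hypothetical names.

```typescript
// Expected byte length of a UTF-8 sequence, judged from its lead byte alone.
function sequenceLength(lead: number): number {
    if (lead < 0x80) return 1;            // ASCII
    if ((lead & 0xe0) === 0xc0) return 2; // 110xxxxx
    if ((lead & 0xf0) === 0xe0) return 3; // 1110xxxx
    if ((lead & 0xf8) === 0xf0) return 4; // 11110xxx
    return 1;                             // invalid lead byte: consume one byte
}

// Walk a buffer one sequence at a time. The lead byte can promise more bytes
// than the buffer still holds (truncated input), so the step is clamped to
// the number of bytes remaining before the offset advances.
function countSequences(bytes: Uint8Array): number {
    let offset = 0;
    let count = 0;
    while (offset < bytes.length) {
        const remaining = bytes.length - offset;
        const promised = sequenceLength(bytes[offset] as number);
        offset += Math.min(promised, remaining);
        count++;
    }
    return count;
}

// "a" followed by a 3-byte sequence that is cut off after 2 bytes.
const truncated = new Uint8Array([0x61, 0xe2, 0x82]);
console.log(sequenceLength(0xe2), countSequences(truncated));
```

Without the `Math.min` clamp, the last step would move the offset past the end of the buffer, which is the kind of out-of-bounds access the comment above points at.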

https://github.com/oven-sh/bun/blob/d113803777b14f317188dbfa6bd4e49c54dce9fb/src/sourcemap/sourcemap.zig#L915

randompixel commented 6 months ago

After upgrading to the v5 branch of pdf-text-reader, bun 1.1.8 won't even start:

dyld[68449]: missing symbol called
error: script "dev" was terminated by signal SIGABRT (Abort)
[1]    68448 abort      bun dev

Both the 5.0.1 and 5.1.0 releases of pdf-text-reader fail with the above error: https://github.com/electrovir/pdf-text-reader/releases

Unfortunately, both of those releases fix a security issue in pdf.js that is being reported.

Jarred-Sumner commented 3 months ago

@randompixel pdf-text-reader seems to be using either the V8 C++ API or libuv. Please follow along in #4290

@190n is actively working on supporting V8 C++ APIs in Bun

anuragk15 commented 3 months ago

Is there any other way to read PDF files in Bun?