gitlab CI (bullseye-slim) - Error: end() called before task completed

mattpr commented 5 months ago

Getting an error on gitlab CI using docker image node:16-bullseye-slim which I can't reproduce locally (on macos).

exiftool-vendored version 24.3.0.

some.pdf in the error below is generally the FIRST pdf to be processed in the build job...so I suspect the error message is just a red herring and there is rather some environmental problem (missing dep, perl, etc that isn't installed with with npm ci).

So I guess this is more question than bug report... any recommendations on how to troubleshoot underlying issues with the batch-cluster so I have a more useful error message?

Build job

const srcDir = 'src/files',
    destDir = 'dist/files';

async function stripMetaAndCopy () {
    const rePDF = /\.pdf$/i,
        ExifTool = require('exiftool-vendored').ExifTool,
        exiftool = new ExifTool({ taskTimeoutMillis: 5000 }),
        files = glob.sync(srcDir+'/**/*', {
            nodir: true,
        });

    for (const file of files) {
        let fileParsed = parsePath(srcDir, file),
            outDir = path.join(destDir, fileParsed.sub),
            outFile = path.join(outDir, fileParsed.base);

        if (rePDF.exec(file)) { // looks like a pdf
            if (isNewer(file, outFile)) {
                mkdirp(outDir);

                fs.copyFileSync(file, outFile);

                try {
                    await exiftool.write(outFile, {
                        Title: null,
                        Author: null,
                        Subject: null,
                        Creator: null,
                        Producer: null,
                        Keywords: null,
                    }, ['-overwrite_original_in_place']);  // -o outFile does not work, error if already exists, no overwrite option.
                }
                catch (e) {
                    console.error('exiftool-vendored threw an error:', e);
                    exiftool.end();
                    throw e;
                }

                // qpdf linearize ommitted
            }
        }
    }
    exiftool.end();

}

gitlab CI error

exiftool-vendored threw an error: Error: end() called before task completed ({"gracefully":false,"lastTask":{"command":"-sep\n\u001f\n-E\n-Title=\n-Author=\n-Subject=\n-Creator=\n-Producer=\n-Keywords=\n-overwrite_original_in_place\n/builds/GROUP_NAME/PROJECT_NAME/dist/files/some.pdf\n-ignoreMinorErrors\n-execute\n","taskId":2,"args":["-sep","\u001f","-E","-Title=","-Author=","-Subject=","-Creator=","-Producer=","-Keywords=","-overwrite_original_in_place","/builds/GROUP_NAME/PROJECT_NAME/dist/files/some.pdf"],"errors":[],"warnings":[],"sourceFile":"/builds/GROUP_NAME/PROJECT_NAME/dist/files/some.pdf"}})
    at BatchProcess._BatchProcess_end (/builds/GROUP_NAME/PROJECT_NAME/node_modules/batch-cluster/dist/BatchProcess.js:373:29)
    at async observe (/builds/GROUP_NAME/PROJECT_NAME/node_modules/batch-cluster/dist/Deferred.js:108:19)
[10:50:25] 'stripMetaAndCopy' errored after 1.8 s
[10:50:25] Error: end() called before task completed ({"gracefully":false,"lastTask":{"command":"-sep\n\u001f\n-E\n-Title=\n-Author=\n-Subject=\n-Creator=\n-Producer=\n-Keywords=\n-overwrite_original_in_place\n/builds/GROUP_NAME/PROJECT_NAME/dist/files/some.pdf\n-ignoreMinorErrors\n-execute\n","taskId":2,"args":["-sep","\u001f","-E","-Title=","-Author=","-Subject=","-Creator=","-Producer=","-Keywords=","-overwrite_original_in_place","/builds/GROUP_NAME/PROJECT_NAME/dist/files/some.pdf"],"errors":[],"warnings":[],"sourceFile":"/builds/GROUP_NAME/PROJECT_NAME/dist/files/some.pdf"}})
    at BatchProcess._BatchProcess_end (/builds/GROUP_NAME/PROJECT_NAME/node_modules/batch-cluster/dist/BatchProcess.js:373:29)
    at async observe (/builds/GROUP_NAME/PROJECT_NAME/node_modules/batch-cluster/dist/Deferred.js:108:19)

aside: I also noticed locally (macos) that after a build crash there is a zombie perl process running exiftool that is never cleaned up. I've added my own SIG/process cleanup try and prevent this from happening in the future, but would be nice if maybe exiftool-vendored handled this internally using a process.on('uncaughtException',... or process.on('exit',... to try and kill off any batch-cluster stuff.

mceachen commented 5 months ago

Node 16 isn't supported.
16-bullseye-slim does have perl preinstalled, but if your CI image is something else (like the alpine variant), know that your image needs perl installed.
It'd be helpful to include a working example. Where is mkdirp coming from?
You might want to either await mkdirp or use a sync mkdirp variant.
BatchCluster does register for all childProcess exit events. See the node docs. Listening for process.on("UncaughtException") would be way overstepping, given there's no way for me to only capture exceptions batch-cluster is "responsible" for
You aren't awaiting exiftool.end(). I suspect this is the main issue causing your zombies.

mattpr commented 5 months ago

Thanks for the detailed reply @mceachen

I knew I should have taken more time to strip down the example, sorry.

Here is a stripped down repro.

Doesn't seem that node16 is the issue as 18 fails with same error...

Node 16

Node 18

Node 16 isn't supported.

Old project that is not actively developed. But that doesn't seem to be the issue here as 18 has the same issue.

16-bullseye-slim does have perl preinstalled, but if your CI image is something else (like the alpine variant), know that your image needs perl installed.

If it turns out this is a missing perl or whatever issue, that information isn't getting surfaced. Maybe that isn't it.

It'd be helpful to include a working example. Where is mkdirp coming from? You might want to either await mkdirp or use a sync mkdirp variant.

Sure. Repro above stripping out copy and mkdir stuff to show that isn't related to the problem (it was synchronous mkdirp under the hood FWIW)

BatchCluster does register for all childProcess exit events. See the node docs. Listening for process.on("UncaughtException") would be way overstepping, given there's no way for me to only capture exceptions batch-cluster is "responsible" for

Fair enough. In a parallelized build there are a lot of places the node process can crash. I just wanted to make sure we weren't ending up with zombie processes due to build crashes where .end() was not getting called. But maybe it is related to your next point and batchcluster does a sane job of terminating child processes during a node crash.

To be clear, I wasn't implying you would handle other codes' exceptions or prevent node from crashing...just that you would detect node is crashing and do some cleanup to avoid zombie processes if batchcluster isn't doing that itself already.

You aren't awaiting exiftool.end(). I suspect this is the main issue causing your zombies.

Thanks. Could be. I'll open a separate issue if I continue to notice zombie processes after build crashes.

mceachen commented 5 months ago

I pulled your repo and tried to reproduce locally:

node:18-bullseye works fine
node:18-bullsye-slim fails with the error you described.

If you change your exiftool instance to include a logger:

        exiftool = new ExifTool({ taskTimeoutMillis: 5000, logger: () => require("batch-cluster").ConsoleLogger }),

You'll see the following output:

$ gulp build
[04:20:13] Using gulpfile /code/gulpfile.js
[04:20:13] Starting 'build'...
BatchProcess(104).onStderr(): Can't locate mro.pm in @INC (you may need to install the mro module) (@INC contains: /code/node_modules/exiftool-vendored.pl/bin/lib /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1 /usr/lib/x86_64-linux-gnu/perl5/5.32 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl-base /usr/lib/x86_64-linux-gnu/perl/5.32 /usr/share/perl/5.32 /usr/local/lib/site_perl) at /usr/lib/x86_64-linux-gnu/perl-base/overload.pm line 123.

Turns out that Node's -slim image doesn't include some perl modules that ExifTool relies on, including mro.

It's even easier to validate by hand:

$ docker run -v.:/code -it node:18-bullseye-slim bash
root@5c5130e1a33d:/# cd /code
root@5c5130e1a33d:/code# node_modules/exiftool-vendored.pl/bin/exiftool -Creator= /code/src/files/dummy.pdf
Can't locate mro.pm in @INC (you may need to install the mro module) (@INC contains: node_modules/exiftool-vendored.pl/bin/lib /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.32.1 /usr/local/share/perl/5.32.1 /usr/lib/x86_64-linux-gnu/perl5/5.32 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl-base /usr/lib/x86_64-linux-gnu/perl/5.32 /usr/share/perl/5.32 /usr/local/lib/site_perl) at /usr/lib/x86_64-linux-gnu/perl-base/overload.pm line 123.

I'll update the docs and error handling to detect this and throw this as an error in the future.

mattpr commented 5 months ago

Thanks for taking another look at this. No problem to add missing deps to the slim image during the build. Very helpful to have the underlying issue surfaced of course. I also appreciate the debugging tip about adding the logger.

Thanks for the fix!

photostructure / exiftool-vendored.js

gitlab CI (bullseye-slim) - Error: end() called before task completed #168