[Bug]: Worker exits but jest process never finishes

vilmosioo commented 2 years ago

Version

28.1.3

Steps to reproduce

Run jest with worker count higher than 1. Empty test that just sleeps for a minute will do.
Kill one of the workers intentionally. In practice, they may die for whatever reason, for us we suspect it is OOM.
Jest will indefinitely hang.

However the workers may crash with other signals and those scenarios are not covered. In our case, after some debugging, the signal is null. For some reason these workers are crashing in jest-runtime at the compileFunction.call line, and causes a null exit code, which gets ignored. jest-runner waits on a thread pool that'll never fulfil the submitted job.

The signal appears to be SIGKILL instead of SIBABRT , and the exitCode appears to be null. Please see screenshots of the debug process.

The above outputs

ChildProcessWorker exit: exitCode = null, pid = 1378, signal = SIGKILL

We apologize for the lack of a minimal reproduction, but hope the thorough investigation given will substitute.

Expected behavior

Jest exits when one of the workers crashes for whatever reason.

Actual behavior

Jest hangs when workers are unexpectedly SIGKILL'ed.

Additional context

This does not happen with typescript v4.6.4. This only happens after we upgraded to v4.7.4. This may not be relevant, we are thinking this is an OOM.

Environment

System:
ubuntu-jdk-npm:0.40.1920

Binaries:
Typescript 4.7.4
Node 16.13.0

SimenB commented 2 years ago

@phawxby thoughts?

gluxon commented 2 years ago

Hi everyone — @vilmosioo is my teammate and I wanted to give a bit more info.

We spent a few days investigating this before filing the issue. I do believe the workers are being OOMKilled, but don't have logging to verify since this is running in a CircleCI docker container. I hope the report contains enough information to be helpful still. Please let us know if there's any extra details that would be useful.

We're not actively seeing this issue day to day, but thought reporting it would help the Jest team and other users. Thanks for making Jest.

phawxby commented 2 years ago

Exit codes and exit signals were probably the most problematic bit of the fixes that I put in recently because they're extremely inconsistent between platforms. However I thought I had covered the OOM scenario, and I do have a test which forcibly induces an OOM crash here. https://github.com/facebook/jest/blob/main/packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.js#L238

My questions at this point would be.

Does this happen in jest v28? Is this a regression I've introduced?
Can you create a repro within the WorkerEdgeCases test suite I can use for investigation?

gluxon commented 2 years ago

Thanks for taking a look @phawxby. I think I should have been more specific about the kind of OOM we think we're seeing.

Looking over the WorkerEdgeCases.test.js file, it looks like that catches OOMs within node itself.
I think what's happening for us is the process is dying externally due to Linux's oom_kill mechanism selecting a worker process to kill. We've noticed this happen when the CircleCI container's cgroup hits its allocated memory limit, and oom_kill selects the largest process to kill.

Does this happen in jest v28? Is this a regression I've introduced?

This does happen with Jest v28.1.0. Apologies, we actually haven't had a chance to test your fix on v29.0.0 yet. From inspecting source code, we didn't believe it would take effect since we're seeing SIGKILL rather than SIGABRT.

I think your fix is strictly an improvement. Thank you. 🙂 We'll work on migrating to Jest v29 soon.

Can you create a repro within the WorkerEdgeCases test suite I can use for investigation?

The full repro is platform-specific due to OOMKiller being Linux-specific. I think we can simulate OOMKiller by calling subprocess.kill("SIGKILL") on the worker process. Is that sort of test worth adding here? If so we can create a PR adding that to WorkerEdgeCases.

Exit codes and exit signals were probably the most problematic bit of the fixes that I put in recently because they're extremely inconsistent between platforms.

I'm sure there's a good reason, but I was curious why jest-worker might want to exit silently depending on the exit codes and signals?

https://github.com/facebook/jest/blob/dde24c85698259d9ce887ede9646847d0fff9554/packages/jest-worker/src/workers/ChildProcessWorker.ts#L374-L376

I see the other branches in this conditional either throw an error or retry.

phawxby commented 2 years ago

@gluxon you make a good point. That wasn't actually code I originally wrote, I just worked around it so didn't really dig into why it's doing what it's doing. But you're right:

If it's an exit mode we know to be bad (OOM) then we should probably terminate with an error
If it's an exit and we're not active in the process of shutting down the worker then we should restart the worker because it's still needed

We probably shouldn't pay any attention to what the exit code or signal actually is unless there's a known problem, otherwise we just work based on what the intended state is

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 30 days.

gluxon commented 2 years ago

That wasn't actually code I originally wrote, I just worked around it so didn't really dig into why it's doing what it's doing.

Ah that clarification is helpful. Thank you for your work to make this more correct on internal OOMs.

We probably shouldn't pay any attention to what the exit code or signal actually is unless there's a known problem, otherwise we just work based on what the intended state is

Agree with this. I'm going to try creating a test case that mimics what OOMKiller does and see how the Jest maintainers feel about any changes to make these forms of OOMs more clear.

gluxon commented 2 years ago

Created a minimal repro. This test hangs forever and never hits the 3s timeout.

// src/killed.test.ts

export async function wait(ms: number) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

test("jest-worker externally killed", async () => {
    await wait(2_000);
}, 3_000);

setTimeout(() => {
    // Self-kill to make repro easier.
    process.kill(process.pid);
}, 1_000);

https://github.com/gluxon/test-jest-worker-killed-repro

hanging

gluxon commented 2 years ago

Starting a fix in https://github.com/facebook/jest/pull/13566. Any reviews or feedback welcome!

SimenB commented 2 years ago

https://github.com/facebook/jest/releases/tag/v29.3.0

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. Please note this issue tracker is not a help forum. We recommend using StackOverflow or our discord channel for questions.

jestjs / jest