Closed. vilmosioo closed this issue 2 years ago.
@phawxby thoughts?
Hi everyone — @vilmosioo is my teammate and I wanted to give a bit more info.
We spent a few days investigating this before filing the issue. I do believe the workers are being OOMKilled, but we don't have logging to verify that since this is running in a CircleCI docker container. I hope the report still contains enough information to be helpful. Please let us know if there are any extra details that would be useful.
We're not actively seeing this issue day to day, but thought reporting it would help the Jest team and other users. Thanks for making Jest.
Exit codes and exit signals were probably the most problematic bit of the fixes that I put in recently because they're extremely inconsistent between platforms. However, I thought I had covered the OOM scenario, and I do have a test which forcibly induces an OOM crash here: https://github.com/facebook/jest/blob/main/packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.js#L238
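For context, an internal node OOM can be forced by exhausting the V8 heap, at which point node aborts from the inside (typically raising SIGABRT). The snippet below is only a rough sketch of that idea, not the linked test itself.

```ts
// Rough sketch only, not the linked WorkerEdgeCases test. If this runs in a
// child process started with a small heap (for example --max-old-space-size=50),
// the V8 heap fills up and node aborts with "JavaScript heap out of memory".
// That is an OOM raised inside node itself, as opposed to an external kill.
const leak: string[][] = [];

while (true) {
  // Retain another large array on every iteration so the managed heap only grows.
  leak.push(new Array(1_000_000).fill('x'));
}
```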
My questions at this point would be:
Does this happen in jest v28? Is this a regression I've introduced?
Can you create a repro within the WorkerEdgeCases test suite I can use for investigation?
Thanks for taking a look @phawxby. I think I should have been more specific about the kind of OOM we think we're seeing.
From the OOM test in the WorkerEdgeCases.test.js file, it looks like that catches OOMs within node itself. What we're seeing instead is the Linux oom_kill mechanism selecting a worker process to kill. We've noticed this happen when the CircleCI container's cgroup hits its allocated memory limit, and oom_kill selects the largest process to kill.
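As a point of reference, here is a minimal sketch (not jest-worker code) of what the parent process observes in that situation. Since SIGKILL cannot be handled by the child, all the parent ever sees is the exit event with a null code.

```ts
// Minimal sketch, not jest-worker internals. When the kernel's OOM killer (or
// anything else) sends SIGKILL to a child process, the parent's "exit" listener
// fires with code === null and signal === "SIGKILL". The kill is simulated here
// from the parent so the behaviour is observable on any platform.
import {fork} from 'child_process';

const child = fork('./some-worker.js'); // hypothetical worker entry point

child.on('exit', (code, signal) => {
  console.log(`worker exit: code = ${code}, signal = ${signal}`);
});

setTimeout(() => child.kill('SIGKILL'), 1_000);
```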
Does this happen in jest v28? Is this a regression I've introduced?
This does happen with Jest v28.1.0. Apologies, we actually haven't had a chance to test your fix on v29.0.0 yet. From inspecting the source code, we didn't believe it would take effect since we're seeing SIGKILL rather than SIGABRT.
I think your fix is strictly an improvement. Thank you. 🙂 We'll work on migrating to Jest v29 soon.
Can you create a repro within the WorkerEdgeCases test suite I can use for investigation?
The full repro is platform-specific due to OOMKiller being Linux-specific. I think we can simulate OOMKiller by calling subprocess.kill("SIGKILL") on the worker process. Is that sort of test worth adding here? If so we can create a PR adding that to WorkerEdgeCases, roughly along the lines of the sketch below.
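To make the suggestion concrete, this is roughly the shape such a test could take. Everything here is hypothetical: the file names, the worker module, and the assertion encode the desired behaviour (reject instead of hang), not what Jest currently does, and it only uses jest-worker's public API.

```ts
// externally-killed-worker.ts (hypothetical worker module)
export const getPid = async () => process.pid;
export const hang = () => new Promise<void>(() => {}); // never settles on its own

// externally-killed.test.ts (hypothetical test sketch)
import {Worker} from 'jest-worker';

test('a SIGKILLed worker does not hang the caller', async () => {
  const worker = new Worker(require.resolve('./externally-killed-worker'), {
    enableWorkerThreads: false, // child processes, so an external SIGKILL applies
    exposedMethods: ['getPid', 'hang'],
    numWorkers: 1,
  }) as Worker & {getPid(): Promise<number>; hang(): Promise<void>};

  // Grab the child's pid first, then start a task that will never finish on its own.
  const pid = await worker.getPid();
  const pending = worker.hang();

  // Simulate the OOM killer: terminate the child with an uncatchable SIGKILL.
  process.kill(pid, 'SIGKILL');

  // Desired behaviour: the pending task settles (rejects) instead of hanging.
  await expect(pending).rejects.toThrow();

  await worker.end();
});
```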
Exit codes and exit signals were probably the most problematic bit of the fixes that I put in recently because they're extremely inconsistent between platforms.
I'm sure there's a good reason, but I was curious why jest-worker might want to exit silently depending on the exit codes and signals? I see the other branches in this conditional either throw an error or retry.
@gluxon you make a good point. That wasn't actually code I originally wrote, I just worked around it so didn't really dig into why it's doing what it's doing. But you're right:
We probably shouldn't pay any attention to what the exit code or signal actually is unless there's a known problem, otherwise we just work based on what the intended state is
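As a sketch of what that could look like in practice (illustrative only, not jest-worker's actual implementation): the parent records the state it wants the worker to be in, and any exit that does not match that intent is treated as a crash, no matter which code or signal accompanied it.

```ts
// Illustrative sketch of working from the intended state, not jest-worker code.
import {fork} from 'child_process';

type Intent = 'running' | 'shutting-down';

export function superviseWorker(workerPath: string, onCrash: (reason: string) => void) {
  let intent: Intent = 'running';
  const child = fork(workerPath); // hypothetical worker entry point

  child.on('exit', (code, signal) => {
    if (intent === 'shutting-down') {
      return; // we asked it to stop, so any exit code or signal is acceptable
    }
    // Intent was "running", so this exit is a crash, whether it came from an
    // internal abort (SIGABRT), an external SIGKILL, or a plain non-zero code.
    onCrash(`worker exited unexpectedly (code=${code}, signal=${signal})`);
  });

  return {
    child,
    shutdown() {
      intent = 'shutting-down'; // deliberate stop: the next exit is expected
      child.kill();
    },
  };
}
```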
That wasn't actually code I originally wrote, I just worked around it so didn't really dig into why it's doing what it's doing.
Ah that clarification is helpful. Thank you for your work to make this more correct on internal OOMs.
We probably shouldn't pay any attention to what the exit code or signal actually is unless there's a known problem, otherwise we just work based on what the intended state is
Agree with this. I'm going to try creating a test case that mimics what OOMKiller does and see how the Jest maintainers feel about any changes to make these forms of OOMs more clear.
Created a minimal repro. This test hangs forever and never hits the 3s timeout.
// src/killed.test.ts
export async function wait(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

test("jest-worker externally killed", async () => {
  await wait(2_000);
}, 3_000);

setTimeout(() => {
  // Self-kill to make the repro easier. Note: process.kill defaults to SIGTERM;
  // pass "SIGKILL" as a second argument to mimic the OOM killer exactly.
  process.kill(process.pid);
}, 1_000);
Starting a fix in https://github.com/facebook/jest/pull/13566. Any reviews or feedback welcome!
Version
28.1.3
Steps to reproduce
Related and recent fix https://github.com/facebook/jest/pull/13054/files#diff-e99351d361fdd6f55d39c7c41af94576baccc2b9dce73a5bfd0b516c6ab648e9
However, the workers may crash with other signals, and those scenarios are not covered. In our case, after some debugging, the workers are crashing in jest-runtime at the compileFunction.call line for some reason, and the resulting null exit code gets ignored. jest-runner then waits on a thread pool that will never fulfil the submitted job.
The signal appears to be SIGKILL instead of SIGABRT, and the exitCode appears to be null. Please see the screenshots of the debug process.
The above outputs:
ChildProcessWorker exit: exitCode = null, pid = 1378, signal = SIGKILL
We apologize for the lack of a minimal reproduction, but hope the thorough investigation above will substitute for one.
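To make the failure mode concrete, here is a minimal illustration (not Jest's actual code): an exit handler that only reacts to exits it recognises lets an external SIGKILL, which arrives with a null exit code, fall through silently, so the promise for the submitted job never settles.

```ts
// Minimal illustration of the hang described above; this is not Jest's code,
// only the failure pattern. An exit with code === null (e.g. signal "SIGKILL")
// reaches no branch that rejects, so callers of runTask() wait forever.
import {fork} from 'child_process';

export function runTask(workerPath: string): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const child = fork(workerPath); // hypothetical worker entry point

    child.on('message', result => resolve(String(result)));

    child.on('exit', (code, signal) => {
      if (code !== null && code !== 0) {
        reject(new Error(`worker failed with exit code ${code}`));
        return;
      }
      // code === null && signal === 'SIGKILL' ends up here: nothing rejects,
      // so the promise stays pending and the caller hangs.
    });
  });
}
```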
Expected behavior
Jest exits when one of the workers crashes for whatever reason.
Actual behavior
Jest hangs when workers are unexpectedly SIGKILL'ed.
Additional context
This does not happen with typescript v4.6.4; it only happens after we upgraded to v4.7.4. This may not be relevant, as we think this is an OOM.
Environment