hercules-ci / hercules-ci-agent

https://hercules-ci.com build and deployment agent
Apache License 2.0
97 stars 19 forks source link

Crash on purs-nix #445

Closed roberth closed 1 year ago

roberth commented 2 years ago

Description

worker exited with -11

(=segfault)

To Reproduce

agent rev 12d421bef8826efb2be10e0e39d7e8c60e497798

agent.follows.nixpkgs rev 16236dd7e33ba4579ccd3ca8349396b2f9c960fe

https://hercules-ci.com/github/purs-nix/purs-nix/jobs/80

srid commented 1 year ago

@roberth Here's a minimal repro. Evaluate builtins.match ".*" (toString (builtins.genList (x: x) 100000)) and HCI will segfault.

Repo: https://github.com/purs-nix/hci-segfault-repro

roberth commented 1 year ago

The problem

The regex library in the GNU C++ library eats a lot of stack for this regex, surprisingly. The stack runs out and that condition is not detected and reported properly. It also happens in Nix itself.

$ nix-instantiate --expr --eval 'builtins.match ".*" (toString (builtins.genList (x: x) 100000))'
error: stack overflow (possible infinite recursion)

but for a humongous stack, it works:

# 180 MB because number unit is 1KB
$ ulimit -Ss $[180 * 1024]; nix-instantiate --expr --eval 'builtins.match ".*" (toString (builtins.genList (x: x) 100000))'
[ ]

Workaround

You may be able to work around this by setting a large stack size for system threads.

    systemd.services.hercules-ci-agent = {
      serviceConfig.LimitSTACK = 128 * 1024 * 1024;
    };

This will allocate a lot of virtual memory for the stack, most of which is never mapped to real memory. Also the evaluation worker is a transient process, so the real memory is released after evaluating a job. Most threads in the system are Haskell's green threads, which are always tiny.

Alternative workaround, regex performance

There's also a potential workaround that could be done in Nix code, which is to split large strings into smaller ones before applying the regex. E.g. splitting by lines first. I'd say that this "optimization" is completely unnecessary, if it wasn't for the fact that Nix always matches the whole input, as though it has $...^. Clearly at least libstdc++ isn't clever enough to optimize away .*^ (considering your example with .* and the fact that . even matches newlines in Nix).

Solutions

The regex stack behavior is unacceptable, and there's already a bug report in the upstream libstdc++.

An intermediate solution is to avoid libstdc++. The LLVM project's libc++ does not suffer from this problem, as evidenced by nix-instantiate on darwin. Building hercules-ci-agent with LLVM for Linux runs into one or two cross-compilation issues. (pkgsLLVM is implemented with the cross machinery.)

Another issue is that the stack overflow is reported as a crash instead of, well, a stack overflow. This requires either a small patch to Nix or a large change to the hercules-ci-agent process architecture.

You can expect a PR tomorrow :)