hercules-ci / hercules-ci-agent

https://hercules-ci.com build and deployment agent
Apache License 2.0

OOM events right after start `mmap 4096 bytes at (nil): Cannot allocate memory` #514

Closed Mic92 closed 1 year ago

Mic92 commented 1 year ago

Description

The machine has plenty of free memory, yet hercules-ci-agent crashes right after it starts.

To Reproduce

systemctl restart hercules-ci-agent

Happens on build02.nix-community.org, we can provide access to the machine as needed:

We already tried increasing the stack size:

https://github.com/nix-community/infra/pull/546

A crash can be seen here: https://hercules-ci.com/github/nix-community/nix-init/jobs/691

Expected behavior

no crashes.

Logs

Apr 28 18:53:54 build02 hercules-ci-agent[929782]: [2023-04-28 18:53:54][Agent][Info][build02][PID 929782][ThreadId 23][agent-version:0.9.11][main:Hercules.Agent hercules-ci-agent/Hercules/Agent.hs:115:19] Agent online.
Apr 28 18:58:17 build02 systemd[1]: Stopping hercules-ci-agent.service...
Apr 28 18:58:17 build02 systemd[1]: hercules-ci-agent.service: Deactivated successfully.
Apr 28 18:58:17 build02 systemd[1]: Stopped hercules-ci-agent.service.
Apr 28 18:58:17 build02 systemd[1]: hercules-ci-agent.service: Consumed 386ms CPU time, no IO, received 20.5K IP traffic, sent 9.4K IP traffic.
Apr 28 18:58:18 build02 systemd[1]: Starting hercules-ci-agent.service...
Apr 28 18:58:18 build02 hercules-ci-agent[1010079]: hercules-ci-agent: mmap 4096 bytes at (nil): Cannot allocate memory
Apr 28 18:58:18 build02 hercules-ci-agent[1010079]: hercules-ci-agent: Try specifying an address with +RTS -xm<addr> -RTS
Apr 28 18:58:25 build02 systemd[1]: hercules-ci-agent.service: Control process exited, code=dumped, status=11/SEGV
Apr 28 18:58:25 build02 systemd[1]: hercules-ci-agent.service: Failed with result 'core-dump'.
Apr 28 18:58:25 build02 systemd[1]: Failed to start hercules-ci-agent.service.
Apr 28 18:58:25 build02 systemd[1]: hercules-ci-agent.service: Consumed 4.241s CPU time, no IP traffic.
[mic92@build02:~]$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64245       39123       23030          18        2091       23848
Swap:          32122           0       32122

Platform / Version

Best to go to https://hercules-ci.com/dashboard and click on the agents' tab for the account you're interested in. hercules-ci-agent --help version

Apr 28 18:53:54 build02 hercules-ci-agent[929782]: [2023-04-28 18:53:54][Agent][Info][build02][PID 929782][ThreadId 23][agent-version:0.9.11][main:Hercules.Agent hercules-ci-agent/Hercules/Agent.hs:115:19] Agent online.

zowoq commented 1 year ago

https://github.com/nix-community/infra/commit/a96682a55bd00992998ec64f518da49a05b9e6a9

Seems to have been caused by something in nixpkgs; reverting our last flake update resolved the problem.

The nixpkgs diff includes a staging-next merge:

https://github.com/NixOS/nixpkgs/compare/5d91a896449650d13fbba8c8abc65d1615c2b654...3d409345416cda845407e3075f5eaf7a590d9db5

zowoq commented 1 year ago

@roberth

I tried updating the flake again but now hercules-ci-api x86_64-linux is failing on nixpkgs master which breaks the agent.

https://hydra.nixos.org/build/217758149

roberth commented 1 year ago

> but now hercules-ci-api x86_64-linux is failing on nixpkgs master which breaks the agent.
>
> https://hydra.nixos.org/build/217758149

Yikes, that looks like a corrupted store path in the hercules-ci-api-core dependency output. Or a ghc/haskell/... that writes empty files and then succeeds.

zowoq commented 1 year ago

I skimmed through the haskell room and noticed that the same error was posted there:

https://app.element.io/#/room/#haskell:nixos.org/$1CSEVH8JOgJ3EYYaL866fLxnW7KZjPK8lZH2vVcWYdM

ghc: mmap 4096 bytes at (nil): Cannot allocate memory
ghc: Try specifying an address with +RTS -xm<addr> -RTS
[1]    1231767 segmentation fault (core dumped)  ghci

Seems to be a kernel issue:

https://lore.kernel.org/regressions/20230303201120.kjvrnqi65xll5cqg@revolver/T/

Mic92 commented 1 year ago

We could potentially downgrade for now:

roberth commented 1 year ago

The latest commit that appears to complete the fix would be https://github.com/torvalds/linux/commit/0fa99fdfe1b38da396d0b2d1496a823bcd0ebea0, merged into linux 6.3-rc4. It and related commits have been backported and queued for a 6.1.x release (https://github.com/gregkh/linux/commit/0608b3da04f5063fe503b7f9287ebb9c9b494fd7) and a 6.2.x release (https://github.com/gregkh/linux/commit/48c427450711cbc537c9d0b297ea7da9b89d4137).

So it seems that we're waiting for a tag at this point. Until then, an upgrade to 6.3 or some downgrade seem like good workarounds.

roberth commented 1 year ago

I might have put too much faith in those threads.

Reverting linux/58c5d0d6d522112577c7eeb71d382ea642ed7be4 fixes the regression, based on checks.x86_64-linux.agent-function-test results.

Example NixOS config:

boot.kernelPackages = pkgs.linuxKernel.packages.linux_6_2.extend (self: super: {
  kernel = super.kernel.override (o: {
    kernelPatches = o.kernelPatches ++ [
      { name = "wip"; patch = ./revert-58c5d0d6d522112577c7eeb71d382ea642ed7be4.patch; }
    ];
  });
});
./revert-58c5d0d6d522112577c7eeb71d382ea642ed7be4.patch

```diff
diff --git a/mm/mmap.c b/mm/mmap.c
index d5475fbf5729..ff68a67a2a7c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1518,8 +1518,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
  */
 static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 {
-	unsigned long length, gap, low_limit;
-	struct vm_area_struct *tmp;
+	unsigned long length, gap;
 
 	MA_STATE(mas, &current->mm->mm_mt, 0, 0);
 
@@ -1528,29 +1527,12 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 	if (length < info->length)
 		return -ENOMEM;
 
-	low_limit = info->low_limit;
-retry:
-	if (mas_empty_area(&mas, low_limit, info->high_limit - 1, length))
+	if (mas_empty_area(&mas, info->low_limit, info->high_limit - 1,
+				  length))
 		return -ENOMEM;
 
 	gap = mas.index;
 	gap += (info->align_offset - gap) & info->align_mask;
-	tmp = mas_next(&mas, ULONG_MAX);
-	if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
-		if (vm_start_gap(tmp) < gap + length - 1) {
-			low_limit = tmp->vm_end;
-			mas_reset(&mas);
-			goto retry;
-		}
-	} else {
-		tmp = mas_prev(&mas, 0);
-		if (tmp && vm_end_gap(tmp) > gap) {
-			low_limit = vm_end_gap(tmp);
-			mas_reset(&mas);
-			goto retry;
-		}
-	}
-
 	return gap;
 }
 
@@ -1566,8 +1548,7 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
  */
 static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 {
-	unsigned long length, gap, high_limit, gap_end;
-	struct vm_area_struct *tmp;
+	unsigned long length, gap;
 
 	MA_STATE(mas, &current->mm->mm_mt, 0, 0);
 	/* Adjust search length to account for worst case alignment overhead */
@@ -1575,31 +1556,12 @@ static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 	if (length < info->length)
 		return -ENOMEM;
 
-	high_limit = info->high_limit;
-retry:
-	if (mas_empty_area_rev(&mas, info->low_limit, high_limit - 1,
+	if (mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
 				length))
 		return -ENOMEM;
 
 	gap = mas.last + 1 - info->length;
 	gap -= (gap - info->align_offset) & info->align_mask;
-	gap_end = mas.last;
-	tmp = mas_next(&mas, ULONG_MAX);
-	if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
-		if (vm_start_gap(tmp) <= gap_end) {
-			high_limit = vm_start_gap(tmp);
-			mas_reset(&mas);
-			goto retry;
-		}
-	} else {
-		tmp = mas_prev(&mas, 0);
-		if (tmp && vm_end_gap(tmp) > gap) {
-			high_limit = tmp->vm_start;
-			mas_reset(&mas);
-			goto retry;
-		}
-	}
-
 	return gap;
 }
 
```

Mailing list links (lore.kernel.org)

figsoda commented 1 year ago

btw fetchpatch has a revert option if you don't want a local copy of the patch
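
For anyone wanting to try that, here is a rough sketch of how the kernel override from earlier in the thread might look with `fetchpatch` and `revert = true` instead of a local patch file. The GitHub `.patch` URL form and the `revert-58c5d0d6d522` patch name are my assumptions, and `lib.fakeHash` is only a placeholder: the first build fails and reports the real hash, which you then paste in. Untested against the thread's exact nixpkgs revision.

```nix
{ pkgs, lib, ... }:
{
  boot.kernelPackages = pkgs.linuxKernel.packages.linux_6_2.extend (self: super: {
    kernel = super.kernel.override (o: {
      kernelPatches = o.kernelPatches ++ [
        {
          # Hypothetical name; any descriptive string works here.
          name = "revert-58c5d0d6d522";
          patch = pkgs.fetchpatch {
            # Assumed URL form: GitHub serves a commit as a patch when ".patch" is appended.
            url = "https://github.com/torvalds/linux/commit/58c5d0d6d522112577c7eeb71d382ea642ed7be4.patch";
            # Fetch the upstream commit and apply it in reverse.
            revert = true;
            # Placeholder; replace with the hash reported by the first failed build.
            hash = lib.fakeHash;
          };
        }
      ];
    });
  });
}
```

This keeps the configuration self-contained, at the cost of depending on the upstream URL staying reachable.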

roberth commented 1 year ago

Patch is in master. Track it

zowoq commented 1 year ago

Thank you, I've deployed it on our machines and switched back to the 6.1 kernel; it seems to be fine.

roberth commented 1 year ago

Upstream merge reached nixpkgs master in https://github.com/NixOS/nixpkgs/pull/233927, backported in https://github.com/NixOS/nixpkgs/pull/234175.