ocaml / infrastructure

Wiki to hold the information about the machine resources available to OCaml.org

OCluster Scheduler OOM #118

Closed mtelvers closed 1 month ago

mtelvers commented 1 month ago

Following a recent reboot of the OCluster scheduler to install Debian updates, the scheduler's heap grows until it consumes all of the system memory, over a period of about 12 hours. The service is then killed by the Linux OOM killer and restarts.

Investigations are ongoing.

mtelvers commented 1 month ago

I have reverted to the previously built Docker image. So far its memory usage has been stable.

The old build is still available on Docker Hub, provided you know the SHA:

ocurrent/ocluster-scheduler:live@sha256:1d74ed0553a0e7bc7bd0f8783ab9f2c07cb5bf76f007b68fc330d9a5c6b53617 
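
The reference above has the form `repository:tag@sha256:digest`. As far as I understand, when a digest is present it is the digest that selects the image, which is why the old build stays reachable after the `live` tag has moved on. A minimal Python sketch of splitting such a reference follows; the helper name `split_image_ref` is hypothetical, and it assumes a simple reference with no registry host and port in front.

```python
def split_image_ref(ref: str):
    """Split 'repo[:tag][@algo:hex]' into (repo, tag, digest).

    Assumption: no registry host with a port (e.g. 'localhost:5000/...'),
    which would need extra handling because of the additional colon.
    """
    name, _, digest = ref.partition("@")   # digest part, if any, follows '@'
    repo, _, tag = name.partition(":")     # tag part, if any, follows ':'
    return repo, tag or None, digest or None


ref = ("ocurrent/ocluster-scheduler:live@sha256:"
       "1d74ed0553a0e7bc7bd0f8783ab9f2c07cb5bf76f007b68fc330d9a5c6b53617")
repo, tag, digest = split_image_ref(ref)
print(repo, tag, digest)
```

Pinning by digest in this way is also what the Compose file later in this thread relies on to run three different builds side by side.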
mtelvers commented 1 month ago

The latest build is from 2023-04-20, commit ec7f1e9b01ec2a8e9985bbef4f62569a82b36ffe on the live-scheduler branch. The previous build was from 2023-03-03, commit d177823e29803387eb12e2db9e55981ae9f00a2f.

The difference between the two builds is these commits:

git log --oneline ...v0.2.1
ec7f1e9 (HEAD, origin/live-scheduler) Bump Dockerfiles, restore Ubuntu risc-v
ecb8394 Update lower bounds
fbb7956 Fix Dockerfile now that ocluster-worker has been released
cbe339c Support OBuilder Docker backend on Windows and Linux (#143)
ec2aee4 Add multi-arch sha for alpine image. (#225)
95a6c9f Merge pull request #224 from tmcgilchrist/alpine-worker
dae96aa Add Docker worker build for alpinelinux
c46c047 Merge pull request #223 from mtelvers/show-option
66ff81d Added option to show
9fabc9b gha: upload Windows builds as artifacts
52fa410 gha: switch to sunset version of opam-repository-mingw
1664797 gha: revert to normal opam commands
17b2a93 Improve Cluster_worker.run doc
mtelvers commented 1 month ago

I created a Docker stack consisting of three instances of the scheduler and a Prometheus instance. The highlights of the stack file are:

version: "3.7"
services:

  scheduler1:
    image: ocurrent/ocluster-scheduler:live@sha256:fec2c97deb974351fd11a97032a3bc37d26555e2e714352e50d703682147bb1b
...

  scheduler2:
    image: ocurrent/ocluster-scheduler:live@sha256:1d74ed0553a0e7bc7bd0f8783ab9f2c07cb5bf76f007b68fc330d9a5c6b53617
...

  scheduler3:
    image: ocurrent/ocluster-scheduler:live@sha256:a0f6fd8dca5e4d875cf97b030b037c0593ef67fbc0fe6594d65ae9b008cb2d9b
...

  prometheus:
    image: prom/prometheus
...

The three images are, in order: commit ec7f1e9, the head of the live-scheduler branch; commit d177823, the previously deployed version (tag v0.2.1); and commit be0460e, which is PR#245, moving to the latest opam commit sha.

[Image: graph of ocaml_gc_heap_words over time for the three scheduler instances]

The graph shows ocaml_gc_heap_words, which increases continually with the current live-scheduler version and remains flat for both the previous version and the PR#245 version.
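
The same "rising versus flat" comparison can be made numerically on samples of the metric rather than by eye. The following is only a sketch: the function names, the 0.5 threshold and the sample values are all assumptions made for illustration, not measurements from these schedulers. It fits a least-squares line to the samples and flags a series whose fitted rise across the window is large relative to its mean.

```python
def growth_per_sample(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(values, min_relative_growth=0.5):
    """True if the fitted line rises by more than min_relative_growth
    times the mean value over the window (the threshold is an assumption)."""
    slope = growth_per_sample(values)
    mean = sum(values) / len(values)
    return mean > 0 and slope * (len(values) - 1) / mean > min_relative_growth


# Made-up sample data, NOT readings from the schedulers in this issue.
growing = [100, 130, 155, 190, 220, 260]
flat = [100, 104, 98, 101, 103, 99]
print(looks_like_leak(growing), looks_like_leak(flat))
```

A fitted slope is less sensitive to individual GC cycles than comparing only the first and last sample, although the window still needs to be long enough to span several major collections.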

Therefore, I conclude that commit ec7f1e9 pulled in a bad dependency, which was subsequently fixed: the PR#245 build, with its newer opam commit sha, does not show the growth.