We've had an unexpected RSD outage around 2024/01/11 at 19:40 UTC. After showing high CPU usage, the VM became completely unreachable, both via HTTP and SSH. After rebooting the VM, the RSD worked as expected again.

I'll collect some observations in this issue to try to reconstruct what went wrong.

Some observations:
```
systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
...
Out of memory: Killed process 490603 (java) total-vm:5151332kB, anon-rss:620256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:7396kB oom_score_adj:0
```
The oom-killer output indicates that a process requested more memory than was available. When this happens, the oom-killer kills a process to free memory; depending on the configuration, this may be the offending process or another one (see https://stackoverflow.com/questions/9199731/understanding-the-linux-oom-killers-logs).
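The victim is picked via the kernel's per-process "badness" score, which oom_score_adj shifts; the oom_score_adj:0 in the log above means the java process had no protection. As a Linux-specific diagnostic sketch (not RSD code), the current score of a process can be read from /proc:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch (assumes Linux): print the kernel's OOM "badness" score
// for a PID. A higher oom_score makes the process a likelier OOM victim;
// oom_score_adj (-1000..1000) biases that score.
public class OomScore {
    public static void main(String[] args) throws Exception {
        String pid = args.length > 0 ? args[0] : "self";
        String score = Files.readString(Path.of("/proc", pid, "oom_score")).trim();
        String adj = Files.readString(Path.of("/proc", pid, "oom_score_adj")).trim();
        System.out.println("pid=" + pid + " oom_score=" + score + " oom_score_adj=" + adj);
    }
}
```

Running this against the scraper process before a crash would show how likely the kernel is to pick it as the victim.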
Looking into the scraper logs I see the following:
```
scrapers | 2024-01-11T11:31:57.897551301Z [176.536s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.898039320Z [176.536s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2423-Worker-1"
scrapers | 2024-01-11T11:31:57.901134197Z [176.540s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.901786604Z [176.540s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-0"
scrapers | 2024-01-11T11:31:57.902254622Z [176.541s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.902845434Z [176.541s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2423-Worker-2"
scrapers | 2024-01-11T11:31:57.903193247Z [176.542s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.903902717Z [176.542s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-1"
scrapers | 2024-01-11T11:31:57.904043343Z [176.543s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.904614106Z [176.543s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2424-SelectorManager"
scrapers | 2024-01-11T11:31:57.905279175Z Exception in thread "main" java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
scrapers | 2024-01-11T11:31:57.908991580Z 	at java.base/java.lang.Thread.start0(Native Method)
scrapers | 2024-01-11T11:31:57.909027381Z 	at java.base/java.lang.Thread.start(Unknown Source)
scrapers | 2024-01-11T11:31:57.909075224Z 	at java.net.http/jdk.internal.net.http.HttpClientImpl.start(Unknown Source)
scrapers | 2024-01-11T11:31:57.909150650Z 	at java.net.http/jdk.internal.net.http.HttpClientImpl.create(Unknown Source)
scrapers | 2024-01-11T11:31:57.909181607Z 	at java.net.http/jdk.internal.net.http.HttpClientBuilderImpl.build(Unknown Source)
scrapers | 2024-01-11T11:31:57.909245375Z 	at java.net.http/java.net.http.HttpClient.newHttpClient(Unknown Source)
scrapers | 2024-01-11T11:31:57.909292608Z 	at nl.esciencecenter.rsd.scraper.Utils.getAsAdmin(Utils.java:112)
scrapers | 2024-01-11T11:31:57.909525631Z 	at nl.esciencecenter.rsd.scraper.doi.PostgrestMentionRepository.save(PostgrestMentionRepository.java:89)
scrapers | 2024-01-11T11:31:57.909560752Z 	at nl.esciencecenter.rsd.scraper.doi.MainCitations.main(MainCitations.java:35)
```
It seems the scraper fails while trying to create a new thread. Each HttpClient created via HttpClient.newHttpClient() starts its own selector and worker threads, and the numbers in the thread names above (HttpClient-2423, HttpClient-2424) suggest thousands of clients were created within the first ~176 seconds. This may be caused by the scrapers "leaking" threads due to this: https://sonarcloud.io/project/issues?resolved=false&id=nl.research-software%3Ascrapers&open=AYtmqkdRye23VlELZ_NO
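A minimal sketch of the usual mitigation, assuming Utils.getAsAdmin currently builds a fresh client per call as the stack trace suggests (the class and method below are hypothetical illustrations, not the actual RSD code): create one HttpClient for the lifetime of the process and reuse it, so its threads are started only once.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class SharedHttpClient {
    // One client for the whole process: its SelectorManager and worker
    // threads are created once, instead of once per saved mention.
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Hypothetical helper mirroring the getAsAdmin call in the stack trace.
    public static String get(String url, String token) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Since Java 21, HttpClient also implements AutoCloseable, so short-lived clients can at least be closed deterministically, but reusing a single instance avoids the thread churn entirely.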