We've had an unexpected RSD outage around 2024/01/11 at 19:40 UTC. After showing high CPU usage, the VM became completely unreachable, both via HTTP and SSH. After rebooting the VM, the RSD worked as expected again.

I'll collect some observations in this issue to try to reconstruct what went wrong.

Some observations:
```
systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
...
Out of memory: Killed process 490603 (java) total-vm:5151332kB, anon-rss:620256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:7396kB oom_score_adj:0
```
The oom-killer output indicates that a process requested more memory than was available. When this happens, the oom-killer kills a process to free memory; depending on the configuration, this may be the offending process or another one (see https://stackoverflow.com/questions/9199731/understanding-the-linux-oom-killers-logs).
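The victim is picked via the kernel's per-process "badness" score, which oom_score_adj shifts; the oom_score_adj:0 in the log above means the java process had no protection. As a Linux-specific diagnostic sketch (not RSD code), the current score of a process can be read from /proc:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch (assumes Linux): print the kernel's OOM "badness" score
// for a PID. A higher oom_score makes the process a likelier OOM victim;
// oom_score_adj (-1000..1000) biases that score.
public class OomScore {
    public static void main(String[] args) throws Exception {
        String pid = args.length > 0 ? args[0] : "self";
        String score = Files.readString(Path.of("/proc", pid, "oom_score")).trim();
        String adj = Files.readString(Path.of("/proc", pid, "oom_score_adj")).trim();
        System.out.println("pid=" + pid + " oom_score=" + score + " oom_score_adj=" + adj);
    }
}
```

Running this against the scraper process before a crash would show how likely the kernel is to pick it as the victim.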
Looking into the scraper logs I see the following:
```
scrapers | 2024-01-11T11:31:57.897551301Z [176.536s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.898039320Z [176.536s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2423-Worker-1"
scrapers | 2024-01-11T11:31:57.901134197Z [176.540s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.901786604Z [176.540s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-0"
scrapers | 2024-01-11T11:31:57.902254622Z [176.541s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.902845434Z [176.541s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2423-Worker-2"
scrapers | 2024-01-11T11:31:57.903193247Z [176.542s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.903902717Z [176.542s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-1"
scrapers | 2024-01-11T11:31:57.904043343Z [176.543s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers | 2024-01-11T11:31:57.904614106Z [176.543s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2424-SelectorManager"
scrapers | 2024-01-11T11:31:57.905279175Z Exception in thread "main" java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
scrapers | 2024-01-11T11:31:57.908991580Z 	at java.base/java.lang.Thread.start0(Native Method)
scrapers | 2024-01-11T11:31:57.909027381Z 	at java.base/java.lang.Thread.start(Unknown Source)
scrapers | 2024-01-11T11:31:57.909075224Z 	at java.net.http/jdk.internal.net.http.HttpClientImpl.start(Unknown Source)
scrapers | 2024-01-11T11:31:57.909150650Z 	at java.net.http/jdk.internal.net.http.HttpClientImpl.create(Unknown Source)
scrapers | 2024-01-11T11:31:57.909181607Z 	at java.net.http/jdk.internal.net.http.HttpClientBuilderImpl.build(Unknown Source)
scrapers | 2024-01-11T11:31:57.909245375Z 	at java.net.http/java.net.http.HttpClient.newHttpClient(Unknown Source)
scrapers | 2024-01-11T11:31:57.909292608Z 	at nl.esciencecenter.rsd.scraper.Utils.getAsAdmin(Utils.java:112)
scrapers | 2024-01-11T11:31:57.909525631Z 	at nl.esciencecenter.rsd.scraper.doi.PostgrestMentionRepository.save(PostgrestMentionRepository.java:89)
scrapers | 2024-01-11T11:31:57.909560752Z 	at nl.esciencecenter.rsd.scraper.doi.MainCitations.main(MainCitations.java:35)
```
It seems the scraper fails while trying to create a new thread. Each HttpClient created via HttpClient.newHttpClient() starts its own selector and worker threads, and the numbers in the thread names above (HttpClient-2423, HttpClient-2424) suggest thousands of clients were created within the first ~176 seconds. This may be caused by the scrapers "leaking" threads due to this: https://sonarcloud.io/project/issues?resolved=false&id=nl.research-software%3Ascrapers&open=AYtmqkdRye23VlELZ_NO
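A minimal sketch of the usual mitigation, assuming Utils.getAsAdmin currently builds a fresh client per call as the stack trace suggests (the class and method below are hypothetical illustrations, not the actual RSD code): create one HttpClient for the lifetime of the process and reuse it, so its threads are started only once.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class SharedHttpClient {
    // One client for the whole process: its SelectorManager and worker
    // threads are created once, instead of once per saved mention.
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Hypothetical helper mirroring the getAsAdmin call in the stack trace.
    public static String get(String url, String token) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Since Java 21, HttpClient also implements AutoCloseable, so short-lived clients can at least be closed deterministically, but reusing a single instance avoids the thread churn entirely.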