research-software-directory / RSD-as-a-service

This repo contains the new RSD-as-a-service implementation
https://research.software

Unexpected RSD outage #1084

Closed jmaassen closed 8 months ago

jmaassen commented 9 months ago

We had an unexpected RSD outage on 2024/01/11 at around 19:40 UTC. After high CPU usage was signaled, the VM became completely unreachable, via both HTTP and SSH. After rebooting the VM, the RSD worked as expected again.

I'll collect some observations in this issue to try and reconstruct what went wrong.

jmaassen commented 9 months ago

Some observations:

The oom-killer seems to indicate that a process is requesting more memory than is available. When this happens, the oom-killer will kill a (?) process to free up memory. Depending on the configuration, this may be the offending process or another one (I think, see here: https://stackoverflow.com/questions/9199731/understanding-the-linux-oom-killers-logs)

jmaassen commented 9 months ago

Looking into the scraper logs I see the following:

scrapers         | 2024-01-11T11:31:57.897551301Z [176.536s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers         | 2024-01-11T11:31:57.898039320Z [176.536s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2423-Worker-1"
scrapers         | 2024-01-11T11:31:57.901134197Z [176.540s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers         | 2024-01-11T11:31:57.901786604Z [176.540s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-0"
scrapers         | 2024-01-11T11:31:57.902254622Z [176.541s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers         | 2024-01-11T11:31:57.902845434Z [176.541s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2423-Worker-2"
scrapers         | 2024-01-11T11:31:57.903193247Z [176.542s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers         | 2024-01-11T11:31:57.903902717Z [176.542s][warning][os,thread] Failed to start the native thread for java.lang.Thread "Thread-1"
scrapers         | 2024-01-11T11:31:57.904043343Z [176.543s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0>
scrapers         | 2024-01-11T11:31:57.904614106Z [176.543s][warning][os,thread] Failed to start the native thread for java.lang.Thread "HttpClient-2424-SelectorManager"
scrapers         | 2024-01-11T11:31:57.905279175Z Exception in thread "main" java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
scrapers         | 2024-01-11T11:31:57.908991580Z    at java.base/java.lang.Thread.start0(Native Method)
scrapers         | 2024-01-11T11:31:57.909027381Z    at java.base/java.lang.Thread.start(Unknown Source)
scrapers         | 2024-01-11T11:31:57.909075224Z    at java.net.http/jdk.internal.net.http.HttpClientImpl.start(Unknown Source)
scrapers         | 2024-01-11T11:31:57.909150650Z    at java.net.http/jdk.internal.net.http.HttpClientImpl.create(Unknown Source)
scrapers         | 2024-01-11T11:31:57.909181607Z    at java.net.http/jdk.internal.net.http.HttpClientBuilderImpl.build(Unknown Source)
scrapers         | 2024-01-11T11:31:57.909245375Z    at java.net.http/java.net.http.HttpClient.newHttpClient(Unknown Source)
scrapers         | 2024-01-11T11:31:57.909292608Z    at nl.esciencecenter.rsd.scraper.Utils.getAsAdmin(Utils.java:112)
scrapers         | 2024-01-11T11:31:57.909525631Z    at nl.esciencecenter.rsd.scraper.doi.PostgrestMentionRepository.save(PostgrestMentionRepository.java:89)
scrapers         | 2024-01-11T11:31:57.909560752Z    at nl.esciencecenter.rsd.scraper.doi.MainCitations.main(MainCitations.java:35)

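It seems that a scraper is trying to create a new thread and fails. The stack trace shows that Utils.getAsAdmin calls HttpClient.newHttpClient, and the thread names in the warnings (HttpClient-2423-Worker-1, HttpClient-2424-SelectorManager) suggest that this process had already created more than 2,400 HTTP clients by that point. This may be caused by the scrapers "leaking" threads due to this: https://sonarcloud.io/project/issues?resolved=false&id=nl.research-software%3Ascrapers&open=AYtmqkdRye23VlELZ_NO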
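
To illustrate the suspected leak: each java.net.http.HttpClient instance starts its own selector thread, so building a new client for every request can pile up threads if the old clients are not released quickly enough. Below is a minimal sketch of the usual alternative, one shared client per process. This is not the actual RSD code: the class name SharedHttpClientSketch, the method names and the 30 second timeouts are assumptions for illustration, and get() is only a hypothetical stand-in for a helper like Utils.getAsAdmin.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch, not the RSD scraper code.
public final class SharedHttpClientSketch {

    // One client for the whole process. Once built, an HttpClient is
    // immutable and can be used to send many requests, so every request
    // reuses the same selector thread and connection pool.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(30)) // assumed timeout
            .build();

    private SharedHttpClientSketch() {
    }

    // Returns the shared instance instead of building a new client.
    static HttpClient client() {
        return CLIENT;
    }

    // Hypothetical stand-in for a helper such as Utils.getAsAdmin:
    // sends an authenticated GET through the shared client.
    static String get(String uri, String bearerToken)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(uri))
                .header("Authorization", "Bearer " + bearerToken)
                .timeout(Duration.ofSeconds(30)) // assumed timeout
                .GET()
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With this shape, a loop that saves thousands of mentions would go through SharedHttpClientSketch.get(...) and keep the number of HTTP client threads roughly constant. On Java 21 and later, HttpClient also implements AutoCloseable, so a short-lived client in a try-with-resources block would be another option.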