openHPI / poseidon

Scalable task execution orchestrator for CodeOcean

Sentry Integration causes OOM kill #540

Open MrSerth opened 9 months ago

MrSerth commented 9 months ago

Today, we had another occurrence of Poseidon being OOM killed.

Stack Trace

```
Jan 24 08:27:45 poseidon-terraform poseidon[624]: heap profile: 194: 751991664 [9538813: 1723733127104] @ heap/1048576
Jan 24 08:27:45 poseidon-terraform poseidon[624]: 1: 83886080 [41: 3439329280] @ 0x4a9d4e 0x4a977d 0x4a9959 0x719b1d 0x71aade 0x71c3b6 0x71beef 0x71aade 0x71c70f 0x71aade 0x71c70f 0x717fd3 0x717b99 0x717490 0x78e7d9 0x78ea3c 0x78f889 0x7784d2 0x77bb0d 0x77baf7 0x78b2fd 0x47b8df 0x78a5a5 0x78a56f 0xe46015 0x6e6a69 0x6e974e 0x6e5634 0x470781
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a9d4d#011bytes.growSlice+0x8d#011#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:249
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a977c#011bytes.(*Buffer).grow+0x13c#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:151
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a9958#011bytes.(*Buffer).Write+0x58#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:179
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x719b1c#011encoding/json.uintEncoder+0x13c#011#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:525
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c3b5#011encoding/json.arrayEncoder.encode+0xd5#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:849
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71beee#011encoding/json.sliceEncoder.encode+0x30e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:822
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c70e#011encoding/json.ptrEncoder.encode+0x20e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:878
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c70e#011encoding/json.ptrEncoder.encode+0x20e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:878
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x717fd2#011encoding/json.(*encodeState).reflectValue+0x72#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:323
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x717b98#011encoding/json.(*encodeState).marshal+0xb8#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:295
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71748f#011encoding/json.Marshal+0xcf#011#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:162
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78e7d8#011github.com/getsentry/sentry-go.envelopeFromBody+0x518#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/transport.go:196
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78ea3b#011github.com/getsentry/sentry-go.getRequestFromEvent+0xfb#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/transport.go:232
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78f888#011github.com/getsentry/sentry-go.(*HTTPTransport).SendEvent+0xc8#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/transport.go:359
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x7784d1#011github.com/getsentry/sentry-go.(*Client).processEvent+0x2b1#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/client.go:637
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x77bb0c#011github.com/getsentry/sentry-go.(*Client).CaptureEvent+0x6c#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/client.go:447
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x77baf6#011github.com/getsentry/sentry-go.(*Hub).CaptureEvent+0x56#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/hub.go:224
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78b2fc#011github.com/getsentry/sentry-go.(*Span).doFinish+0x15c#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/tracing.go:356
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x47b8de#011sync.(*Once).doSlow+0xbe#011#011#011#011#011/usr/lib/go-1.21/src/sync/once.go:74
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78a5a4#011sync.(*Once).Do+0x44#011#011#011#011#011#011/usr/lib/go-1.21/src/sync/once.go:65
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78a56e#011github.com/getsentry/sentry-go.(*Span).Finish+0xe#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/tracing.go:205
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110xe46014#011main.initServer.(*Handler).Handle.(*Handler).handle.func1+0x314#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/http/sentryhttp.go:119
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e6a68#011net/http.HandlerFunc.ServeHTTP+0x28#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2136
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e974d#011net/http.serverHandler.ServeHTTP+0x8d#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2938
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e5633#011net/http.(*conn).serve+0x5f3#011#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2009
```
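
For orientation, the dump above is the text (debug=1) form of a Go heap profile: the header line counts in-use objects/bytes and total allocated objects/bytes, and each record lists one allocation site together with its call stack. The dominant site here is `bytes.(*Buffer).Write` inside `encoding/json.Marshal`, reached from `sentry-go`'s `HTTPTransport.SendEvent`, i.e. the serialization of a Sentry event. A minimal sketch of how such a dump can be written with `runtime/pprof` (file name and trigger are illustrative assumptions, not Poseidon's actual mechanism):

```go
// Minimal sketch: writing a debug=1 heap profile produces the
// "heap profile: N: bytes [N: bytes] @ heap/..." format shown above.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	runtime.GC() // refresh in-use statistics before dumping

	f, err := os.Create("heap.pprof.txt") // hypothetical output file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// debug=1 writes the human-readable form seen in the journal above;
	// debug=0 would write the binary format for `go tool pprof`.
	if err := pprof.Lookup("heap").WriteTo(f, 1); err != nil {
		log.Fatal(err)
	}
}
```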


At the same time, Influx also showed erroneous behaviour:

Log Excerpt

```
Jan 24 08:27:25 poseidon-terraform poseidon[624]: 2024/01/24 08:27:25 influxdb2client E! Write error: internal error: unexpected error writing points to database: timeout, batch kept for retrying
Jan 24 08:27:25 poseidon-terraform poseidon[624]: time="2024-01-24T08:27:25.603586Z" level=debug code=200 duration=15.765001853s method=GET path=/api/v1/runners/29-9a31d753-ba91-11ee-8c54-fa163e351aea/files user_agent="Faraday v2.9.0"
Jan 24 08:27:25 poseidon-terraform poseidon[624]: time="2024-01-24T08:27:25.604447Z" level=debug code=200 duration=16.191098103s method=GET path=/api/v1/runners/29-9a31d753-ba91-11ee-8c54-fa163e351aea/websocket user_agent=
Jan 24 08:27:37 poseidon-terraform telegraf[632]: 2024-01-24T08:27:37Z E! [outputs.influxdb_v2] When writing to [http://monitoring-terraform.compute.internal:8086]: Post "http://monitoring-terraform.compute.internal:8086/api/v2/write?bucket=telegraf&org=codeocean": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jan 24 08:27:37 poseidon-terraform telegraf[632]: 2024-01-24T08:27:37Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
Jan 24 08:27:38 poseidon-terraform poseidon[624]: 2024/01/24 08:27:38 influxdb2client E! Write error: internal error: unexpected error writing points to database: timeout, batch kept for retrying
```
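
The "batch kept for retrying" lines come from the influxdb2 Go client, which buffers failed write batches in memory for later retry. A hedged sketch of how that retry buffer could be bounded via the client's write options (the server URL is taken from the log above; the token, limits, and function name are placeholders, and this is not necessarily how Poseidon configures its client):

```go
package main

import (
	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

// newInfluxClient returns a client whose retry buffer cannot grow without
// bound while the InfluxDB server is unreachable (illustrative values only).
func newInfluxClient() influxdb2.Client {
	opts := influxdb2.DefaultOptions()
	opts.SetMaxRetries(3)           // give up on a batch after a few attempts
	opts.SetRetryBufferLimit(1_000) // cap the number of points kept for retrying

	return influxdb2.NewClientWithOptions(
		"http://monitoring-terraform.compute.internal:8086", // from the log above
		"example-token", // placeholder auth token
		opts,
	)
}
```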

MrSerth commented 9 months ago

We decided to focus on the Sentry part first (postponing InfluxDB). The OOM kill may be related to the buffer of Sentry events awaiting transmission to the relay; that would also tie it to the high memory usage we experienced during network outages.
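
If that suspicion holds, one mitigation would be to bound the transport queue on the Poseidon side. A hedged sketch using `sentry-go`'s `HTTPTransport` options (the DSN and the concrete numbers are placeholders; this is not Poseidon's actual initialization code):

```go
// Sketch: the sentry-go HTTPTransport keeps up to BufferSize events in memory
// while they await delivery to the relay. Bounding the buffer and the send
// timeout limits how much can pile up when the relay is unreachable.
package main

import (
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func initSentry() {
	transport := sentry.NewHTTPTransport()
	transport.BufferSize = 10            // queue at most 10 events awaiting delivery
	transport.Timeout = 5 * time.Second  // fail fast instead of blocking on a slow relay

	err := sentry.Init(sentry.ClientOptions{
		Dsn:              "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder
		Transport:        transport,
		TracesSampleRate: 0.1, // sample transactions to reduce event volume
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
}
```

With a small `BufferSize`, the SDK drops new events once the queue is full instead of accumulating them for as long as the relay stays unreachable.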

mpass99 commented 7 months ago

WIP Questions to investigate:

We agreed to postpone this issue until the next occurrence.

mpass99 commented 4 weeks ago

According to POSEIDON-7A, the issue occurred again. This time, Poseidon was not OOM killed, but the heap profile again indicates high memory usage by the Sentry integration.

MrSerth commented 3 weeks ago

You're right, the issue occurred once again. I checked the Sentry Relay and noticed that it was complaining about too little RAM (3899297792 / 4098719744 (95.13% >= 95.00%)). I've addressed at least this part by increasing the instance's RAM from 4 GB to 8 GB and will keep observing the situation. It seems the relay was unable to accept further events, which in turn caused issues in Poseidon, where the collected data was still held in a queue.

MrSerth commented 4 days ago

During the remainder of the recent Python course, we noticed only one further occurrence of the issue. For this second occurrence on October 16th, I was able to gather additional insights: Sentry had a (major) outage, causing issues with the upstream delivery of events. Our Sentry Relay therefore buffered all events that couldn't be sent. After some time, when the Sentry Relay was once again hitting its memory limit, Poseidon also reported excessive memory usage (but no OOM kill).

My suspicion: we only see high memory usage in the Sentry integration when events aren't forwarded correctly, so that a backlog builds up either on the Poseidon host or on the Sentry Relay host.
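
One way to confirm this suspicion on the Poseidon side would be to check periodically whether the SDK can still drain its queue. A hedged sketch (the interval, timeout, and log message are assumptions, not existing Poseidon behaviour):

```go
// Sketch: sentry.Flush reports whether the transport queue could be drained
// within the given timeout, so a periodic check can log a warning as soon as
// events stop being forwarded upstream.
package main

import (
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func watchSentryBacklog(interval, drainTimeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		// Flush returns false if queued events could not be sent in time,
		// which matches the "relay unreachable, queue growing" situation.
		if ok := sentry.Flush(drainTimeout); !ok {
			log.Println("sentry: event queue not drained; upstream relay may be unavailable")
		}
	}
}
```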