openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Memory Threshold exceeded #591

Closed mpass99 closed 4 months ago

mpass99 commented 4 months ago

Sentry Issue: POSEIDON-4H

Memory Threshold exceeded

Last Memory Issue: #453

mpass99 commented 4 months ago

As a preliminary result, I would increase our configured memory threshold (1 GB -> 2 GB), as the memory usage seems valid. About a minute after each "Memory Threshold exceeded" warning, the "Memory Threshold no longer exceeded" message is logged. This can be explained by the Garbage Collector cycles.

Let's confirm that the memory usage is valid by checking the monitoring for memory spikes (once either InfluxDB or the Telegraf Dashboard is accessible again).

mpass99 commented 4 months ago

In our Telegraf Dashboard, we see that the memory usage likely correlates with Poseidon's usage volume and includes spikes. However, the spikes do not exceed 2GB and flatten without external intervention.

Memory Poseidon 13-05-2024

When analyzing the two events of 13-05-2024, we notice that both the HeapAlloc with 1101 MB (the metric that we compare against our threshold) and HeapInuse with 1109 MB exceed our threshold of 1000 MB. These metrics are part of Golang's memstats information [1].
For a cause analysis, we use Golang's runtime MemProfile [2]. Here, the total in-use heap memory is 400 MB [3]. Sentry caused over 96% of the memory usage in both events of that day.
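
For reference, a minimal sketch of such a threshold check based on runtime.ReadMemStats; the constant and function names are illustrative and not necessarily how Poseidon implements it:

```go
package main

import (
	"log"
	"runtime"
)

// memoryThreshold mirrors the currently configured limit of 1000 MB.
const memoryThreshold = 1000 * 1024 * 1024

func checkMemoryThreshold() {
	var stats runtime.MemStats
	runtime.ReadMemStats(&stats)
	if stats.HeapAlloc > memoryThreshold {
		log.Printf("Memory Threshold exceeded: HeapAlloc=%d MB, HeapInuse=%d MB",
			stats.HeapAlloc/1024/1024, stats.HeapInuse/1024/1024)
	}
}

func main() {
	checkMemoryThreshold()
}
```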

As mentioned before, I would go with increasing the configured memory threshold. An alternative might be to decrease Sentry's Sample Rate.
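
To make the alternative concrete, a hedged sketch of what lowering the sampling could look like with the sentry-go SDK (field availability depends on the SDK version; the DSN and rates are placeholders, not our actual configuration):

```go
package main

import (
	"log"

	"github.com/getsentry/sentry-go"
)

func main() {
	err := sentry.Init(sentry.ClientOptions{
		Dsn:                "https://publicKey@example.ingest.sentry.io/0", // placeholder DSN
		EnableTracing:      true,
		TracesSampleRate:   0.1, // sample 10% of transactions instead of all of them
		ProfilesSampleRate: 0.1, // profile only 10% of the sampled transactions
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
}
```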

MrSerth commented 4 months ago

Thanks for analyzing. I am fine with increasing the threshold, but I still think we should also try to tackle one of the "root causes" to find a permanent solution.

With regard to Sentry, I think a memory usage share of 96% seems way too high, close to being unacceptable. Do we have any information on why the memory usage is so high for Sentry? Are events not correctly forwarded to the relay? Or is some data kept in memory once submitted successfully? Or do we need to submit traces more often to reduce the memory footprint? Or shall we disable some information, such as the version information of all dependencies (these are usually included in a Sentry report, at least on Ruby)?

What I am trying to say: I am not really convinced whether the large memory footprint is legit given the expected details collected per request and the number of users we recently had, even when sampling all requests (as we do right now).

mpass99 commented 4 months ago

Good call spending more time on this 👍

Issue 1 When analyzing the stack trace of the heap profile, we see that almost all of the Sentry memory is used in one line of the Sentry SDK. This line seems to be responsible for marshaling the profiling data into a buffer.
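
For context, such a heap profile can be captured at runtime and inspected with go tool pprof to attribute in-use memory to allocation sites; a minimal sketch (the file name is illustrative):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("heap.pprof") // inspect later with: go tool pprof heap.pprof
	if err != nil {
		panic(err)
	}
	defer f.Close()

	runtime.GC() // flush recent garbage so the profile reflects live memory
	if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```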

Issue 2 For estimating the data volume, we can look into the relay's nginx access log. Here, we can roughly count that Poseidon sends about 450 bytes per second, mostly in 41-byte envelopes. From time to time, Poseidon sends 183-byte packets that get rejected with status code 413. We know this symptom already from #428, where we added a mitigation that is still in place (but obviously not stopping the issue).

Issue 1 This, however, is orders of magnitude away from the almost 400 MB that the marshaling is consuming. We should identify the issue here. Does the nginx access log report MB instead of bytes [4]? Where did the data go? Is it compressed? Should we properly count the data arriving at the relay?

Issue 3 When broadening our view, we see not only the daytime spikes (the green issue) but also an increase of 40 MiB per day in memory usage (the red issue).

Poseidon Memory Last Month

To find the root cause here, we have a look at the memory profile from the 28th of April and from the 7th of May.

Issue 3.1 We find that we have 134 (66%) more Goroutines monitoring some local storage. We have to verify whether, at that moment, we also had 66% more (environments/)runners. Otherwise, this might be a slow Goroutine leak.

Issue 3.2 The heap profile shows that both Sentry and Sentry-unrelated data have increased.

         28.04.   07.05.   Increase
Sentry   298 MB   466 MB   56%
Rest     8 MB     16 MB    100%

While the Sentry-unrelated memory usage has increased more strongly in relative terms, its magnitude is more susceptible to randomness and is by far no explanation for the increase of 362 MiB (even if this number has to be cut in half [3]).

The increase in the Sentry memory usage can be explained by each marshaling requiring more memory (165 MB in total) and 28 more profileRecorders (2 MB).

MrSerth commented 4 months ago

Thanks, that's very useful! And I like the categorization as well as visualization very much; really helpful!

Issue 1 json.Marshal is a function originating from the standard library. Hence, I would assume this works correctly. But why is the memory kept / blocked there?

Issue 2 I doubt that the 41 bytes are those being sent by Poseidon. Rather, it's the size being received by Poseidon as a response. Analyzing with Wireshark (and double-checking the meaning of the Nginx log) reveals that. Our Nginx log is configured with the default, thus showing the $body_bytes_sent. The 41 bytes returned are just the ID:

{"id":"c7f60ff3cd4542cb841450effcf97c1b"}

2 byte responses (which might also occur for web-based sessions) are just returning:

{}

I haven't observed a response with code 413 during my short investigation, so this is left for you :). My recommendation: Try to gather more insights on the data being sent and the error case. You may adjust the nginx logging to include the request (body) size and/or do some Wireshark tracing. Since our internal network traffic is not HTTPS encrypted, you'll be able to read it. Only once we have an example of a payload being too large would I further increase the client_max_body_size (it should currently be at 50 MB, not tested).

Issue 1 (No. 2) See previous answer; some recording of data arriving might be useful. And, as far as I understood, Nginx is logging in Bytes.

Issue 3 A 40 MiB increase is not too worrying, but still it should not happen. Thanks for looking here, too.

Issue 3.1 Yes, this could be some Goroutine leakage. From my experience, we currently have an increased usage during days only, but almost no usage during nights. This means that I would expect each night to have only the pre-warming pool running, which would be equal each night. Otherwise: which local storage monitoring do you mean?

Issue 3.2 Is this increase linked to the increased number of Goroutines? And what are these profileRecorders (are they related to the profiling we do for PGO)?

mpass99 commented 4 months ago

Issue 1 json.Marshal is a function originating from the standard library. Hence, I would assume this works correctly. But why is the memory kept / blocked there?

Thank you! This question cleared up my misunderstanding: in contrast to goroutine stack traces, the memory stack traces represent not the place where the data is currently processed, but where it was first allocated.
So we know that we have a lot of transactionProfile data, up to 201 MB. This leads to two questions:

  1. Is it ok that the Sentry Profiling data consumes a buffer of this size for a single request from CodeOcean?
  2. At the time of the analyzed memory dump of the 7th of May, we had about 53 executions per minute and 6 transactionProfile data allocations. Let's assume we can focus on just the WebSocket request of each execution. So if 6 WebSocket requests are being processed at any moment, each one takes about 7 seconds (see the quick check below). Is this latency (and this rough calculation) ok?
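
Quick check of the arithmetic, assuming Little's law applies here: with an arrival rate of 53 executions per minute ≈ 0.88 per second and 6 requests in flight on average, the mean time per request is 6 / 0.88 ≈ 6.8 s, i.e., roughly the 7 seconds stated above.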

Issue 2 Thanks for enlightening me in regard to the body_bytes_sent. Currently, we have only 3 requests with status code 413 per day. Therefore, a tcpdump would outgrow the available disk space before we catch one occurrence. We might reconfigure nginx [5] [6]?!

However, as we are allocating 200 MB just before sending it, I would expect to see request body sizes of the same order of magnitude.

Issue 3

At some point, we might move this into a separate GitHub issue, as it has a different priority and the investigation seems to be heading in a different direction.

Issue 3.1 Yes, we expect to have the same number of Goroutines each night. Currently, the data is not very reliable, as we do not continuously monitor the goroutines, but only when the memory threshold is exceeded.

The goroutines of our local storage are started on Poseidon startup or for each new environment and new runner. https://github.com/openHPI/poseidon/blob/48f32282e314a34bfd5140f4acfb27ee8f4d6ae4/pkg/storage/storage.go#L85
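
For illustration, a simplified sketch of this periodic-monitoring pattern (the names are mine, not the actual storage implementation): the goroutine only terminates once its context is cancelled, so a runner that is never destroyed keeps its monitoring goroutine alive indefinitely.

```go
package main

import (
	"context"
	"log"
	"time"
)

// startMonitoring launches a goroutine that reports on every tick until ctx is cancelled.
func startMonitoring(ctx context.Context, interval time.Duration, report func()) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return // only exit path: the owning runner/environment is destroyed
			case <-ticker.C:
				report()
			}
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	startMonitoring(ctx, time.Second, func() { log.Println("periodic storage update") })
	time.Sleep(3 * time.Second)
	cancel() // without this call, the goroutine above would run forever
	time.Sleep(time.Second)
}
```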

Due to our limited data, we have to mind the number of runners (and environments) for the time of the data (goroutine dump).
In the above case, on the 28th of April, we had 12 environments, 49 idle runners, and 24 used runners, which should result in 90 goroutines (incl. 5 static ones) for storages. However, we had 202.
On the 7th of May, we had 12 environments, 59 idle runners, and 39 used runners, which should result in 115 goroutines; instead, we had 336.
We can conclude that our understanding of Poseidon's Goroutines does not yet fully match the actual Goroutines.

Issue 3.2

Is this increase linked to the increased number of Goroutines? And what are these profileRecorders (are they related to the profiling we do for PGO)?

The profileRecorder is Sentry's object responsible for collecting and processing the profiling data on every tick. I have to correct myself: we do not have 28 more profileRecorders, but 28 more data allocations inside the (likely single) profileRecorder.
The increase is likely linked to the increased number of Goroutines because Sentry collects the StackTrace of all Goroutines. Because we have more Goroutines, more data gets allocated. As mentioned before, this accounts only for an increase of 2 MB (3.2 MB in total). Far more memory is consumed by Sentry's transactionProfile (whatever might be in there).
The profileRecorder is only as far related to the PGO as both rely on the data from Go's runtime package.
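
To illustrate why this scales with the number of Goroutines (Sentry's internal profiler works differently, but the scaling argument is the same): capturing the stacks of all Goroutines on every tick needs buffer space proportional to their count and stack depth.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	buf := make([]byte, 1<<20)    // 1 MiB buffer; a truncated dump would need a larger one
	n := runtime.Stack(buf, true) // true: include the stacks of all goroutines
	fmt.Printf("%d goroutines, stack dump of %d bytes\n", runtime.NumGoroutine(), n)
}
```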

MrSerth commented 4 months ago

Issue 1

Thanks for providing more details here.

  1. I feel that 200 MB per performance profile really seems to be a lot. Can we reproduce this behavior, e.g., by checking an isolated request on staging? I would just like to confirm that we are not mixing something up...
  2. Which latency are you referring to?

Issue 2 We could, temporarily, reconfigure nginx (I would probably not do it through Ansible, since we don't want this change to be persisted). Otherwise, we can just attach a large disk and capture there....

Regarding the size: When I last checked, most HTTP requests were not in the range of 200 MB, but we might have singular traces that large.

Issue 3 Yes, we could extract this into a dedicated issue (to simplify discussion in this one, too).

Issue 3.1 Can we add some further tracing of the number of Goroutines (and their origin, ...)? This might also assist in finding the memory issues, I would assume.
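
One possible sketch of such tracing (function and file names are illustrative): log the current Goroutine count and dump the Goroutine profile with full stacks, which includes the "created by" line revealing each Goroutine's origin.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=2 prints every goroutine's full stack, including its "created by" line.
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}

func main() {
	log.Printf("currently running goroutines: %d", runtime.NumGoroutine())
	if err := dumpGoroutines("goroutines.txt"); err != nil {
		log.Fatal(err)
	}
}
```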

Issue 3.2 Placeholder

mpass99 commented 4 months ago

Issue 3.1 Since the Goroutines are stuck in the monitoring loop, we can identify the runners that keep sending periodic updates but have not been used that day. In the case of the 7th of May, we could identify 121 such runners.

Next, we will analyze the log entries for some of these runners to identify reasons why the runner (and monitoring) context has not been stopped, i.e., why the runner has not been destroyed (at the latest by the Inactivity Timer).

mpass99 commented 4 months ago

The logs reveal two different issues leading to memory leaks.
The simpler one is addressed in #601, and the other in the separate issue #602.

MrSerth commented 4 months ago

Issue 2 I made some progress with Nginx and shared the log file privately with you. The main issue is that the content length (of the JSON request sent) was 60236494 bytes (60.24 MB), which is above our configured 50 MB limit. From my impression (based on the incomplete request), this issue could also be caused by the large memory footprint.

mpass99 commented 4 months ago

We conclude that the interplay between Goroutine leaks within Poseidon and the usage of Sentry caused the issue.

We identified two causes of Goroutine leaks (#601, #602). However, they were leaking so slowly that the direct impact did not cause any issues. Instead, Sentry caused high memory usage in response to the Goroutine Leaks. To be precise, Sentry's Profiling Feature collects the stack traces of all Goroutines at a fine sampling rate. Therefore, the memory usage of the profiling depends mostly on the sampling rate (that has no documented configuration option), the number of Goroutines, and the variability of the Goroutine Stacks.

With the two action items addressed, we will reduce the number of Goroutines and the memory usage, and close this investigative issue.