karlseguin / http.zig

An HTTP/1.1 server for zig
MIT License

Investigating incident on a server with http.zig #13

Closed · zigster64 closed this 11 months ago

zigster64 commented 1 year ago

NOT reporting this as a bug in http.zig - just an observation of some weird behaviour in my app that uses http.zig. Might be a worthy datapoint if you are having any weird resource usage on your app as well.

Running a server on the open internet using http.zig. It is similar to z-chat, but a bit more complicated internally.

It normally averages really low CPU and reasonably flat memory usage that grows a little each 24-hour period. It's been up for around a week with no noticeable issues.

There was an incident the other day that pegged the CPU at 100%, plus what appears to be a memory leak. There was no corresponding flood of external network traffic around that time. Hmmm.

[image: CPU and memory usage graph]

I'm using the GeneralPurposeAllocator (GPA) as the allocator.

So - most of the requests should be operating under an arena, and I do expect to have one extra thread per client to handle the SSE conversation. All of the SSE connections use timed waits plus a keepalive write, so if the client end walks away, each SSE thread should catch a write error and clean itself up.

So I'm not expecting much in the way of memory leaks in http.zig, mainly because of the arena.

The application is pretty simple - it does a lot of res.body = someLargeArrayOfBytes ... which I expect to be cleaned up properly at the end of the response, again thanks to the arena.

There is no filesystem I/O - everything is handled in memory, using static content courtesy of @embedFile.
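For context, the handler pattern looks roughly like this (a minimal sketch assuming the usual http.zig handler signature; serveAudio and the embedded path are made-up names):

const httpz = @import("httpz");

// Minimal sketch of the pattern described above (not the actual app code):
// the response body points at comptime-embedded static data, so nothing is
// heap-allocated for it, and anything else allocated while handling the
// request belongs to the per-request arena.
fn serveAudio(req: *httpz.Request, res: *httpz.Response) !void {
    _ = req;
    res.body = @embedFile("audio/some_data.mp3");
}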

There's a 99% chance the problem is in my app code, of course ... will dive in and try to diagnose.

Interested to see how your public server is going with respect to resource usage. Thx

karlseguin commented 1 year ago

One possible issue is that the arena doesn't get reset until the connection ends. For a keepalive connection, that could lead to a sort of leak. This could be particularly problematic if you're using a reverse proxy with a long timeout (e.g. nginx's keepalive_timeout setting).

I've changed the code so the arena is reset for each request, even within a keepalive connection. This was originally excluded because it would segfault, and I could never figure out why. There seem to have been fixes to the arena allocator since then (e.g. https://github.com/ziglang/zig/commit/5d3c8f4913884a4503e9f183e471b6090bb5bc92) and I can't reproduce the original issue, so it's hopefully safe to re-enable. Over a keepalive connection, this should result in even flatter memory usage.
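The change boils down to something like this (a generic sketch of the ArenaAllocator pattern under Zig 0.11, not the actual http.zig internals):

const std = @import("std");

// Sketch of the per-request arena reset described above: allocations made
// while handling one request are freed by reset(), while retain_capacity
// keeps the arena's buffers around for the next request on the same
// keepalive connection.
pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    var i: usize = 0;
    while (i < 3) : (i += 1) { // pretend each iteration is one keepalive request
        const body = try arena.allocator().alloc(u8, 64 * 1024);
        @memset(body, 0);
        // ... build and send the response ...
        _ = arena.reset(.retain_capacity); // per-request reset
    }
}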

For my own app, I'm unfortunately using a lot of caching, and caching just looks like a memory leak. I've seen the memory grow by very small amounts, but I can't tell if that's just the cache or something else. Also, I've been deploying regularly which also tends to mask any issue.

zigster64 commented 1 year ago

Nice to know about the Zig arena update - I didn't know about that.

I can afford long release cycles on this app, so I'll add a tonne of logging, including memory allocation logging ... then run it for a week and see what I can find. Will report back by the end of the week.

I would like to blame a malicious 3rd-party spam meister for the CPU spike, but there is no evidence for that. It would be in my app somewhere (I am admittedly doing some nasty things with clocks and mutexes, and it's easy to make a mistake there).

Thx for the quick reply too !

zigster64 commented 12 months ago

hmmm

I have some handler code that does things like this:

const w = res.writer();
try w.writeAll(@embedFile("audio/some_data.mp3"));

where the length of some_data.mp3 is > 32k (the default response buffer size).

So next build, when I init the server, I'm going to set response.buffer_size to something big enough to hold the largest static file that I'm @embedding, and see if that makes any difference.

I'm also taking that opportunity to cut the max request body size down to something really small, to reduce the impact of malicious packets. It's nice having these variables under tight control to match my expected data.
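Something along these lines at server init (a hypothetical sketch: the config field names come from the comments above and may not match the current http.zig version):

const std = @import("std");
const httpz = @import("httpz");

// Hypothetical tuning sketch, not verified against a specific http.zig release.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    var server = try httpz.Server().init(allocator, .{
        .port = 5882,
        .response = .{ .buffer_size = 512 * 1024 }, // fits the largest @embedFile'd asset
        .request = .{ .max_body_size = 1024 },      // tiny cap; no large request bodies expected
    });
    try server.listen();
}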

Note - it's now Thursday, and ... I saw a CPU spike earlier in the week, but it hasn't hit again yet. Of course not - bugs never seem to manifest when you are deliberately watching for them!

zigster64 commented 12 months ago

I'm seeing the CPU spike on a regular basis now - but it appears to take 2-3 days to happen. That's pretty annoying :)

Can't replicate this locally, just on AWS.

I have put up a PR that might clean up some peripheral issues with handling BrokenPipe errors - hoping this fixes it, but it's extremely hard to pin down whether this is the source of my problem or not.

karlseguin commented 12 months ago

I've looked into this some more using both Zig's leak detector and Valgrind. I focused on using the writer with data larger than the initial buffer. I can't reproduce a leak.
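(For anyone reading along, the Zig-side leak check here is presumably the GeneralPurposeAllocator's built-in report - roughly this, under Zig 0.11:)

const std = @import("std");

// Rough sketch of a GPA leak check (Zig 0.11): deinit() reports any
// allocations that were never freed and returns .leak if it found some.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer if (gpa.deinit() == .leak) @panic("memory leak detected");

    const allocator = gpa.allocator();
    const buf = try allocator.alloc(u8, 128);
    defer allocator.free(buf); // remove this line and the GPA logs the leak at exit
}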

Your code isn't available online, is it?

zigster64 commented 12 months ago

Running with that "fix" for BrokenPipe - AWS appears pretty good so far. I doubt that small change fixes the CPU spike I saw before, but who knows? I can't reproduce it locally either; it's only on AWS.

But it looks good at the moment:

[image: resource usage graph]

I've cleaned up my app code a bit, and synced http.zig against your latest updates. Will do a launch of the app now and open the floodgates to the Zig public via Discord. Might see a lot more user activity over the weekend.

Also - I've taken the opportunity to build everything against Zig 0.11.0 official ... so it seems like a good time to do a release.

zigster64 commented 11 months ago

Interesting ... got the CPU spike again

Application logs show that sometime before the CPU spike, the SSE handler (the one with the forever-open response that gets periodically written to) gets an error.Unexpected when writing to the client.

Sounds like another unpolished rough edge in the stdlib, if that's the case.

Will try adding that to the list of error types that bypass writing the response (BrokenPipe, etc.).
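Roughly this shape in the keepalive loop (a sketch, not the actual handler code; the stream type and interval are illustrative):

const std = @import("std");

// Sketch of the SSE keepalive thread: any write error (BrokenPipe,
// Unexpected, ...) is treated as "the client is gone", so the thread
// stops instead of trying to keep writing the response.
fn keepaliveLoop(stream: std.net.Stream) void {
    while (true) {
        std.time.sleep(15 * std.time.ns_per_s);
        stream.writeAll(": keepalive\n\n") catch return;
    }
}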

There is no matching memory spike alongside the CPU spike. Very odd!

zigster64 commented 11 months ago

OK, going to half-close this one

The CPU spike is fixed after doing the work to back out of processing requests if there is a write error.

This suggests that what might have been happening is something like this pseudocode:

fn writeAll(fd: std.os.fd_t, string: []const u8) !void {
    var written: usize = 0;
    while (written < string.len) {
        written += try std.os.write(fd, string[written..]);
    }
}

and something happening in the low-level implementation of write() in the stdlib that returns 0 bytes written instead of an error. That would cause the thread to thrash a CPU core, exactly as observed. Can't for the life of me see any such code in the stdlib that would do that, though.
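If that theory is right, a defensive variant would treat a zero-byte write as a dead connection instead of looping, something like (a sketch, not the stdlib's actual code):

const std = @import("std");

// Defensive sketch: if write() ever reports 0 bytes written, give up on the
// connection rather than spinning in the loop at 100% CPU.
fn writeAllGuarded(fd: std.os.fd_t, bytes: []const u8) !void {
    var written: usize = 0;
    while (written < bytes.len) {
        const n = try std.os.write(fd, bytes[written..]);
        if (n == 0) return error.ConnectionResetByPeer;
        written += n;
    }
}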

So that bit about CPU thrashing is all fixed now.

With the "memory leak" - that's a really weird one. Instrumenting the heck out of this on Mac, it 100% isn't leaking. I can see mallocs happening when a thread is spawned and detached, and I can see the matching frees happening when the thread terminates. Run it for days locally, hammer it hard with test rigs, and it's stable. Memory usage hovers around the 10MB range for days. Very impressive actually.

Looking through the Zig stdlib, the implementation of the pthread wrapper code is very different between Mac (POSIX) and Linux. The Linux code dives off into inline assembly that I don't understand yet. The production site that is showing the memory leak is Amazon Linux on aarch64 ... so maybe this is some edge case in the Zig stdlib thread wrapper for Linux + ARM?

Running some experiments to prove that.

That might take another week yet

zigster64 commented 11 months ago

Closing this now

I have added memory logging to my app, so on every HTTP event it logs rusage stats and also prints a diff showing how many bytes have been used after the handler completes.
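The logging is nothing fancy, roughly along these lines (a sketch of the approach rather than the actual app code; assumes Zig 0.11's std.os.getrusage, and note maxrss is reported in kilobytes on Linux):

const std = @import("std");

// Sketch: capture rusage before a handler runs and log the maxrss delta
// afterwards, so any growth shows up per-event in the logs.
fn logRusageDiff(label: []const u8, before: std.os.rusage) void {
    const after = std.os.getrusage(std.os.rusage.SELF);
    std.log.info("{s}: maxrss {d} -> {d} (diff {d})", .{
        label, before.maxrss, after.maxrss, after.maxrss - before.maxrss,
    });
}

// In a handler:
//   const before = std.os.getrusage(std.os.rusage.SELF);
//   defer logRusageDiff("someHandler", before);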

There is no leak to be seen in the app - and there is no matching activity in the app when AWS reports a jump in memory consumption either.

If the container itself is growing linearly over time (albeit by tiny amounts), it's not the Zig application doing it.

Thx for your patience, and sorry if that raised any false concerns. At least we now have good proof that the lib and my app are really watertight :)

btw - the logs show a fairly constant rain of script-kiddie attacks coming in, and they have zero bad effect. Because there is no file I/O at all, it's pretty hack-proof.