robur-coop / unipi

Serving content from a git repository via HTTPS (including Let's Encrypt provisioning) as a MirageOS unikernel

Out of memory after 3 months of use #32

Open · kit-ty-kate opened this issue 2 months ago

kit-ty-kate commented 2 months ago

I deployed a unipi instance in early May. A few days ago I tried to access it, only to find the service was down, with this error message in the log:

[...]
2024-07-31T04:43:27-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:43:27-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:43:28-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:43:28-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:43:28-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:43:28-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:43:28-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
2024-07-31T04:44:03-00:00: [INFO] [tcp.segment] Max retransmits reached for connection - terminating
Fatal error: out of memory
Aborted
Solo5: solo5_abort() called
Solo5: Halted


hannesm commented 2 months ago

Thanks for your report. You don't happen to have any detailed GC statistics, do you?

My suspicion is that it is mainly the (old) TCP stack that, in certain scenarios (with real-world Internet traffic), leaks memory. Unfortunately the new TCP stack still has some other issues and thus is not ready for prime time yet.

An interesting data point would be how often you updated the data in the unikernel (via the /hook URL, doing a git pull on the data repository) -- if at all. (The reason behind that question is to exclude the git client from the considerations of memory usage.)

What I can say: thanks for testing, sorry it behaves that way for you, and we're working hard to get towards a more performant and less leaky stack. :)

kit-ty-kate commented 2 months ago

> Thanks for your report. You don't happen to have any detailed GC statistics, do you?

Sadly not. Is there a way to get more information when this type of critical error happens? (something like a gc-verbose=true option, so that if it happens again we have more information)
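
For illustration, periodic GC reporting inside an Lwt-based unikernel could look roughly like the minimal sketch below; the Mirage `Time` device, the 60-second interval, and the choice of logged fields are all assumptions, not unipi's actual code.

```ocaml
(* Hypothetical sketch: log a few Gc.quick_stat counters at a fixed
   interval, so a leak shows up in the console log long before the
   unikernel aborts with "out of memory". Assumes a Mirage time device
   [Time : Mirage_time.S] is in scope. *)
let rec log_gc_stats () =
  let s = Gc.quick_stat () in
  Logs.info (fun m ->
      m "GC: heap %d words (top %d), %d minor / %d major collections"
        s.Gc.heap_words s.Gc.top_heap_words
        s.Gc.minor_collections s.Gc.major_collections);
  (* Duration.of_sec converts seconds to the nanoseconds sleep_ns expects. *)
  Lwt.bind (Time.sleep_ns (Duration.of_sec 60)) log_gc_stats
```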

> My suspicion is that it is mainly the (old) TCP stack that, in certain scenarios (with real-world Internet traffic), leaks memory. Unfortunately the new TCP stack still has some other issues and thus is not ready for prime time yet.

No worries. Whenever you have an alpha version that's ready to test, I'll be happy to give the new stack a try.

> An interesting data point would be how often you updated the data in the unikernel (via the /hook URL, doing a git pull on the data repository) -- if at all. (The reason behind that question is to exclude the git client from the considerations of memory usage.)

Unless some robot used /hook over and over, I've only used it once, on July 7. To rule this out, it might be useful to have a password field on the /hook page so it isn't triggered by some random robot, by accident or malice.

> What I can say: thanks for testing, sorry it behaves that way for you, and we're working hard to get towards a more performant and less leaky stack. :)

No problem at all, thanks for creating unipi!

kit-ty-kate commented 2 months ago

> Unless some robot used /hook over and over, I've only used it once, on July 7. To rule this out, it might be useful to have a password field on the /hook page so it isn't triggered by some random robot, by accident or malice.

Actually, replying to myself here: I can simply set --hook=<some random password>. I'll do that on reboot.
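
For illustration, the effect of such a secret hook path could look roughly like the sketch below; `pull_and_reload` and `serve_content` are hypothetical names passed in as parameters, not unipi's actual functions.

```ocaml
(* Hypothetical sketch: only a request whose path matches the secret
   --hook value triggers a pull; every other path is served as normal
   content, so a long random value makes the endpoint hard to guess. *)
let dispatch ~hook ~pull_and_reload ~serve_content path =
  if String.equal path ("/" ^ hook) then
    pull_and_reload ()   (* git pull, then serve the updated tree *)
  else
    serve_content path
```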

reynir commented 2 months ago

> > Thanks for your report. You don't happen to have any detailed GC statistics, do you?
>
> Sadly not. Is there a way to get more information when this type of critical error happens? (something like a gc-verbose=true option, so that if it happens again we have more information)

There is the --enable-monitoring compile-time flag. With that you can get metrics to influx, including GC metrics. I realize this requires more setup on your part, which may not be desirable for you.
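
As a rough illustration of the shape of the API this builds on, a custom source for the `metrics` library might look like the sketch below; the source and field names are made up for the example, and the monitoring support already reports GC data on its own, so none of this is required setup.

```ocaml
(* Hypothetical sketch: a metrics source sampling two Gc.quick_stat
   counters. With monitoring enabled, the installed reporter forwards
   such data points to influx. *)
let gc_src =
  let open Metrics in
  let data () =
    let s = Gc.quick_stat () in
    Data.v
      [ int "heap_words" s.Gc.heap_words
      ; int "major_collections" s.Gc.major_collections ]
  in
  Src.v ~tags:Tags.[] ~data "gc"

(* Call this wherever a sample should be taken. *)
let sample () = Metrics.add gc_src (fun t -> t) (fun d -> d ())
```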

hannesm commented 2 months ago

So the metrics won't suffice (as far as I know). It is more something like memtrace that would be useful -- but then we bite into the Cstruct.t apple, and memtrace isn't too useful for these bigarray allocations.

Bottom line: I think it would be great to get the new TCP stack out the door, and to spend more time on improving that.