matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0
5.67k stars 664 forks source link

NATS JetStream startup message indicates more memory available than soft memory limit set by GOMEMLIMIT #3436

Open paigeadelethompson opened 3 hours ago

paigeadelethompson commented 3 hours ago

​ The Go authors explicitly label GOMEMLIMIT a "soft" limit. That means that the Go runtime does not guarantee that the memory usage will exceed the limit. Instead, it uses it as a target.

Just to be clear I'm not missing that point at all nor should anybody else misunderstand. It actually does fairly good job of minimizing the amount of memory it will try to use but there are some loose ends in Dendrite that will eventually run away and the system (having 2GB total) will ultimately become unusable. It's also not unimaginable that a glimpse into a profiler (maybe pprof) might reveal a couple of things that could be made to work more efficiently.

I really want to say yes, because if nothing else you can limit the work load and refuse work, preempt less important work and as near as I can tell to some extent it already does this but I think the thing that is missing is either a backlog or just outright refusing work that the process doesn't have the capacity for--given a soft memory limit.

Essentially what I have to do because of this is set a hard limit, sort of like an "upper maximum" that will ensure that the process is OOM killed when it finally runs away and makes the whole system unusable. I still believe that Dendrite itself should be able to make the best use of what it has available to it and refuse work gracefully, in general this isn't a hobby that a lot of people are willing to pay money for if even a little so you can reasonably expect most of the people who are going to be using this to want to run on the absolute minimum set of requirements. I gather there are some people who are working on this project who in fact tell you that they're running on just 2GB of memory without a SOFTLIMIT but I would guess the problem with that is they're not actually trying to cruise around and be in more than 1-2 channels.

On an unrelated note and one that is probably out of scope for this reported issue I think a big problem with Matrix is that a lot of people don't know what the hell it's doing and the conclusion that people arrive at is that it's more broken than it actually is; if more instrumentation could be made available through common frontends like element about events (failed/in-flight/succeeded), kind, event duration by room, virtually anything that is measurable that helps people arrive at a conclusion about whether the problem is with them or somebody else would help to eliminate myths about whether or not it works correctly.

paigeadelethompson commented 3 hours ago

Also worth considering:

Summary: If I can find some time to setup the profiler and gather some usable metrics I will but just in case anybody feels inclined to try themselves; personally I think for regression y'all should be doing your development with a set GOMEMLIMIT because it says a lot more about how good of a job you're doing when your code can stand reliably on it's own given a substantial workload but for all I know people already are.