Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Why is dev.library.kiwix.org regularly extremely slow? #194

Closed (kelson42 closed this issue 1 month ago)

kelson42 commented 6 months ago

I guess the hardware is at its limit, but which services are mostly responsible for that?

rgaudin commented 6 months ago

It's not (entirely) a hardware issue; it's kiwix-serve crashing frequently. We chose to expose it directly to be aware of those things, but it seems that the higher number of test ZIMs has increased the instability.

I'd suggest we assess the need for each ZIM there and remove, or move elsewhere, those that don't need to be there (anymore). This would greatly simplify any investigation.

We already have a ticket on libkiwix about those crashes.

If we want to rely on the dev library, kiwix-serve should not be exposed.

benoit74 commented 6 months ago

I don't think it is a hardware limitation either: library.kiwix.org is running on the same machine and is not experiencing much slowdown.

The dev library is currently serving 897 ZIMs, which should not be a concern (at the very least, I would expect kiwix-serve to be able to handle this number of ZIMs when run anywhere in the wild).

So while we all agree we could probably prune most of the ZIMs present in this dev library, I don't think this is the right approach yet.

The current situation is rather a good opportunity to learn what is going wrong.

This is the memory consumption of kiwix-serve for the dev library (timezone is UTC):

[graph: kiwix-serve memory consumption, dev library]

As you can see, it restarts many times per day. Some of these restarts (e.g. at 4am UTC this morning) are linked to a rolling update triggered by a new image being available (we use the nightly build, which is obviously rebuilt quite often), hence the short period of doubled RAM usage (Kubernetes starts the new, updated container before stopping the old one).
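
For context, that brief doubling is the expected behavior of a Kubernetes rolling update when the Deployment is allowed to surge; here is a minimal sketch of such an update strategy (illustrative values, not necessarily what the chart actually uses):

```yaml
# Illustrative Deployment fragment: with maxSurge=1 and maxUnavailable=0,
# Kubernetes starts the new kiwix-serve pod before terminating the old one,
# so both instances (and their RAM) coexist briefly during the nightly rollout.
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```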

What is interesting to notice is that it seems to restart every time we get close to 1G of RAM, which is the memory limit we've assigned to this container in Kubernetes. It does not look like an OOM kill, however: I cannot find the usual logs stating this event. This is nevertheless a very significant difference with prod, which does not have any limit on memory consumption.

As an experiment, I've increased the memory limit to 1.5G so that we can confirm whether there is a correlation between the memory consumption/limit and the service restarts.
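
For reference, that limit lives in the container's resources block; a hedged sketch with the values discussed here (the request value and exact field path are assumptions, not the actual chart contents):

```yaml
# Sketch of the kiwix-serve container resources for dev.library. The limit
# caps the container's memory usage; the request below is a hypothetical
# placeholder, it is not stated anywhere in this thread.
resources:
  requests:
    memory: "512Mi"    # hypothetical
  limits:
    memory: "1536Mi"   # previously "1Gi"; raised to ~1.5G for this experiment
```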

Another aspect to keep in mind, as already stated in https://github.com/kiwix/k8s/issues/147, is that we have a number of levers at our disposal to customize kiwix-serve's behavior and control its memory consumption. None of them have been customized for dev.library. I still believe that running small experiments with these values would greatly help us understand and properly tune kiwix-serve's behavior.
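
As a rough illustration only, assuming those levers are exposed as kiwix-serve command-line options and environment variables on the container (the ZIM_* cache variable names below are assumptions to verify against the libzim/kiwix-serve documentation, not confirmed settings from this thread):

```yaml
# Hypothetical container fragment showing where such levers would be set.
# --library and --threads are standard kiwix-serve options; the ZIM_* cache
# variables are assumed names, to be checked upstream before relying on them.
containers:
  - name: kiwix-serve
    image: ghcr.io/kiwix/kiwix-serve:dev   # illustrative image reference
    args: ["--library", "/data/library.xml", "--threads", "4"]
    env:
      - name: ZIM_CLUSTERCACHE    # assumed: per-archive cluster cache size
        value: "4"
      - name: ZIM_DIRENTCACHE     # assumed: dirent cache size
        value: "256"
```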

benoit74 commented 6 months ago

I've pushed the memory graph to a dashboard dedicated to the dev library. I hope we will enrich this dashboard with more metrics over time.

https://kiwixorg.grafana.net/d/fdlyk9cwqr8xsb/dev-library?orgId=1

rgaudin commented 6 months ago

> So while we all agree we could probably prune most of the ZIMs present in this dev library, I don't think this is the right approach yet.

My suggestion is linked to https://github.com/kiwix/libkiwix/issues/760. We believe that some incorrect (how?) ZIMs trigger crashes. Since we did not remove ZIMs from there, the culprits from that time are probably still present. I am curious to know whether removing them would reduce the number of crashes/restarts.

We want to investigate those crashes, but it's unrealistic because the library is huge, we have no idea which ZIMs cause issues, and the kiwix-serve logs are unusable: because of their formatting, because there is multi-user traffic at all times, and because kiwix-serve doesn't log properly when this happens.

The periodic restarts might be RAM-related; we'll see if the graph pattern repeats, but around 1.5GB 👍

benoit74 commented 6 months ago

Increasing the available memory might then help with the restarts but make the situation worse regarding crashes ^^

If we confirm we still have crashes, I would suggest simply trashing most of the dev library in a one-shot manual action:

kelson42 commented 6 months ago

> The current situation is rather a good opportunity to learn what is going wrong.

I really agree with this. What approach could we take to better identify the reproduction steps for crash scenarios?

To me, there's a good chance we have a problem around Kiwix Server management; see also https://github.com/kiwix/libkiwix/issues/1025

benoit74 commented 6 months ago

The conclusion of the experiment seems quite clear: when we add more RAM, the DEV server restarts far less often.

[graph: dev library memory consumption after the RAM increase]

I increased the allocated RAM even further, to 2.5GB, which seems to be sufficient for 24h of activity (the DEV server always restarts at 4am UTC to apply the nightly build). I'm not saying this is the proper long-term solution, but it might allow us to confirm whether we still suffer from crashes, and when.

> What approach could we take to better identify the reproduction steps for crash scenarios?

I don't know

rgaudin commented 6 months ago

Fortunately, there's a lot of RAM to spare on the storage server.

benoit74 commented 6 months ago

> Fortunately, there's a lot of RAM to spare on the storage server.

Yep, and I'm quite sure I will soon start experimenting with the kiwix-serve environment variables to bring this RAM usage down to a much more sustainable level 🤓

rgaudin commented 6 months ago

Yes, as discussed separately; it's really important that those switches are properly documented so we can also leverage them on the hotspot.

kelson42 commented 6 months ago

See also #170

kelson42 commented 6 months ago

@rgaudin @benoit74 I believe we might run a performance-push taskforce around kiwix-serve to tackle these kinds of problems. It might actually be a hackathon topic.

rgaudin commented 6 months ago

[screenshot 2024-05-27 at 08:08]

Twice this week, the GH Action that runs at 8am UTC failed: on May 25th and on May 27th. In both cases I get Read timed out (5s) on the test, but the service is running, has not restarted, and is not close to the RAM limit in the graph. Testing some random ZIM/content shortly after one failure worked OK. Maybe some requests from the tests (all are catalog-related) are difficult to answer within 5s under certain circumstances…
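
For illustration, a scheduled probe of this kind could look like the sketch below (workflow name, runner and catalog path are assumptions, not the actual kiwix/k8s workflow); the relevant part is the hard 5-second budget, which turns a slow-but-alive service into a red check:

```yaml
# Hedged sketch of a daily 08:00 UTC probe against the dev library catalog.
name: dev-library-probe
on:
  schedule:
    - cron: "0 8 * * *"   # 08:00 UTC, matching the failing checks
jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - name: Fetch the OPDS catalog with a 5s budget
        run: |
          # --max-time 5 makes the request fail if it is not answered within
          # 5s, mirroring the "Read timed out (5s)" behavior of the real test.
          curl --fail --silent --show-error --max-time 5 \
            "https://dev.library.kiwix.org/catalog/v2/entries" > /dev/null
```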

benoit74 commented 6 months ago

As discussed yesterday, we all agree it is now time to move to a plan B, but that plan is still unclear.

From my perspective, the experience with dev.library.kiwix.org is much better than before the RAM increase, but it is still not satisfactory, i.e. there are still some slowdowns.

After some thought, I wonder if these slowdowns are not simply linked to I/O issues on the disk. In production, these issues could be hidden by the Varnish cache, which is expected to be especially efficient on the catalog and hence wouldn't trigger problems in the 8am UTC tests.

How easy would it be to put a Varnish cache in front of dev.library.kiwix.org as well? It looks pretty straightforward to me, and even if it is clearly a flight forward, it would help confirm that the problem is most probably I/O-related.
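
A minimal sketch of that idea, assuming a Varnish sidecar in the same pod with kiwix-serve listening on port 8080 (names, ports and image tag are illustrative, and this is not the library.kiwix.org production setup):

```yaml
# ConfigMap holding a minimal VCL that proxies every request to the local
# kiwix-serve container; Varnish's default caching rules then apply.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dev-library-varnish-vcl
data:
  default.vcl: |
    vcl 4.1;
    backend default {
      .host = "127.0.0.1";   # kiwix-serve container in the same pod
      .port = "8080";
    }
---
# The Deployment would then gain a varnish container next to kiwix-serve,
# mounting the VCL above, and the Service would point at it, e.g.:
#   - name: varnish
#     image: varnish:7.5        # official image reads /etc/varnish/default.vcl
#     ports:
#       - containerPort: 80
#     volumeMounts:
#       - name: vcl
#         mountPath: /etc/varnish
```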

rgaudin commented 6 months ago

> it would help confirm that the problem is most probably I/O-related.

Absolutely not. It would hide everything but the first request to a resource. It would be a good measure to improve the service for users, but it will not help (on the contrary) with finding the actual cause(s) behind this.

I am still awaiting an update clarifying the role of dev.library. We had a lot of discussions about this when we started it but it seems to have shifted.

Currently this is an internal testing tool:

I understand we are now sending dev.library links to users/clients. That's the role of a staging library.

Do we want prod/staging/dev? Just prod/staging?

kelson42 commented 6 months ago

> Currently this is an internal testing tool:

Yes, and we can see that changing its scope might quickly become challenging. Therefore I have opened a dedicated issue to think about our requirements out of the box. See #199

rgaudin commented 5 months ago

Still fails every day (timeout).

[screenshot 2024-06-11 at 08:06]

Apparently not related to resources (it restarts at 04:00).

[screenshot 2024-06-11 at 08:08]

benoit74 commented 1 month ago

Is there still a problem or shall we close this?

rgaudin commented 1 month ago

Workflows haven't failed since we changed the HW.