kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Library unavailability (3.5.0-2) #103

Closed rgaudin closed 10 months ago

rgaudin commented 11 months ago

This is a tracking ticket to observe the evolution of a new situation.

On Friday 2023-07-21 at 15:02 UTC, we upgraded library.kiwix.org to kiwix-serve 3.5.0-2.

Since then, Uptime Robot (HEAD https://library.kiwix.org) has seen two 503 Service Unavailable events:

This is unexpected. The last time we had a 503 was 2 years ago. This service is virtually never down.

Note: library.kiwix.org is composed of three services:

The catalog service/endpoint was not affected. For the sake of it, I've just created an uptime robot monitor for the catalog endpoint.

I found that this 503 is not the result of a problem with the k8s nginx proxy nor the varnish cache but a restart of the library-demo container that serves the home page.

This container restarts frequently (kiwix-serve crashing) but we've never been affected because:

Possibilities:

Notes:

rgaudin commented 11 months ago

Updated:

rgaudin commented 10 months ago

Updated:

rgaudin commented 10 months ago

I have just configured two replicas for both the catalog backend and the demo backend. This is not a real load balancer: each request is sent at random to one of the two replicas (probability 0.5 each). No round-robin or anything fancy; it's done by iptables at the routing level.
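To illustrate what "randomly, 0.5 each" means in practice, here is a minimal simulation sketch. It does not reproduce the cluster's iptables rules; the replica names and the `simulate` helper are hypothetical, and it only assumes that each request is routed independently of the previous ones.

```python
import random
from collections import Counter


def pick_replica(replicas, rng):
    """Pick one replica uniformly at random, independently for each request."""
    return rng.choice(replicas)


def simulate(n_requests, replicas=("replica-a", "replica-b"), seed=0):
    """Count how many of n_requests land on each (hypothetical) replica."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return Counter(pick_replica(replicas, rng) for _ in range(n_requests))


counts = simulate(10_000)
print(counts)
```

Over many requests the split approaches half and half, but consecutive requests from the same visitor can land on different replicas, which is the behaviour the cache question below is about.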

@mgautierfr do you think it could make things worse in terms of performance? Having a single instance ensured that we'd use the ZIM access cache correctly… but with two instances, we'd have ZIM caches on both for the same ZIM files…

Given every single page generates a lot of requests, I don't think it's gonna be much of an issue.

mgautierfr commented 10 months ago

I'm not sure this has a lot of impact at zim level. Of course, it will double the cache as now we have two caches (one per process). But a lot of things are either common or totally unrelated:

But it can have an impact at kiwix level, where we cache the whole Archive reader. If requests are balanced randomly, both instances will have to open the archive. Would it be possible to load balance randomly but based on the IP (all requests from the same IP go to the same instance)? For a common ZIM file (English Wikipedia), both instances will probably end up opening it, but an uncommon ZIM with only one reader will be handled by a single instance.
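The IP-based balancing suggested here can be sketched as a hash of the client address. This is only an illustration of the idea, not how the cluster is or would be configured; `replica_for_ip` and the replica names are hypothetical.

```python
import hashlib


def replica_for_ip(client_ip: str, replicas: list[str]) -> str:
    """Map a client IP to one replica, always the same one for a given IP."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it
    # to an index into the replica list.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-a", "replica-b"]
first = replica_for_ip("203.0.113.7", replicas)
second = replica_for_ip("203.0.113.7", replicas)
print(first, second)  # both calls return the same replica
```

With such a mapping, a ZIM file read by a single visitor is only ever opened on one instance, while popular ZIM files still end up open on both.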

rgaudin commented 10 months ago

We discovered this week that varnish was not caching kiwix-serve requests anymore since the change. Our assessment was thus incorrect about what happened. Closing.

FYI, we now have a daily check that notifies us should varnish stop serving cached responses.
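The check itself is not described in this ticket. As a rough sketch of one way such a check could decide whether a response came from a cache, the function below looks at the standard HTTP `Age` response header. The `looks_cached` name and the choice of `Age` as the signal are assumptions, and a cache hit during the first second after the object was stored would be missed.

```python
def looks_cached(headers: dict[str, str]) -> bool:
    """Return True if the response headers suggest a cache served the response.

    Assumption: a positive Age header means the response was stored by a
    cache some seconds ago instead of being generated for this request.
    """
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    try:
        age = int(normalized.get("age", "0").strip())
    except ValueError:
        # A malformed Age value is treated as "not cached".
        return False
    return age > 0


print(looks_cached({"Age": "120"}))
print(looks_cached({"Content-Type": "text/html"}))
```

A periodic job could request a page that should be cacheable twice in a row and raise an alert when the second response does not look cached.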