kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

What about crawlers? #240

Open rgaudin opened 3 months ago

rgaudin commented 3 months ago

Last night the Kiwix Wiki monitor was constantly throwing errors (Connection Timeout, 502). The service did not restart but apparently had difficulty handling a high number of requests. Peeking at the live logs for a moment, I saw continuous requests from crawlers: Bytedance, Amazon, Claude, OpenAI and Bing all showed up within that few-second window.
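To put numbers on that impression, a rough tally of crawler hits per operator could look like the sketch below. It assumes the web server writes access logs in the combined log format (user agent as the last quoted field), and the user-agent substrings are my guess at what these crawlers send; both should be checked against the real logs. The sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed user-agent substrings for the crawlers seen in the logs.
# This mapping is illustrative; verify it against real log entries.
CRAWLER_TOKENS = {
    "bytespider": "Bytedance",
    "amazonbot": "Amazon",
    "claudebot": "Claude",
    "gptbot": "OpenAI",
    "bingbot": "Bing",
}

# In the combined log format the user agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_crawlers(lines):
    """Count requests per crawler operator; anything unmatched is 'other'."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        user_agent = match.group(1).lower()
        for token, operator in CRAWLER_TOKENS.items():
            if token in user_agent:
                counts[operator] += 1
                break
        else:
            counts["other"] += 1
    return counts


# Made-up sample lines (documentation IP range), not taken from our logs.
SAMPLE = [
    '192.0.2.1 - - [01/Jan/2025:00:00:01 +0000] "GET /wiki/Main_Page HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.2 - - [01/Jan/2025:00:00:01 +0000] "GET /wiki/Main_Page HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '192.0.2.3 - - [01/Jan/2025:00:00:02 +0000] "GET /wiki/Main_Page HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
]

if __name__ == "__main__":
    for operator, hits in count_crawlers(SAMPLE).most_common():
        print(operator, hits)
```

Running it over a window of real logs (for example by passing an open file object instead of `SAMPLE`) would show whether one crawler dominates or the load is spread across all of them.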

I added a deny-all robots.txt to both wikis, as neither had one, but I suppose crawlers don't check it frequently (if they do at all). Nevertheless, about 30 minutes later the monitor reported a successful check.

Now that these crawlers are more frequent, more widespread and are impacting our infrastructure, we might want to discuss what to do. Generalizing robots.txt across our services seems in order. Should we do more?

benoit74 commented 2 months ago

I don't think that removing our wikis from search engines is an adequate move. These wikis hold very important information for newcomers. They are already hard to find, and it would become even worse without indexing in search engines.

While "AI crawlers" are probably less relevant, I think it could still be useful to let them proceed as well, given that tools like LLMs might soon replace search engines in many users' workflows when looking for information.

I think the problem is rather that our wiki is unable to cope with the load "imposed" by these crawlers, and we need to find a solution for this.
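One possible middle ground between a deny-all robots.txt and no robots.txt at all is a policy that keeps article pages crawlable, asks crawlers to slow down, and keeps the expensive dynamic endpoints off-limits. The sketch below compares the two policies with Python's standard `urllib.robotparser`. The throttled policy is hypothetical: it assumes a short-URL MediaWiki layout (`/wiki/` for articles, `/w/` for scripts), which may not match our wikis, and the 10-second delay is an arbitrary value. `Crawl-delay` is a non-standard directive that not every crawler honours, so this would reduce load only for crawlers that respect it.

```python
from urllib.robotparser import RobotFileParser

# Deny-all policy: every crawler is asked to stay away from every path.
DENY_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical throttled policy (assumes /wiki/ serves articles and /w/
# serves index.php, history, diffs, etc.; adjust to the real layout).
THROTTLED = """\
User-agent: *
Crawl-delay: 10
Disallow: /w/
"""


def load_policy(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    for name, text in (("deny-all", DENY_ALL), ("throttled", THROTTLED)):
        policy = load_policy(text)
        print(
            name,
            "article allowed:", policy.can_fetch("bingbot", "/wiki/Main_Page"),
            "script allowed:", policy.can_fetch("bingbot", "/w/index.php"),
            "crawl delay:", policy.crawl_delay("bingbot"),
        )
```

This only checks what a well-behaved crawler would conclude from each file; it says nothing about crawlers that ignore robots.txt, which would still need to be handled at the server or proxy level.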

Were you able to confirm that adding a robots.txt reduced the load / occurrences of connection timeouts?

rgaudin commented 2 months ago

I did not look at how the load evolved, but in the 24h since the change there hasn't been any UptimeRobot alert, so there is at least a correlation.

I share your opinion that this is original content that has value and should be indexed. Fixing our MediaWiki setup is probably the proper thing to do… but not the cheapest.

benoit74 commented 2 months ago

I never said that I was happy to have to find time to fix the MediaWiki setup, or that it was going to be an easy feat ^^