rgaudin opened 3 months ago
I don't think that removing our wikis from search engines is an adequate move. There is very important information for newcomers on these wikis; they are already hard to find, and would become even harder to find without search engine indexing.
While "AI crawlers" are probably less relevant, I think it could still be useful to let them proceed as well, given that tools like LLMs might soon replace search engines in many users' workflows when they look for information.
I think the problem is rather that our wiki cannot cope with the load "imposed" by these crawlers, and we need to find a solution for that.
Were you able to confirm that adding a robots.txt reduced the load / the occurrences of connection timeouts?
I did not look at the load evolution, but in the 24h since the change there hasn't been a single UptimeRobot alert, so there is a correlation.
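If we want to check the actual load evolution later, a quick pass over the access logs could compare crawler vs. other traffic per hour around the time of the change. A rough sketch; the log path, log format (common/combined) and crawler keywords are all assumptions about our setup, not its actual configuration:

```python
# Rough sketch: bucket requests per hour, split into crawler vs. other traffic,
# so the hours before and after the robots.txt change can be compared.
# Path, format and keywords below are assumptions, not our actual config.
import re
from collections import Counter

LOG = "/var/log/nginx/wiki.access.log"  # hypothetical path
CRAWLER_HINTS = ("Bytedance", "Amazon", "Claude", "OpenAI", "bingbot", "GPTBot")

# Combined log format: the timestamp is bracketed, the user agent is the last quoted field.
line_re = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"$')

crawler_hits, other_hits = Counter(), Counter()
with open(LOG, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = line_re.search(line)
        if not m:
            continue
        hour = m.group("ts")[:14]  # "12/May/2025:03:14:07 +0000" -> "12/May/2025:03"
        ua = m.group("ua")
        bucket = crawler_hits if any(h in ua for h in CRAWLER_HINTS) else other_hits
        bucket[hour] += 1

# Note: lexicographic sort, which is fine for eyeballing a single day.
for hour in sorted(set(crawler_hits) | set(other_hits)):
    print(f"{hour}  crawlers={crawler_hits[hour]:6d}  other={other_hits[hour]:6d}")
```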
I share your opinion that this is original content that has value and should be indexed. Fixing our MediaWiki setup is probably the proper thing to do… but not the cheapest.
I never said that I was happy to have to find time to fix the MediaWiki setup, or that it was going to be an easy feat ^^
Last night the Kiwix Wiki monitor was constantly throwing errors (Connection Timeout, 502). The service did not restart, but apparently had difficulties handling a high number of requests. Peeking at live logs for a second, I saw continuous requests from crawlers: Bytedance, Amazon, Claude, OpenAI and Bing were all mentioned in that few-second window.
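To quantify this beyond a few seconds of live logs, a small script could rank which user agents account for most of the requests. A rough sketch, assuming a combined-format access log at a hypothetical path:

```python
# Rough sketch (path and format are assumptions): rank the user agents seen in
# an access log so we can tell which crawlers generate most of the traffic.
from collections import Counter

LOG = "/var/log/nginx/wiki.access.log"  # hypothetical path

counts = Counter()
with open(LOG, encoding="utf-8", errors="replace") as f:
    for line in f:
        # In the combined log format the user agent is the last quoted field.
        parts = line.rstrip().split('"')
        if len(parts) >= 2:
            counts[parts[-2]] += 1

for ua, n in counts.most_common(15):
    print(f"{n:8d}  {ua}")
```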
I added a denying robots.txt for both wikis, as there was none, but I suppose crawlers don't check for it frequently (if they do at all). Nevertheless, about 30 minutes after that, a successful monitor check was seen.
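For reference, a blanket-deny robots.txt is only a couple of lines. Below is a minimal sketch (not necessarily the exact file now deployed), together with a quick sanity check using Python's standard urllib.robotparser that well-behaved crawlers would be denied every URL; the wiki URL is a placeholder:

```python
# Minimal sketch: a blanket-deny robots.txt (the exact file on our wikis may
# differ) plus a sanity check that compliant crawlers would be denied all URLs.
from urllib.robotparser import RobotFileParser

DENY_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(DENY_ALL.splitlines())

for ua in ("GPTBot", "ClaudeBot", "Bytespider", "bingbot", "Googlebot"):
    allowed = parser.can_fetch(ua, "https://wiki.example.org/wiki/Main_Page")  # placeholder URL
    print(f"{ua:12s} allowed={allowed}")  # expect False for every agent
```

Of course this only affects crawlers that actually honour robots.txt, which ties back to the doubt above about how often they check for it.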
Now that these events are more frequent, more widespread, and are impacting our infrastructure, we might want to discuss what to do. Generalizing robots.txt seems in order. Should we do more?