Closed: raphael0202 closed this issue 1 year ago
The most impactful fixes have been deployed (language-dependent robots.txt + noindex pages), so this issue can be closed once we're sure they are effective.
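For reference, a minimal sketch of what such a language-dependent robots.txt could look like (the paths and rules below are illustrative assumptions, not the deployed file):

```text
# Sketch only — not the deployed configuration.
# Served per subdomain; facet path names are localized per language,
# so each language subdomain needs its own Disallow rules.
User-agent: *
Disallow: /editor/
Disallow: /packager-code/
```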
Crawlers now account for ~15% of queries. The total number of requests made by Google seems to have stabilized at a lower level:
Closing this issue.
Context
In the past years, we've had an ongoing performance issue: the website is slow, often times out, and regularly prevents contributors from updating products. These performance issues can be partially attributed to undersized servers (for example, performance has degraded since the temporary migration of MongoDB to a smaller server), but @stephanegigandet suspected early on that the performance issue originates from requests involving aggregated MongoDB queries (such as the ones behind facets).
After a contributor couldn't save any of their changes, I wanted to investigate more precisely the reasons behind this poor performance.
Some numbers
According to all the monitoring tools we have at our disposal, the main web-page loads in ~4s.
At the time of writing this issue, monitoring shows that the main webpage failed to send a response 17 times in the last 24h:
Google's indexer considers that only 0.7% of the URLs it visits load "fast". Besides, contributors/users regularly send us messages on Slack or the Play Store saying the app/website doesn't work.
Investigation
On 2023-07-26, I noticed Bingbot requests accounted for ~6% of all our traffic. Half of these queries were facet queries, which trigger a MongoDB aggregation query. Bingbot made between 100 and 170 requests per minute. As a result, we decided to temporarily ban all Bingbot traffic through iptables rules.
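The traffic shares above come from server logs; a minimal sketch of that kind of measurement (assuming a combined-format access log where the user agent is the last quoted field — this is not the actual tooling we used):

```python
"""Rough sketch: estimate the share of requests made by given crawlers
from an nginx/Apache access log in combined format."""
import re
from collections import Counter

UA_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent


def bot_shares(lines, bots=("Googlebot", "bingbot")):
    """Return the fraction of requests attributed to each crawler name."""
    counts = Counter()
    total = 0
    for line in lines:
        m = UA_RE.search(line)
        if not m:
            continue  # skip malformed lines
        total += 1
        ua = m.group(1).lower()
        for bot in bots:
            if bot.lower() in ua:
                counts[bot] += 1
                break
    return {bot: counts[bot] / total for bot in bots} if total else {}
```

Running this over a day of access logs and comparing the shares per crawler is enough to spot a bot at 100-170 requests/minute.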
Every time I checked (on different dates), Googlebot accounted for ~30-40% of our traffic (!). Googlebot therefore crawls our website continuously, without stopping.
Google performed fewer facet queries than Bingbot, but they were still a significant proportion of the total. It's therefore likely that Googlebot (and possibly other crawlers) is responsible for the degraded website performance.
Despite this very high pressure on our servers, only 1.63M webpages are indexed on world.openfoodfacts.org. We have 2.9 million products worldwide, so roughly 44% of our products are not indexed.
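The coverage figure follows directly from those two numbers:

```python
# Indexed pages vs. total products (figures from this issue)
indexed = 1.63e6  # pages indexed on world.openfoodfacts.org
total = 2.9e6     # products worldwide
not_indexed_share = 1 - indexed / total
print(f"{not_indexed_share:.1%}")  # ~43.8% of products not indexed
```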
Besides, facet pages are not well indexed either:
Number of pages indexed on world.openfoodfacts.org/{tagname}/* on Google:

- category: 5,980 (of 50k)
- label: 2,300 (of 25k)
- ingredient: 2,970 (of 1.2M)
- brand: 8,650 (of 167k)
- additive: 2,980 (of 583 -> multiple pages for the same additive)
- packaging: 1,100 (of 25k)
Mitigation
After banning Bingbot, we decided to return an empty HTML page with a "noindex" header, only for crawl bots, on specific pages:

- nested facet pages (e.g. /brand/nutella/editor/raphael)
- packager code pages (/packager-code/*)

For the rest, we only allow indexing of the most interesting facets, with a limited number of elements. In any case, we only allow the 1st page of each facet to be indexed. An HTTP query is still sent by the bot to fetch the page, but processing is almost instant on our side, as we return a static page.
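The server-side logic amounts to a user-agent check plus a path check. A minimal Python sketch of the idea (Product Opener is written in Perl; the crawler list, path patterns, and page below are illustrative assumptions, not the actual implementation):

```python
import re

# Crawler user agents to intercept (illustrative list)
CRAWLER_RE = re.compile(r"Googlebot|bingbot|YandexBot|Baiduspider", re.I)

# Paths we do not want indexed (assumed patterns): nested facets such as
# /brand/nutella/editor/raphael, and /packager-code/*
DENY_RE = re.compile(r"^/[^/]+/[^/]+/[^/]+/|^/packager-code/")

# Minimal static page carrying a noindex robots directive
NOINDEX_PAGE = (
    '<html><head><meta name="robots" content="noindex"></head>'
    "<body></body></html>"
)


def response_for(path: str, user_agent: str):
    """Return (served_static, body): a static noindex page for crawlers on
    denied paths; (False, None) means normal (expensive) rendering."""
    if CRAWLER_RE.search(user_agent) and DENY_RE.match(path):
        return True, NOINDEX_PAGE
    return False, None
```

Regular users still get the full page; only known crawlers hitting denied paths get the cheap static response.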
These mitigation measures may not be sufficient, as we have 170 countries * hundreds of languages (examples of subdomains: world-fr, fr-en, ...) that are all potentially accessible to web crawlers.
Even so, if we avoid most of the aggregated queries, we should notice performance improvements.
Other measures to consider:
- Adding `rel=nofollow` attributes, to discourage crawlers from following internal links we do not want them to crawl (especially facet pages). We already do this for some facet pages (page >= 2). @alexgarel pointed out that it may not be a good idea, as it signals that the URL is low quality; to investigate.

Part of #8764