🕷️ The web crawler issue

raphael0202 commented 1 year ago

Context

Over the past few years, we've had an ongoing performance issue: the website is slow, often times out, and regularly prevents contributors from updating products. These performance issues can be attributed partly to undersized servers (for example, performance has been degraded since the temporary migration of MongoDB to a smaller server), but @stephanegigandet suspected early on that the performance issue originates from requests involving MongoDB aggregation queries (such as the ones we find on facets).

After a contributor couldn't save any of their changes, I wanted to investigate more precisely the reasons behind this poor performance.

Some numbers

According to all the monitoring tools we have at our disposal, the main web page loads in ~4s.

Screenshot from 2023-07-28 16-33-22

At the time of writing this issue, monitoring shows that the main web page failed to send a response 17 times in the last 24h (screenshot from 2023-07-28 16-36-12).

Google's indexer considers that only 0.7% of the URLs it visits load "fast". Besides, contributors and users regularly send us messages on Slack or on the Play Store saying the app/website doesn't work.

Investigation

On 2023-07-26, I noticed that Bingbot requests accounted for ~6% of all our traffic. Half of these requests were facet queries, which trigger a MongoDB aggregation query. Bingbot made between 100 and 170 requests per minute. As a result, we decided to temporarily ban all Bingbot traffic through iptables rules.
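For reference, figures like these can be derived from access logs with a short script. The following is only a minimal sketch, not the script actually used: it assumes a "combined"-style access log, and the user-agent tokens, the facet URL prefixes and the `traffic_share` helper are all illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: crawlers are recognised by a token in their user-agent string.
BOT_MARKERS = {"bingbot": "Bingbot", "googlebot": "Googlebot"}

# Assumption: "combined" log layout:
# ... "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Illustrative facet URL prefixes (not an exhaustive list).
FACET_PREFIXES = ("/category/", "/brand/", "/label/", "/ingredient/", "/packager-code/")


def classify(user_agent):
    """Return the crawler name for a user-agent string, or "other"."""
    ua = user_agent.lower()
    for marker, name in BOT_MARKERS.items():
        if marker in ua:
            return name
    return "other"


def traffic_share(lines):
    """Map each client class to (share of all requests, share of its requests hitting facets)."""
    totals = Counter()
    facets = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        client = classify(match.group("ua"))
        totals[client] += 1
        if match.group("path").startswith(FACET_PREFIXES):
            facets[client] += 1
    n = sum(totals.values())
    return {c: (totals[c] / n, facets[c] / totals[c]) for c in totals}
```

Calling `traffic_share(open("access.log"))` (file name hypothetical) returns, for each client class, its share of total traffic and the fraction of its requests that target facet pages.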

Every time I checked (on different dates), Googlebot accounted for ~30-40% of our traffic (!). Googlebot therefore crawls our website continuously, without stopping.

Googlebot performed fewer facet queries than Bingbot, but they were still a significant proportion of the total. It's therefore likely that Googlebot (and possibly other crawlers) is responsible for the degraded website performance.

Despite this very high pressure on our server, only 1.63M web pages are indexed on world.openfoodfacts.org. We have 2.9 million products worldwide, so at least ~44% of our products are not indexed.

Besides, facet pages are not well indexed either.

Number of pages indexed on Google under world.openfoodfacts.org/{tagname}/*:

category: 5980 (out of 50k)
label: 2300 (out of 25k)
ingredient: 2970 (out of 1.2M)
brand: 8650 (out of 167k)
additive: 2980 (out of 583 -> multiple pages for the same additive)
packaging: 1100 (out of 25k)
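To make the coverage explicit, the ratios can be computed directly from the numbers quoted in this issue. This is a small sketch; the variable names and the `coverage` helper are illustrative only. Additives are left out because the indexed count exceeds the number of additives (several pages per additive), so a ratio would be meaningless.

```python
# Numbers quoted in this issue (approximate).
indexed_pages = 1_630_000
products = 2_900_000

# Lower bound on the share of products not indexed: indexed pages also
# include non-product pages, so the real share can only be higher.
not_indexed_share = 1 - indexed_pages / products

# facet type -> (pages indexed on Google, number of facet values)
facets = {
    "category": (5980, 50_000),
    "label": (2300, 25_000),
    "ingredient": (2970, 1_200_000),
    "brand": (8650, 167_000),
    "packaging": (1100, 25_000),
}


def coverage(indexed, total):
    """Fraction of facet values that have an indexed page."""
    return indexed / total


if __name__ == "__main__":
    print(f"products not indexed: at least {not_indexed_share:.1%}")
    for name, (indexed, total) in facets.items():
        print(f"{name}: {coverage(indexed, total):.2%} indexed")
```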

Mitigation

After banning Bingbot, we decided to return an empty HTML page with a "noindex" header, only to crawler bots and only on specific pages:

nested facet pages (e.g. /brand/nutella/editor/raphael)
most facet pages (e.g. /packager-code/*): we only allow indexing of the most interesting facets with a limited number of elements. In any case, we only allow the 1st page of each facet to be indexed.

The bot still sends an HTTP request to fetch the page in any case, but processing is almost instant on our side, as we return a static page.
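The rule described above can be sketched as follows. This is not the server's actual implementation, only an illustration under assumptions: the crawler tokens, the facet allowlist, the URL convention for pagination (`/facet/value/2`) and the `respond` helper are all hypothetical, and the sketch sends the noindex signal both as an `X-Robots-Tag` response header and as a `robots` meta tag.

```python
# Assumption: crawlers are recognised by a token in their user-agent string.
CRAWLER_TOKENS = ("googlebot", "bingbot")

# Hypothetical allowlist of facets worth indexing, and other known facet types.
INDEXABLE_FACETS = {"brand", "category", "label"}
ALL_FACETS = INDEXABLE_FACETS | {"packager-code", "editor", "ingredient", "packaging"}

# Static page served to crawlers instead of running the aggregation query.
NOINDEX_PAGE = (
    '<!DOCTYPE html><html><head><meta name="robots" content="noindex">'
    "</head><body></body></html>"
)


def is_crawler(user_agent):
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def should_noindex(path):
    """True for facet pages that crawlers should not index."""
    segments = [s for s in path.split("/") if s]
    if not segments or segments[0] not in ALL_FACETS:
        return False  # not a facet page (home page, product page, ...)
    if segments[0] not in INDEXABLE_FACETS:
        return True  # facet type not in the allowlist
    # Assumption: anything beyond /facet/value is either a nested facet
    # (/brand/x/editor/y) or a page number (/brand/x/2).
    return len(segments) > 2


def respond(path, user_agent, render_page):
    """Return (status, headers, body); render_page is the expensive path."""
    if is_crawler(user_agent) and should_noindex(path):
        return 200, {"X-Robots-Tag": "noindex"}, NOINDEX_PAGE
    return 200, {}, render_page(path)
```

The point of the design is that the check happens before any database access, so a crawler hitting a blocked facet page costs little more than string matching.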

These mitigation measures may not be sufficient, as we have 170 countries * hundreds of languages (examples of subdomains: world-fr, fr-en,...) that are all potentially accessible to web crawlers.

Even so, if we avoid most of the aggregation queries, we should notice performance improvements.

Other measures to consider:

use more rel=nofollow attributes, to discourage crawlers from following internal links we do not want them to crawl (especially facet pages). We already do this for some facet pages (for page >= 2). @alexgarel pointed out that it may not be a good idea, as it signals that the URL is low quality; to be investigated.
add a sitemap (see https://github.com/openfoodfacts/openfoodfacts-server/pull/2878). Sitemaps allow crawlers to focus on the content we consider important. I would suggest including product pages in the sitemap as a first step, and potentially a few brand/category facets.
block indexation on most lc subdomains (e.g. fr-it, it-es,...). Only keep combinations that make sense given the country code.
add a crawl-delay to robots.txt (https://github.com/openfoodfacts/openfoodfacts-server/issues/7965). After consideration, I'm not convinced it's a good idea, as crawlers don't manage to index every product page even at full speed. We primarily need to stop crawlers from making heavy queries.
make aggregation queries faster (this is something @john-gom is currently investigating).
Part of #8764
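To illustrate the sitemap suggestion, here is a minimal sketch that writes product URLs into sitemap files. It is not taken from the linked PR; the `build_sitemaps` helper and the barcodes in the usage example are made up. The sitemap protocol caps each file at 50,000 URLs, so 2.9 million products would need about 58 files, plus a sitemap index pointing to them (not shown).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # limit set by the sitemap protocol


def build_sitemaps(urls, max_urls=MAX_URLS_PER_FILE):
    """Split a list of URLs into sitemap XML documents (one string per file)."""
    documents = []
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            # Each entry is <url><loc>...</loc></url>
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents


if __name__ == "__main__":
    # Hypothetical barcodes, for illustration only.
    barcodes = ["0000000000017", "0000000000024"]
    urls = [f"https://world.openfoodfacts.org/product/{code}" for code in barcodes]
    for document in build_sitemaps(urls):
        print(document)
```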

raphael0202 commented 1 year ago

The most impactful fixes have been deployed (language-dependent robots.txt + noindex pages), so this issue can be closed once we're sure they are effective.
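For readers wondering what a language-dependent robots.txt can look like, here is a rough sketch of the idea, not the deployed code. The country-to-language mapping, the helper names and the exact rules are hypothetical. Note that `Disallow: /` only stops compliant crawlers from fetching pages on that subdomain.

```python
# Hypothetical mapping: country code -> language codes that make sense for it.
COUNTRY_LANGUAGES = {
    "world": {"en"},
    "fr": {"fr"},
    "it": {"it"},
    "ch": {"de", "fr", "it"},
}


def parse_subdomain(host):
    """Split 'fr-it.openfoodfacts.org' into ('fr', 'it'); language is None if absent."""
    label = host.split(".")[0]
    country, _, language = label.partition("-")
    return country, (language or None)


def indexable(host):
    """Allow crawling only for country/language combinations in the mapping."""
    country, language = parse_subdomain(host)
    if country not in COUNTRY_LANGUAGES:
        return False
    return language is None or language in COUNTRY_LANGUAGES[country]


def robots_txt(host):
    """Return the robots.txt body to serve for this host."""
    if indexable(host):
        return "User-agent: *\nDisallow:\n"  # empty Disallow: nothing is blocked
    return "User-agent: *\nDisallow: /\n"  # block the whole subdomain
```

With this mapping, `robots_txt("fr-it.openfoodfacts.org")` blocks everything, while `robots_txt("ch-it.openfoodfacts.org")` leaves the subdomain open to crawlers.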

raphael0202 commented 1 year ago

Crawlers now account for ~15% of requests. The total number of requests made by Google seems to have stabilized at a lower level (screenshot from 2023-09-05 11-59-44).

Closing this issue.
