openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

🕷️ The web crawler issue #8751

Closed raphael0202 closed 1 year ago

raphael0202 commented 1 year ago

Context

Over the past few years, we've had an ongoing performance issue: the website is slow, often times out, and regularly prevents contributors from updating products. These performance issues can be attributed partly to undersized servers (for example, performance has degraded since the temporary migration of MongoDB to a smaller server), but @stephanegigandet suspected early on that the problem originates from requests involving MongoDB aggregation queries (such as the ones behind facet pages).

After a contributor couldn't save any of his changes, I wanted to investigate the reasons behind this poor performance more precisely.

Some numbers

According to all the monitoring tools we have at our disposal, the main web page loads in ~4 s.

[Screenshot, 2023-07-28 16-33-22]

At the time of writing this issue, monitoring shows that the main web page failed to send a response 17 times in the last 24 h: [Screenshot, 2023-07-28 16-36-12]

Google's indexer considers that only 0.7% of the URLs it visited load "fast". In addition, contributors and users regularly message us on Slack or on the Play Store to say the app/website doesn't work.

Investigation

On 2023-07-26, I noticed that Bingbot requests accounted for ~6% of all our traffic. Half of these requests were facet queries, which trigger a MongoDB aggregation query. Bingbot made between 100 and 170 requests per minute. As a result, we decided to temporarily ban all Bingbot traffic through iptables rules.

Every time I checked (on different dates), Googlebot accounted for ~30-40% of our traffic (!). Googlebot therefore crawls our website continuously, without stopping.

Googlebot performed fewer facet queries than Bingbot, but they were still a significant proportion of its total. It's therefore likely that Googlebot (and possibly other crawlers) is responsible for the degraded website performance.
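The traffic shares quoted above can be estimated from the web server access logs. The following is a minimal sketch, not the tooling actually used for this investigation. It assumes the logs are in the nginx/Apache "combined" format and that facet URLs start with a tag name such as /category/. `BOT_MARKERS`, `FACET_PREFIXES`, `classify` and `traffic_summary` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, i.e.
# ip - user [date] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical lists, for illustration only.
BOT_MARKERS = {"googlebot": "Googlebot", "bingbot": "Bingbot"}
FACET_PREFIXES = (
    "/category/", "/label/", "/ingredient/",
    "/brand/", "/additive/", "/packaging/",
)


def classify(line):
    """Return (bot_name, is_facet_request), or None if the line cannot be parsed."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    user_agent = match.group("ua").lower()
    bot = next(
        (name for marker, name in BOT_MARKERS.items() if marker in user_agent),
        "other",
    )
    return bot, match.group("path").startswith(FACET_PREFIXES)


def traffic_summary(lines):
    """Per client group: share of all parsed requests, and share of its requests hitting facets."""
    total = 0
    per_bot = Counter()
    facet_per_bot = Counter()
    for line in lines:
        result = classify(line)
        if result is None:
            continue  # skip lines that do not match the assumed format
        bot, is_facet = result
        total += 1
        per_bot[bot] += 1
        if is_facet:
            facet_per_bot[bot] += 1
    return {
        bot: {"share": count / total, "facet_share": facet_per_bot[bot] / count}
        for bot, count in per_bot.items()
    }
```

Calling `traffic_summary(open("access.log"))` (the file name is a placeholder) would give, for each group, the fraction of total traffic and the fraction of that group's requests that target facet pages. User-agent strings can be spoofed, so figures obtained this way are only an approximation.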

Despite this very high pressure on our server, only 1.63 M web pages are indexed on world.openfoodfacts.org. We have 2.9 million products worldwide, so about 44% of our products are not indexed, even if every indexed page were a product page.

Besides, facet pages are not well indexed either:

Number of pages indexed on world.openfoodfacts.org/{tagname}/* on Google:
- category: 5,980 (out of 50k)
- label: 2,300 (out of 25k)
- ingredient: 2,970 (out of 1.2M)
- brand: 8,650 (out of 167k)
- additive: 2,980 (out of 583 -> multiple pages for the same additive)
- packaging: 1,100 (out of 25k)
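To make these figures easier to compare, the indexing coverage can be computed directly from the numbers quoted in this issue. This is only a small arithmetic sketch: the variable and function names are illustrative, and "1.2M", "50k", etc. are taken as round numbers.

```python
# Figures copied from the issue text (rounded as given there).
indexed_pages = 1_630_000
total_products = 2_900_000

# (pages indexed by Google, number of distinct tags) per facet type
facets = {
    "category": (5_980, 50_000),
    "label": (2_300, 25_000),
    "ingredient": (2_970, 1_200_000),
    "brand": (8_650, 167_000),
    "additive": (2_980, 583),
    "packaging": (1_100, 25_000),
}


def coverage(indexed, total):
    """Ratio of indexed pages to existing items (can exceed 1 if an item has several pages)."""
    return indexed / total


# Upper bound on product coverage: it assumes every indexed page is a product page.
product_coverage = coverage(indexed_pages, total_products)
print(f"products   {product_coverage:8.2%} indexed at most")

for name, (indexed, total) in facets.items():
    print(f"{name:10s} {coverage(indexed, total):8.2%}")
```

The additive ratio comes out above 100% because, as noted in the list, several pages are indexed for the same additive. Every other facet type stays well below 15%.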

Mitigation

After banning Bingbot, we decided to return an empty HTML page with a "noindex" header, only to crawler bots and only on specific pages:

The bot still sends an HTTP request to fetch the page in any case, but processing is almost instant on our side, as we return a static page.

These mitigation measures may not be sufficient, as we have 170 countries * hundreds of languages (e.g. subdomains such as world-fr, fr-en, ...), all of which are potentially accessible to web crawlers.

If we avoid most of the aggregation queries, we should notice performance improvements in any case.
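The idea of short-circuiting crawler requests can be sketched as follows. This is an illustration only, not the project's actual implementation (the server is written in Perl). It assumes that crawlers are recognized by their User-Agent string and that the "specific pages" are facet pages identified by a path prefix. `CRAWLER_RE`, `FACET_PREFIXES`, `NOINDEX_BODY` and `crawler_response` are hypothetical names. `X-Robots-Tag: noindex` is a response header that major search engines honor, but the issue does not say whether a header or a meta tag was used, so the sketch sets both.

```python
import re

# Assumption: crawlers are identified by substrings of the User-Agent header.
CRAWLER_RE = re.compile(r"googlebot|bingbot", re.IGNORECASE)

# Assumption: the pages to short-circuit are facet pages under these prefixes.
FACET_PREFIXES = (
    "/category/", "/label/", "/ingredient/",
    "/brand/", "/additive/", "/packaging/",
)

# Static, nearly empty page: no database query is needed to produce it.
NOINDEX_BODY = (
    "<!DOCTYPE html><html><head>"
    '<meta name="robots" content="noindex">'
    "</head><body></body></html>"
)


def crawler_response(path, user_agent):
    """Return (status, headers, body) for a static noindex page,
    or None when the request should be processed normally."""
    if not CRAWLER_RE.search(user_agent or ""):
        return None  # regular visitor: normal processing
    if not path.startswith(FACET_PREFIXES):
        return None  # crawlers may still fetch other pages, e.g. product pages
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "X-Robots-Tag": "noindex",
    }
    return 200, headers, NOINDEX_BODY
```

A request handler would call `crawler_response` before doing any work and, if the result is not None, send it back immediately, so the MongoDB aggregation query is never run for that request.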

Other measures to consider:

raphael0202 commented 1 year ago

The most impactful fixes have been deployed (language-dependent robots.txt + noindex pages), so this issue can be closed once we're sure they are effective.
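A language-dependent robots.txt can be generated from a per-language list of path names. The sketch below is hypothetical: the issue does not state which paths the deployed robots.txt disallows, so the `FACET_PATHS` mapping (including the French path name) and the `robots_txt` function are assumptions made for illustration.

```python
# Hypothetical mapping from language code to localized facet path names.
FACET_PATHS = {
    "en": ["category", "label", "ingredient", "brand"],
    "fr": ["categorie", "label", "ingredient", "marque"],
}


def robots_txt(lang, user_agent="*"):
    """Build a robots.txt body that disallows facet paths for the given language.

    Falls back to the English path names when the language is unknown.
    """
    lines = [f"User-agent: {user_agent}"]
    for tag in FACET_PATHS.get(lang, FACET_PATHS["en"]):
        lines.append(f"Disallow: /{tag}/")
    return "\n".join(lines) + "\n"


print(robots_txt("fr"))
```

Each subdomain would then serve the variant matching its language, so that crawlers following the robots exclusion rules stop requesting the disallowed facet URLs at all.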

raphael0202 commented 1 year ago

Crawlers now account for ~15% of requests. The total number of requests made by Google seems to be stabilizing at a lower level: [Screenshot, 2023-09-05 11-59-44]

Closing this issue.