opendatacube / datacube-explorer

Web-based exploration of Open Data Cube collections
Apache License 2.0

Keep Away Well-behaved robots #522

Open whatnick opened 1 year ago

whatnick commented 1 year ago

Supply robots header tags on HTML pages, a top-level robots.txt, and HTTP headers on STAC responses to prevent excessive crawling and the associated DB load.

Guidance for implementing these recommendations: https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
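
For context, the linked page describes two mechanisms: a robots meta tag for HTML pages, and an X-Robots-Tag HTTP response header for non-HTML responses such as STAC JSON. As an illustration only (noindex, nofollow here is an example value, not a proposal for which directives Explorer should emit):

<meta name="robots" content="noindex, nofollow">

X-Robots-Tag: noindex, nofollow

The meta tag goes in the page's <head>; the header is set on the HTTP response, which is what makes it usable for the STAC JSON endpoints.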

JonDHo commented 1 year ago

The question is whether you want to block all (well-behaved) bots at every level. The problem I have seen arises when bots start hitting the individual day pages. If you want the top-level products to be discoverable via search engines, then blocking is perhaps only needed at certain levels: the first level below /products/ could be allowed and everything lower down disallowed.

JonDHo commented 1 year ago

I have just finished implementing a robots.txt (by adding a suitable ingress with a fixed response) and can confirm that keeping bots out resulted in a significant improvement in DB load (using AWS RDS Serverless).

The image below shows the effect of adding the robots.txt to explorer on the backend DB load. DB usage drops from around 1.5-2 Aurora capacity units (ACU) to just above 0.5. The minimum for this DB is set to 0.5.

[image: graph of Aurora Serverless capacity usage dropping after the robots.txt was added]
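
For anyone wanting to reproduce this, one possible shape for "an ingress with a fixed response" on the ingress-nginx controller is a server-snippet that answers /robots.txt before the request ever reaches Explorer. This is only a sketch: the Ingress name, host, and backend service below are placeholders, the annotation is specific to ingress-nginx (and may require allow-snippet-annotations to be enabled on the controller), and the robots.txt body should be whatever policy is agreed in this issue.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: explorer                  # placeholder: your existing Explorer ingress
  annotations:
    # Serve /robots.txt straight from the ingress so crawler requests never hit the app or DB
    nginx.ingress.kubernetes.io/server-snippet: |
      location = /robots.txt {
        default_type text/plain;
        return 200 "User-Agent: *\nAllow: /\nDisallow: /products/*/*\n";
      }
spec:
  ingressClassName: nginx
  rules:
    - host: explorer.example.com  # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: explorer    # placeholder service name
                port:
                  number: 80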
omad commented 1 year ago

> If you want the top-level products to be discoverable via search engines, then blocking is perhaps only needed at certain levels: the first level below /products/ could be allowed and everything lower down disallowed.

This sounds like a sensible compromise to me. Is this what you implemented?

> I have just finished implementing a robots.txt (by adding a suitable ingress with a fixed response) and can confirm that keeping bots out resulted in a significant improvement in DB load (using AWS RDS Serverless).

Could you share a copy of the robots.txt you created? Even if we don't put it into the Explorer code, it would be great to have as documentation.

@JonDHo

JonDHo commented 1 year ago

I am currently using the example below. It permits access to all general pages, including the top-level product pages, but none of the year, month, day, or dataset pages:

User-Agent: *
Allow: /
Disallow: /products/*/*

See: https://explorer.datacubechile.cl/robots.txt

JonDHo commented 2 weeks ago

Just an additional comment on this after the latest PR: I would also recommend adding

Disallow: /dataset/*

to the default robots.txt. Bots hitting each individual dataset have been a big problem for me and the /dataset/* pages redirect to /products/ but they are valid URLs. I have actually now gone to the extent of disallowing everything as there isn't much benefit in having even the product pages discoverable via search engines. I would rather have my project website be the entry point for users searching the web, not explorer.