simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Manage /robots.txt in Datasette core, block robots by default #1426

Open simonw opened 2 years ago

simonw commented 2 years ago

See accompanying Twitter thread: https://twitter.com/simonw/status/1424820203603431439

Datasette currently has a plugin for configuring robots.txt (https://datasette.io/plugins/datasette-block-robots), but I'm beginning to think this should be part of core, with crawlers blocked by default - having people explicitly opt in to their sites being crawled and indexed feels a lot safer.

I have a lot of Datasettes deployed now, and tailing logs shows that they are being hammered by search engine crawlers even though many of them are not interesting enough to warrant indexing.

I'm starting to think blocking crawlers would actually be a better default for most people, provided it was well documented and easy to understand how to allow them.

Default-deny is usually a better policy than default-allow!
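
For reference, the default-deny policy is the smallest possible robots.txt:

User-agent: *
Disallow: /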

simonw commented 2 years ago

A few options for how this would work:

The "limited" mode is particularly interesting. Could even make it the default, but I think that may be a bit too confusing. Idea would be to get the key pages indexed but use nofollow to discourage crawlers from indexing individual row pages or deep pages like https://datasette.io/content/repos?_facet=owner&_facet=language&_facet_array=topics&topics__arraycontains=sqlite#facet-owner.

simonw commented 2 years ago

I could try out the X-Robots-Tag HTTP header too: https://developers.google.com/search/docs/advanced/robots/robots_meta_tag#xrobotstag
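
A rough sketch of how that header could be added across the board, written as a plugin against the existing asgi_wrapper hook (illustrative only, not a committed design):

from datasette import hookimpl

@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def add_x_robots_tag(scope, receive, send):
            async def wrapped_send(event):
                if event["type"] == "http.response.start":
                    # Append X-Robots-Tag to every outgoing response
                    headers = list(event.get("headers", []))
                    headers.append((b"x-robots-tag", b"noindex, nofollow"))
                    event = dict(event, headers=headers)
                await send(event)

            await app(scope, receive, wrapped_send)

        return add_x_robots_tag

    return wrap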

simonw commented 2 years ago

https://twitter.com/mal/status/1424825895139876870

True pinging google should be part of the build process on a static site :)

That's another aspect of this: if you DO want your site crawled, teaching the datasette publish command how to ping Google when a deploy has gone out could be a nice improvement.

Annoyingly it looks like you need to configure an auth token of some sort in order to use their API though, which is likely too much hassle to be worth building into Datasette itself: https://developers.google.com/search/apis/indexing-api/v3/using-api

curl -X POST https://indexing.googleapis.com/v3/urlNotifications:publish -d '{
  "url": "https://careers.google.com/jobs/google/technical-writer",
  "type": "URL_UPDATED"
}' -H "Content-Type: application/json"

{
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
    "status": "UNAUTHENTICATED"
  }
}

simonw commented 2 years ago

At the very least Datasette should serve a blank /robots.txt by default - I'm seeing a ton of 404s for it in the logs.
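
A minimal sketch of what that could look like (whether in core or as a tiny plugin), assuming the register_routes hook:

from datasette import hookimpl
from datasette.utils.asgi import Response

async def robots_txt():
    # Empty body: crawlers get a 200 instead of filling the logs with 404s
    return Response.text("")

@hookimpl
def register_routes():
    return [(r"^/robots\.txt$", robots_txt)]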

simonw commented 2 years ago

Actually it looks like you can send a sitemap.xml to Google using an unauthenticated GET request to:

https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP

According to https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap

simonw commented 2 years ago

Bing's equivalent is: https://www.bing.com/webmasters/help/Sitemaps-3b5cf6ed

http://www.bing.com/ping?sitemap=FULL_URL_OF_SITEMAP
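
If datasette publish did learn to do this, the deploy step could boil down to two unauthenticated GETs - a sketch (the example sitemap URL is hypothetical, and the ping endpoints are outside Datasette's control):

import urllib.parse
import urllib.request

PING_TEMPLATES = [
    "https://www.google.com/ping?sitemap={}",
    "http://www.bing.com/ping?sitemap={}",
]

def ping_search_engines(sitemap_url):
    # Each engine accepts a plain GET with the sitemap URL as a query parameter
    for template in PING_TEMPLATES:
        ping_url = template.format(urllib.parse.quote(sitemap_url, safe=""))
        with urllib.request.urlopen(ping_url) as response:
            print(ping_url, response.status)

# e.g. ping_search_engines("https://example.com/sitemap.xml")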

simonw commented 2 years ago

I was worried about whether it's possible to allow access to /fixtures but deny access to /fixtures?sql=...

From various answers on Stack Overflow it looks like this should handle that:

User-agent: *
Disallow: /fixtures?

I could use this for tables too - it may well be OK to access table index pages while still avoiding pagination, facets etc. I think this should block both query strings and row pages while allowing the table page itself:

User-agent: *
Disallow: /fixtures/searchable?
Disallow: /fixtures/searchable/*

Could even accompany that with a sitemap.xml that explicitly lists all of the tables - which would mean adding sitemaps to Datasette core too.
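
A sitemap plugin along those lines might look something like this - a sketch that assumes the register_routes hook plus datasette.databases, db.table_names() and datasette.urls.table():

from datasette import hookimpl
from datasette.utils.asgi import Response

async def sitemap_xml(datasette, request):
    # List the table page for every table in every attached database
    locs = []
    for name, db in datasette.databases.items():
        for table in await db.table_names():
            path = datasette.urls.table(name, table)
            locs.append(datasette.absolute_url(request, path))
    lines = ['<?xml version="1.0" encoding="UTF-8"?>']
    lines.append('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
    for loc in locs:
        lines.append("  <url><loc>{}</loc></url>".format(loc))
    lines.append("</urlset>")
    return Response("\n".join(lines), content_type="application/xml")

@hookimpl
def register_routes():
    return [(r"^/sitemap\.xml$", sitemap_xml)]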

tannewt commented 2 years ago

I think another thing would be to make /pages/robots.txt work. That way you can use Jinja to generate whatever robots.txt you want. I'm using it to allow the main index and the pages it links to to be crawled (but not the database pages directly).

knowledgecamp12 commented 2 years ago

You can generate an XML sitemap with an online tool such as https://tools4seo.site/xml-sitemap-generator.

louispotok commented 4 days ago

Upvoting this, at least the "limited" option described above. This is really easy to overlook when deploying, and it's hard to imagine a use case where you'd want a crawler to visit every row-level page.

At the very least, maybe mention this in the deployment instructions?