Open · simonw opened this issue 2 years ago
A few options for how this would work:

```
datasette ... --robots allow
datasette ... --setting robots allow
```
Options could be:

- `allow` - allow all crawling
- `deny` - deny all crawling
- `limited` - allow access to the homepage and the index pages for each database and each table, but disallow crawling any further than that

The "limited" mode is particularly interesting. Could even make it the default, but I think that may be a bit too confusing. The idea would be to get the key pages indexed but use `nofollow` to discourage crawlers from indexing individual row pages or deep pages like https://datasette.io/content/repos?_facet=owner&_facet=language&_facet_array=topics&topics__arraycontains=sqlite#facet-owner - see the sketch below.
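For illustration, here is a hedged sketch of what "limited" mode might serve for the datasette.io example above (database `content`, table `repos`); the exact rules are undecided, this just reuses the prefix patterns explored further down in this issue:

```
User-agent: *
Disallow: /content?
Disallow: /content/repos?
Disallow: /content/repos/*
```

This would leave `/`, `/content` and `/content/repos` crawlable while blocking query-string variants and individual row pages.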
I could try out the `X-Robots-Tag` HTTP header too: https://developers.google.com/search/docs/advanced/robots/robots_meta_tag#xrobotstag
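A minimal sketch of how that header could be attached, written as a plain ASGI wrapper (the `noindex, nofollow` value and applying it to every response are illustrative assumptions, not a decided design):

```python
def x_robots_wrapper(app):
    """Wrap an ASGI app so every HTTP response carries an X-Robots-Tag header."""

    async def wrapped(scope, receive, send):
        async def send_with_header(event):
            if event["type"] == "http.response.start":
                # Copy the existing headers and append the robots directive.
                headers = list(event.get("headers", [])) + [
                    (b"x-robots-tag", b"noindex, nofollow")
                ]
                event = {**event, "headers": headers}
            await send(event)

        if scope["type"] == "http":
            await app(scope, receive, send_with_header)
        else:
            await app(scope, receive, send)

    return wrapped
```

Datasette's `asgi_wrapper()` plugin hook would be a natural place to register something like this.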
https://twitter.com/mal/status/1424825895139876870
> True, pinging Google should be part of the build process on a static site :)
That's another aspect of this: if you DO want your site crawled, teaching the `datasette publish` command how to ping Google when a deploy has gone out could be a nice improvement.
Annoyingly it looks like you need to configure an auth token of some sort in order to use their API though, which is likely too much hassle to be worth building into Datasette itself: https://developers.google.com/search/apis/indexing-api/v3/using-api
```
curl -X POST https://indexing.googleapis.com/v3/urlNotifications:publish -d '{
  "url": "https://careers.google.com/jobs/google/technical-writer",
  "type": "URL_UPDATED"
}' -H "Content-Type: application/json"
```

Which returns:

```json
{
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
    "status": "UNAUTHENTICATED"
  }
}
```
At the very least Datasette should serve a blank `/robots.txt` by default - I'm seeing a ton of 404s for it in the logs.
Actually it looks like you can send a `sitemap.xml` to Google using an unauthenticated GET request to:

```
https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP
```

According to https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap

Bing's equivalent, documented at https://www.bing.com/webmasters/help/Sitemaps-3b5cf6ed, is:

```
http://www.bing.com/ping?sitemap=FULL_URL_OF_SITEMAP
```
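A hedged sketch of what a post-deploy ping could look like using those two endpoints (the sitemap URL is a placeholder, and error handling is omitted):

```python
import urllib.parse
import urllib.request

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder for the deployed sitemap

# Both endpoints accept an unauthenticated GET with the sitemap URL
# passed as a query-string parameter.
for endpoint in (
    "https://www.google.com/ping?sitemap=",
    "https://www.bing.com/ping?sitemap=",
):
    ping_url = endpoint + urllib.parse.quote(SITEMAP_URL, safe="")
    with urllib.request.urlopen(ping_url) as response:
        print(ping_url, response.status)
```

Something like this could slot into `datasette publish` as a post-deploy step.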
I was worried about whether it's possible to allow access to `/fixtures` but deny access to `/fixtures?sql=...`

From various answers on Stack Overflow it looks like this should handle that:

```
User-agent: *
Disallow: /fixtures?
```
I could use this for tables too - it may well be OK for crawlers to access table index pages while still avoiding pagination, facets etc. I think this should block both query strings and row pages while allowing the table page itself:

```
User-agent: *
Disallow: /fixtures/searchable?
Disallow: /fixtures/searchable/*
```
Could even accompany that with a `sitemap.xml` that explicitly lists all of the tables - which would mean adding sitemaps to Datasette core too.
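For illustration, a hedged sketch of how such a sitemap could be rendered from a list of table URLs (the `build_sitemap()` helper and the URLs are hypothetical, not an existing Datasette API):

```python
from xml.sax.saxutils import escape


def build_sitemap(urls):
    """Render a minimal sitemap.xml for the given absolute URLs."""
    entries = "\n".join(f"  <url><loc>{escape(url)}</loc></url>" for url in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>"
    )


# Hypothetical table pages for a deployed instance:
print(build_sitemap([
    "https://example.com/fixtures",
    "https://example.com/fixtures/searchable",
]))
```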
I think another thing would be to make `/pages/robots.txt` work. That way you can use Jinja to generate a desired `robots.txt`. I'm using it to allow the main index page and everything it links to to be crawled (but not the database pages directly).
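A hedged sketch of what such a template might look like (the `database_names` template variable is an assumption for illustration, not a documented part of the template context):

```
User-agent: *
{% for name in database_names %}
Disallow: /{{ name }}
{% endfor %}
```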
You can generate an XML sitemap using online tools such as https://tools4seo.site/xml-sitemap-generator.
Upvoting this, at least the `limited` option described above. This is really easy to overlook when deploying, and it's hard to imagine a use case where you'd want a crawler to view all the row-level pages.

At the very least, maybe mention this in the deployment instructions?
See accompanying Twitter thread: https://twitter.com/simonw/status/1424820203603431439
I have a lot of Datasettes deployed now, and tailing logs shows that they are being hammered by search engine crawlers even though many of them are not interesting enough to warrant indexing.
I'm starting to think blocking crawlers would actually be a better default for most people, provided it was well documented and easy to understand how to allow them.
Default-deny is usually a better policy than default-allow!