extra content on publishing big open data in small fragments

In https://github.com/theodi/big-data-publishing/blob/master/guide/big-data-publishing.md#publishing-big-open-data-in-small-fragments I think it would be well worth exploring the issues around supporting scraping from a website. In particular, you should mention using robots.txt, sitemaps and canonical URLs to ensure that spiders are directed towards the data that they're actually interested in.

From experience with legislation.gov.uk, we had problems with (unintentional) DDoS attacks from spiders that didn't respect those hints and gathered material indiscriminately rather than focusing on formats they understood (e.g. they would fetch the RDF, PDF and XML versions of each page rather than just the HTML, which was all they really cared about).
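To make the robots.txt suggestion concrete, here is a minimal sketch of the kind of hints I mean, checked with Python's standard `urllib.robotparser`. The host `data.example.org` and the layout with one path prefix per format (`/html/`, `/rdf/`, `/pdf/`, `/xml/`) are assumptions for illustration only, not how legislation.gov.uk is actually laid out. I've used plain path prefixes because, as far as I know, the standard library parser only does prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: steer generic spiders towards the HTML pages
# and point them at the sitemap index. All paths and the host are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /rdf/
Disallow: /pdf/
Disallow: /xml/
Sitemap: https://data.example.org/sitemap-index.xml
"""


def build_parser(text):
    """Parse robots.txt content from a string rather than fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved spider would run checks like these before each request.
html_allowed = parser.can_fetch("ExampleBot", "https://data.example.org/html/item/1")
rdf_allowed = parser.can_fetch("ExampleBot", "https://data.example.org/rdf/item/1")
sitemaps = parser.site_maps()  # available from Python 3.8
```

Of course this only helps with spiders that honour the hints, which is exactly the problem described above, so it is worth pairing with rate limiting on the server side.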

The sitemaps were useful though. It's a bit of an art to generate them because of the limitations on size that they have (you can have a sitemap of sitemaps, but only one level deep, and both the sitemap of sitemaps and the sitemaps themselves are limited in number of entries).

It's also worth mentioning the pattern of publishing a dump supplemented by a feed of changes as a way of managing rapidly changing datasets. Perhaps I just haven't got to that yet.
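For what it's worth, the consumer side of that pattern can be sketched roughly as below: load the dump, note the position it was taken at, then replay only the feed entries that came after it. The record shape (`seq`, `op`, `id`, `data`) is entirely hypothetical. A real feed might be Atom or similar, with timestamps rather than sequence numbers.

```python
def apply_changes(snapshot, changes):
    """Bring a dump up to date by replaying feed entries newer than it.

    `snapshot` is {"position": int, "records": {id: data}} and each change is
    {"seq": int, "op": "upsert" | "delete", "id": str, "data": ...}.
    Both shapes are assumptions made for this sketch.
    """
    records = dict(snapshot["records"])  # leave the original dump untouched
    position = snapshot["position"]
    for change in sorted(changes, key=lambda c: c["seq"]):
        if change["seq"] <= position:
            continue  # already reflected in the dump
        if change["op"] == "delete":
            records.pop(change["id"], None)
        else:
            records[change["id"]] = change["data"]
        position = change["seq"]
    return {"position": position, "records": records}


dump = {"position": 10, "records": {"a": {"title": "Act A"}, "b": {"title": "Act B"}}}
feed = [
    {"seq": 9, "op": "upsert", "id": "a", "data": {"title": "stale"}},
    {"seq": 11, "op": "upsert", "id": "c", "data": {"title": "Act C"}},
    {"seq": 12, "op": "delete", "id": "b", "data": None},
]
current = apply_changes(dump, feed)
```

The point for publishers is that the dump needs to state the position it was taken at, otherwise consumers can't tell which feed entries they still need to apply.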
