rufuspollock / ideas

Ideas for (tech) stuff to research, build or work on.
https://rufuspollock.com/

Pubsubhubbub for data #69

Open psychemedia opened 10 years ago

psychemedia commented 10 years ago

Many government departments and international agencies publish data on a regular basis but don't necessarily provide subscription-based alerts (eg via email) that announce a dataset the moment it is released.

Syndication technologies such as RSS/Atom provide one way of announcing new data releases in a syndicatable and subscribable way: users subscribe to an RSS/Atom feed, their feed reader polls the feed according to some schedule, and the user sees the results in their feed reader (or uses the feed to trigger some other alert).

As well as polling feeds according to a time-based schedule, feed aggregators/readers can be alerted to the fact that a feed needs polling (ie new content has appeared on the feed).

Pubsubhubbub defines a protocol by which publishers can alert hubs as to the release of new content (datasets), hubs can retrieve this content and then alert subscribers.
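As a rough sketch of the publisher side: in the PubSubHubbub 0.3 spec, a publisher notifies a hub with a form-encoded POST carrying `hub.mode=publish` and `hub.url` set to the feed (topic) URL. The hub and feed URLs below are made up for illustration, and the helper name `build_publish_ping` is hypothetical. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_ping(hub_url: str, topic_url: str) -> Request:
    """Build (but do not send) the POST telling a hub that a feed has new content."""
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical URLs: a datawire hub and a publisher's data release feed.
ping = build_publish_ping(
    "https://hub.example.org/",
    "https://stats.example.gov/releases.atom",
)
print(ping.get_method(), ping.full_url)
print(ping.data.decode("ascii"))
# Sending it would be urllib.request.urlopen(ping); a hub that accepts
# the ping is expected to answer with a 2xx status.
```

The hub then fetches the feed itself, so the publisher never has to upload the data or metadata to the hub.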

The idea is this:

1) establish a datawire hub for aggregating alerts about data releases from third party publishers;
2) provide examples of how publishers can use pubsubhubbub to alert the datawire hub that a dataset has just been released;
3) implement the approach in CKAN....?

The aim would be to produce a datawire hub that aggregates information about newly published datasets and provides a single point of access to information about data releases in a lightweight way.

Rather than requiring manual upload of data files and their associated metadata (eg as used in data.gov.uk), the approach would allow opendata publishers to publish data on their own website and then alert the hub to its publication so it can be centrally indexed (small pieces, loosely joined...).

As a downstream business model, a datawire hub might additionally allow subscribers to set up alerting subscriptions for announcements of data released by a particular body or on a particular theme.

In much the same way that RSS/Atom extensions supported enclosed audio files for podcasts, the datawire hub might promote the use of particular extensions or metadata fields for enclosing datafiles (datapackage standards may be reusable in this context).
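To make the podcast analogy concrete, here is a minimal sketch of an Atom entry that encloses a data file using the standard `rel="enclosure"` link relation from the Atom spec. The function name, URLs, and choice of fields are assumptions for illustration only; a real datawire profile might require different or additional metadata.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialise Atom as the default namespace


def dataset_entry(entry_id, title, updated, page_url, data_url, media_type):
    """Build an Atom <entry> announcing one dataset, with the file as an enclosure."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # Human-readable landing page for the release.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", {"rel": "alternate", "href": page_url})
    # The data file itself, enclosed much as a podcast encloses audio.
    ET.SubElement(
        entry,
        f"{{{ATOM_NS}}}link",
        {"rel": "enclosure", "href": data_url, "type": media_type},
    )
    return entry


# Hypothetical release, for illustration only.
entry = dataset_entry(
    "https://stats.example.gov/releases/2014-q1",
    "Quarterly spending data, 2014 Q1",
    "2014-04-01T09:30:00Z",
    "https://stats.example.gov/releases/2014-q1",
    "https://stats.example.gov/files/spending-2014-q1.csv",
    "text/csv",
)
print(ET.tostring(entry, encoding="unicode"))
```

An enclosure could equally point at a `datapackage.json` rather than a single CSV, which is one way the datapackage standards might be reused.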

rossjones commented 10 years ago

Would this look something like https://github.com/arc64/datawi.re but without the filters?

Should point out that for some publishers, we (DGU) already take this approach in polling for information about datasets from their RSS feeds. A slightly more formalised approach would be great though.

psychemedia commented 10 years ago

@rossjones Yes, exactly that sort of thing. (The filters are for on-use; I was just thinking about the aggregation bit/getting stuff onto the wire.) If folk are using CKAN, Socrata, or whatever, and those systems had plugins (enabled by default for openly licensed data?) for getting stuff onto the wire (if such a wire existed....) then things could bootstrap quite quickly?

rossjones commented 10 years ago

Most definitely. I can't imagine it would be hard to provide datapackages in an Atom feed and notify some pubsubhubbub (for ref: https://code.google.com/p/pubsubhubbub/) installation.

For reading into CKAN, this would partially duplicate the CKAN harvesters (which can harvest other CKAN instances, for example), but that's probably a good thing if you subscribe to push > pull (and who doesn't?).

For writing from CKAN, most CKAN instances already provide RSS feeds of dataset updates, so perhaps the really quick win is providing a datapackages serialisation and adding the appropriate hub notification.

I reckon a reasonable way forward (at least for CKAN) is:

  1. Add datapackage serialis(z)ation to CKAN (can be done in an extension)
  2. Add hub ping on datapackage update (can also be done in an extension) and hub url to RSS feeds
  3. Extend the harvester to accept notifications from a hub for immediate updates.
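
Steps 1 and 3 might be sketched roughly as below. This is not CKAN extension code: `ckan_to_datapackage` and `handle_hub_verification` are hypothetical helpers, the input is assumed to look like a CKAN dataset dict (`name`, `title`, `resources` with `url`/`name`/`format`), and the output is a simplified datapackage-style descriptor. The verification part follows the pubsubhubbub rule that a subscriber confirms a subscription by echoing back `hub.challenge`.

```python
def ckan_to_datapackage(dataset: dict) -> dict:
    """Step 1: map a CKAN-style dataset dict to a minimal datapackage descriptor."""
    return {
        "name": dataset["name"],
        "title": dataset.get("title", ""),
        "resources": [
            {
                "name": res.get("name") or f"resource-{i}",
                "path": res["url"],
                "format": (res.get("format") or "").lower(),
            }
            for i, res in enumerate(dataset.get("resources", []))
        ],
    }


def handle_hub_verification(params: dict, wanted_topics: set) -> tuple:
    """Step 3: answer a hub's verification GET on the harvester's callback URL.

    Returns (status, body). The challenge is echoed only for feeds the
    harvester actually asked to subscribe to.
    """
    if params.get("hub.mode") == "subscribe" and params.get("hub.topic") in wanted_topics:
        return 200, params.get("hub.challenge", "")
    return 404, ""


# Hypothetical dataset and feed URL, for illustration only.
package = ckan_to_datapackage({
    "name": "spending-2014-q1",
    "title": "Quarterly spending data, 2014 Q1",
    "resources": [{"url": "https://stats.example.gov/files/q1.csv", "format": "CSV"}],
})
status, body = handle_hub_verification(
    {
        "hub.mode": "subscribe",
        "hub.topic": "https://datahub.example.org/feeds/dataset.atom",
        "hub.challenge": "abc123",
    },
    {"https://datahub.example.org/feeds/dataset.atom"},
)
print(package["resources"][0]["path"], status, body)
```

Once verified, the hub would POST feed updates to the same callback, and the harvester could trigger an immediate harvest of the affected dataset instead of waiting for its next scheduled run.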

We should try this approach on datahub.io as a PoC for 'real-time' dataset updates, particularly if someone is subscribed to the RSS feed for search results :)