Document options for harvesting/importing/syncing remote metadata/data

JJediny commented 8 years ago

Part of the appeal of today's distributed content generation is not having to maintain metadata/datasets separately when they are better maintained elsewhere by others. It's safe to assume that many JKAN use cases will need/want a hybrid approach: cataloging datasets maintained on JKAN together with those from remote sources.

@pjdufour started work on decompiling a data.json file and importing it as individual yml/md files
There is also the ability to use projects like http://pycsw.org to harvest/decompile a collection of geospatial records from a CSW service, which outputs a series of xml/json files like this example
There is also the ability to run CKAN in development, use its harvester to import remote metadata collections, and issue internal commands like ckan db simple-dump-json FILE_PATH to export those records like this example
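
To make the first option concrete, here is a minimal sketch of splitting a data.json catalog into one md file per dataset. It assumes the catalog follows the Project Open Data layout (a top-level dataset list whose entries carry title, description and distribution). The output field names (title, notes, resources) and the import_catalog, dataset_to_markdown and slugify helpers are my own assumptions for illustration, not @pjdufour's implementation, so they would need to be matched to the actual JKAN dataset schema. Values are written as JSON, which YAML parsers generally accept as flow-style YAML.

```python
import json
import re
from pathlib import Path


def slugify(text):
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def dataset_to_markdown(dataset):
    """Render one data.json dataset entry as a front-matter-only md file."""
    resources = [
        {
            "name": dist.get("title", ""),
            # Prefer a direct download link, fall back to a landing page.
            "url": dist.get("downloadURL") or dist.get("accessURL", ""),
            "format": dist.get("format", ""),
        }
        for dist in dataset.get("distribution", [])
    ]
    # Hypothetical target field names; adjust to the real JKAN schema.
    front_matter = {
        "title": dataset.get("title", ""),
        "notes": dataset.get("description", ""),
        "resources": resources,
    }
    lines = ["---"]
    for key, value in front_matter.items():
        # JSON-encoded values keep quoting and escaping simple.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def import_catalog(catalog, out_dir):
    """Write one md file per dataset and return the paths that were written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for dataset in catalog.get("dataset", []):
        path = out_dir / (slugify(dataset["title"]) + ".md")
        path.write_text(dataset_to_markdown(dataset), encoding="utf-8")
        written.append(path)
    return written
```

Usage would be something like import_catalog(json.load(open("data.json")), "_datasets"), where the _datasets output directory is also an assumption. Datasets whose titles slugify to the same name would overwrite each other, so a real importer would probably want to key on the identifier field instead.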

Harvesting/snapshotting datasets can be a version control nightmare, but it's arguably better than recreating them entirely by hand. The documentation could/should cover a few of the best options/processes out there for getting as close as possible to syncing across multiple remote services. As an alternative/complementary approach, it would also be good to include methods for integrating push notifications or webhooks that, for example, trigger a build and a gulp process to refresh a remote source, so there is a repeatable process for managing the fetch/ingest/transform/import steps. This could then rebuild a new docker container with jkan, or run locally and commit the bulk updates.
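
One way to keep the version control side manageable is to have each refresh compare the newly fetched records against the previous snapshot and only touch what actually changed. The following is a rough sketch of that comparison step only, under the assumption that every record carries a stable identifier key. The fingerprint and plan_sync names are hypothetical, and fetching, transforming and committing are left out.

```python
import hashlib
import json


def fingerprint(record):
    """Hash a record in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_sync(previous, current, key="identifier"):
    """Compare two snapshots and list which record ids need work."""
    old = {record[key]: fingerprint(record) for record in previous}
    new = {record[key]: fingerprint(record) for record in current}
    return {
        # Present remotely but not in the last snapshot.
        "added": sorted(new.keys() - old.keys()),
        # In the last snapshot but gone from the remote source.
        "removed": sorted(old.keys() - new.keys()),
        # Present in both, but the content hash differs.
        "changed": sorted(
            k for k in new.keys() & old.keys() if new[k] != old[k]
        ),
    }
```

A webhook or scheduled build could run this after each fetch and skip the commit entirely when all three lists are empty, which should keep the bulk updates from generating noise in the history.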

JJediny commented 8 years ago

This also points to the need for a canonical source field that identifies whether a record on JKAN is maintained externally or internally
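
As a sketch of how such a field might be used, assume each record is a dict with an optional source key holding the URL it was harvested from, and that a missing or empty value means the record is maintained on JKAN itself. The field name and both helpers are hypothetical. A harvester could then refuse to overwrite anything it does not own:

```python
def is_external(record):
    """A record counts as externally maintained if it names a source."""
    return bool(record.get("source"))


def may_overwrite(existing, harvest_source):
    """Decide whether a harvest run from harvest_source may replace a record.

    existing is the record already in the catalog, or None if there is none.
    """
    if not harvest_source:
        raise ValueError("harvest_source must be a non-empty string")
    if existing is None:
        # Nothing there yet, so the harvester is free to create it.
        return True
    # Internal records (no source) and records from other sources are kept.
    return existing.get("source") == harvest_source
```

With this rule, a locally edited dataset is never clobbered by a refresh, and two remote catalogs cannot overwrite each other's records.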

timwis commented 8 years ago

Interesting idea @JJediny. To be honest, I don't have much experience doing that (I couldn't get ckan harvester to work). I agree we should document it though. Would you be open to working on a page in the wiki about it?

(Also, regarding #62, just waiting to hear from you on the dataset slug question)

timwis / jkan

Document options for harvesting/importing/syncing remote metadata/data #77