Add export functionality

waldoj commented 9 years ago

Something important and missing from this proposal is export functionality. We want it to be simple for somebody to take their data out of such a system to bring it over to another host. No doubt sites are going to outgrow this setup. They need to be able to take all of their data with them to move to a larger host that can accommodate customizations, larger amounts of traffic, etc. That host might be CKAN, but it might be Socrata or Junar or DKAN or whatever else.

rossjones commented 9 years ago

CKAN already has support for RDF/CSV/JSON dumps of the datasets, but these are currently CLI commands (db dump and rdfexport, respectively).

It's often something that you don't want to do at runtime, particularly if you have a lot of packages, but as a nightly task, or a one-off outside the HTTP request flow, it can definitely work.

Perhaps just extending the dumps to ensure they contain the organisation/usernames etc might work?

waldoj commented 9 years ago

> Perhaps just extending the dumps to ensure they contain the organisation/usernames etc might work?

I suspect strongly that this is all that's going to be required. We'll certainly study closely that existing functionality. I've never had to export data from CKAN, but only read about it—I'll log into my CKAN instance in the morning and try it out. :) Thanks for your insights on this!

jqnatividad commented 9 years ago

We should also consider exporting the installation metadata i.e. CKAN version, config, plugins installed, disk space required, etc. as part of the export.

In that way, a user can move to another CKAN provider with confidence that the import will take.

For export, maybe we also can leverage data.json, especially for migrations going to another non-CKAN system.
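To make the installation-metadata idea a bit more concrete, here is a minimal sketch of an export manifest. It assumes the input is a dict shaped like the result of CKAN's `status_show` action (with `ckan_version`, `site_url` and `extensions` keys); the manifest layout, the `build_manifest` name and the example values are hypothetical, and disk space is not covered.

```python
import json


def build_manifest(status, config_path=None):
    """Build a small export manifest from a status_show-style dict.

    The manifest keys are an illustrative assumption, not a CKAN format.
    """
    return {
        "ckan_version": status.get("ckan_version"),
        "site_url": status.get("site_url"),
        # Sorted so that two exports of the same site compare equal.
        "plugins": sorted(status.get("extensions", [])),
        "config_file": config_path,
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example_status = {
        "ckan_version": "2.2",
        "site_url": "http://localhost",
        "extensions": ["stats", "datastore"],
    }
    print(json.dumps(build_manifest(example_status, "production.ini"), indent=2))
```

An importing host could read such a manifest first and refuse, or warn, when the version or plugin list doesn't match.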

waldoj commented 9 years ago

I've been playing with CKAN's various command-line export functions, and I think it's most of the way there. Exporting datasets is pretty good (with paster db simple-dump-json -c /etc/ckan/default/production.ini my_datasets.json), and exporting users works well enough.

But only dataset metadata is exported. To export the data itself, CKAN's docs instruct us to edit Apache's config file, with what's basically a hack: disabling the file handler for the directory where datasets are stored, and turning on directory listings. So you don't actually get the files in an export; you get a directory listing where you can right-click on each file and save it. And then you have to correlate it with the exported metadata, which is possible only awkwardly: the filename (e.g., 8f-4995-4709-9d13-a683693dd8ac) is only the tail end of the resource ID (e.g., 1797da8f-4995-4709-9d13-a683693dd8ac), missing its first six characters.

Proper export functionality needs to include all of the files and provide a direct correlation between each filename and an identifier within the exported metadata. It should also include the config file that drives the site (e.g., development.ini), in case the exported data is going to be used in another CKAN site.
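As a rough illustration of that correlation, here is a sketch that maps a resource ID to a file path and back. It assumes the files sit under a `resources` directory, split as first three characters / next three characters / remainder of the ID, which matches the example IDs above. The storage path and both function names are hypothetical.

```python
import os


def resource_file_path(storage_path, resource_id):
    """Return the assumed on-disk path for a resource ID."""
    return os.path.join(
        storage_path,
        "resources",
        resource_id[0:3],   # e.g. "179"
        resource_id[3:6],   # e.g. "7da"
        resource_id[6:],    # e.g. "8f-4995-4709-9d13-a683693dd8ac"
    )


def resource_id_from_path(path):
    """Rebuild the resource ID from the last three path components."""
    head, tail = os.path.split(path)
    head, middle = os.path.split(head)
    _, first = os.path.split(head)
    return first + middle + tail


if __name__ == "__main__":
    rid = "1797da8f-4995-4709-9d13-a683693dd8ac"
    path = resource_file_path("/var/lib/ckan/default", rid)  # hypothetical path
    print(path)
    print(resource_id_from_path(path) == rid)
```

An export tool could use a mapping like this to write each file next to its resource ID, so nobody has to match names by hand.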

waldoj commented 9 years ago

We've just published an RFP for this small aspect of the overall project. Bidding is open through December 31.

wardi commented 9 years ago

https://github.com/ckan/ckanapi might also be a good place to start. It can already export and import dataset, group and org metadata with multiple connections in parallel. It's also MIT licensed.
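For a sense of what a metadata export over the action API involves, here is a minimal single-threaded sketch, not ckanapi's own implementation. It assumes a client object exposing `action.package_list()` and `action.package_show(id=...)`, which is the calling style ckanapi's `RemoteCKAN` provides. `dump_datasets` and the stand-in client are hypothetical, and the stand-in exists only so the example runs without a CKAN server.

```python
import io
import json


def dump_datasets(ckan, out):
    """Write one JSON object per dataset, one per line; return the count."""
    count = 0
    for name in ckan.action.package_list():
        dataset = ckan.action.package_show(id=name)
        out.write(json.dumps(dataset, sort_keys=True) + "\n")
        count += 1
    return count


class FakeAction:
    """Stand-in for a real client's action interface (illustration only)."""

    def package_list(self):
        return ["budget-2014", "road-closures"]

    def package_show(self, id):
        return {"id": id, "name": id, "resources": []}


class FakeCKAN:
    action = FakeAction()


if __name__ == "__main__":
    buf = io.StringIO()
    written = dump_datasets(FakeCKAN(), buf)
    print(written)
    print(buf.getvalue(), end="")
```

With ckanapi installed, the stand-in could be swapped for `ckanapi.RemoteCKAN("http://localhost")` (the URL here is a placeholder). Groups and organizations would need the same treatment with their own list and show actions.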

waldoj commented 9 years ago

I had no idea that ckanapi had that particular functionality! I'll add that to the RFP as a possible path. Thanks so much, @wardi.

wardi commented 9 years ago

Work on this is close to complete: https://github.com/ckan/ckanapi/pull/37
