opendata / Open-Data-Needs

An ongoing effort to catalog the holes in the open data ecosystem. [RETIRED]

Vastly simplify the deployment of an open data repository #5

Open waldoj opened 10 years ago

waldoj commented 10 years ago

This is a need on the supply side of open data, not the demand side.

Installing CKAN, DKAN, or Open Data Catalog is far too difficult for most government agencies. They lack the technical capacity to do so in-house, and they do not have the budget to pay a third party to do so. Commercial offerings aren't financially viable for many municipal governments, at least not yet. (To be clear, it's not that they don't have $1,000 or $10,000 or whatever, it's that the agency or government has not yet prioritized the provision of open data sufficiently to allocate resources to the effort. Staff has to do this within their existing job duties, with no funding.) Creating a Docker instance or an EC2 image would not help—those are both far too complex for this target audience.

We need a trivially simple way for a small agency to establish an open data site. For example, imagine that multiple CKAN instances could be hosted on a single server, sharing a Solr instance. (I have no idea if that's so, but let's pretend.) Imagine that a 4-core, 8GB server ($80/month on DigitalOcean) could host 50 of these low-traffic, low-data sites. Anybody with a .gov address (or non-.gov-having governments, subject to manual approval) could deploy a new data catalog. With a suitably generic domain name (e.g., datacatalog.tld) and descriptive subdomains (e.g., elko-nv.datacatalog.tld), a government could have its own data catalog.
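As it happens, CKAN is designed to allow exactly this: its ckan.site_id setting namespaces each instance's entries in a shared Solr index. Here's a minimal sketch of what tenant provisioning could look like, assuming one config file and one Postgres database per government; the datacatalog.tld domain, file paths, and helper names are invented for illustration, not an existing tool.

```python
import subprocess
from pathlib import Path

# Hypothetical shared-hosting layout: one Postgres server, one shared
# Solr core, and one CKAN config file per tenant.
SOLR_URL = "http://127.0.0.1:8983/solr/ckan"  # single Solr core, shared by all
CONFIG_DIR = Path("/etc/ckan/tenants")

CONFIG_TEMPLATE = """\
[app:main]
use = egg:ckan
ckan.site_url = https://{subdomain}.datacatalog.tld
ckan.site_id = {subdomain}
sqlalchemy.url = postgresql://ckan:{db_password}@localhost/ckan_{subdomain}
solr_url = {solr_url}
"""

def provision_tenant(subdomain: str, db_password: str) -> Path:
    """Create a database and a CKAN config file for one government."""
    # One Postgres database per tenant keeps each catalog's data isolated;
    # ckan.site_id keeps its Solr index entries separate in the shared core.
    subprocess.run(["createdb", f"ckan_{subdomain}"], check=True)
    config_path = CONFIG_DIR / f"{subdomain}.ini"
    config_path.write_text(
        CONFIG_TEMPLATE.format(
            subdomain=subdomain, db_password=db_password, solr_url=SOLR_URL
        )
    )
    return config_path

# e.g. provision_tenant("elko-nv", "s3cret") -> /etc/ckan/tenants/elko-nv.ini
```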

Of course, the idea wouldn't be to provide free hosting indefinitely, but instead to make it possible for a government to dip its toes into providing a data catalog. So we'd need a method to get them to stop using this free service, and start doing it on their own. A ticking clock (maybe a one-year limit) or a low cap on the number of datasets that they can provide (perhaps fifty) both come to mind.
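Either limit would be easy to check mechanically. A sketch of the dataset cap using CKAN's package_search action (the base URL and organization id below are hypothetical):

```python
import requests

def over_quota(base_url: str, org_id: str, limit: int = 50) -> bool:
    """Return True once an organization has published `limit` datasets."""
    # package_search returns a total count; rows=0 skips the results themselves.
    resp = requests.get(
        f"{base_url}/api/3/action/package_search",
        params={"fq": f"owner_org:{org_id}", "rows": 0},
    )
    return resp.json()["result"]["count"] >= limit

# e.g. over_quota("https://elko-nv.datacatalog.tld", "elko-nv-org-id")
```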

Best of all, this could all be built as an open source project that anybody could deploy, meaning that as many organizations as wanted to provide this service could do so. A state government could deploy this for its localities, so that any locality can create an open data repository. That, of course, would not need to self-destruct after a year, or have a cap of fifty datasets.

waldoj commented 10 years ago

We'd need to provide a way to make it trivial to transfer data from the trial site to the permanent site. We can't very well expect people to start over again. And it needs to be vendor-neutral. If this service ran on CKAN, but an agency is moving to Junar, we want to make sure that there's a file that they can download that contains all of their records, or that Junar can pull all of the data down from the CKAN repository.
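With CKAN as the trial platform, that downloadable file already falls out of the action API. A rough sketch (the site URL and output file name are hypothetical):

```python
import json
import requests

def export_catalog(base_url: str, outfile: str = "catalog-export.json") -> None:
    """Dump every dataset's metadata from a CKAN site to one JSON file."""
    # package_list returns every dataset name; package_show returns the
    # full metadata record for each one.
    names = requests.get(f"{base_url}/api/3/action/package_list").json()["result"]
    datasets = []
    for name in names:
        pkg = requests.get(
            f"{base_url}/api/3/action/package_show", params={"id": name}
        ).json()["result"]
        datasets.append(pkg)
    with open(outfile, "w") as f:
        json.dump(datasets, f, indent=2)

# e.g. export_catalog("https://elko-nv.datacatalog.tld")
```

A receiving vendor like Junar could use the same two API calls to pull the records directly; the resource files themselves would still need to be fetched from each resource's url field, but the metadata round-trips cleanly.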

rufuspollock commented 10 years ago

@waldoj good thoughts - I'd note https://github.com/ckan/ckan/pull/1724 (plus https://github.com/ckan/ideas-and-roadmap/issues/23, which covers general Docker work). Whilst not of direct benefit here, that work will simplify deploys. And if people want to try stuff out, you can get a free organization account on http://datahub.io/ in a few seconds.

waldoj commented 10 years ago

I didn't know y'all were Docker-ifying CKAN—that's wonderful! That will certainly make it really, really easy to create a standard platform for spinning up new instances.

You're right that there are definitely going to be agencies who are quite happy to host on datahub—thank you so much for mentioning that here. (Some are going to be concerned about commingling their data on a site that also hosts others' data, an objection that I also see raised about GitHub. I do not sympathize with this objection, but I do understand it.) Separately, we're working on a how-to guide to publishing open data, and I've noted that we need to promote datahub in that context.

Thank you, @rgrp!

waldoj commented 10 years ago

I hadn't visited datahub since prior to the move to organizations last October. What a huge difference. I barely recognize it. It used to be thick with spam, and now every dataset that I see is clearly legit. What a big improvement.

rgradeck commented 10 years ago

We're exploring this shared infrastructure concept at the regional level in Southwestern PA - we are considering all aspects of infrastructure, including technology, training, data wrangling, standards, legal, and community engagement. We have many potential users - some with, but most without, the capacity to do this on their own. Still in the conceptual stage, but moving ahead this summer. We would benefit from any insights on how best to approach this (not only the technology), and look forward to sharing back our experiences.

waldoj commented 10 years ago

@rgradeck, our conversation is part of what has me thinking about this need. Organizations like yours are in a position to provide data hosting for entities that don't have the ability to do so on their own. If you could do so for $5/month/entity (the hosting costs you'd have to pay), would that be attractive to you?

rgradeck commented 10 years ago

Any model is fair game. Our choices aren't going to be evaluated solely on cost - it's important that governments and other providers find it easy to prepare and load data, and see value in making their information available to others. We also want to make sure that if something breaks, there's a backup, and if something better comes along, we can get the data back out.

brianjgeiger commented 9 years ago

It's probably worth mentioning that, if the data has value to scientists, the Open Science Framework might be a good place to store files. http://osf.io/ It's not necessarily useful for all agencies, but there are definitely groups that could take advantage of it.

waldoj commented 9 years ago

:+1:

florianm commented 9 years ago

Late to the party... there's another level of difficulty: some agencies, if not all, have sensitive datasets which in their full resolution are unfit for public release.

However, all contributions from data custodians happen on the sensitive, non-public level. Processing data to remove sensitive information and make it fit for public release happens after that.

Our solution is to simply run several CKAN catalogues - one non-public internal-only, one external (plus one as a testing sandbox). I've documented this multi-tenant setup at https://github.com/opendata/CKAN-Multisite/issues/10#issuecomment-70068327 and am currently working to simplify it (work in progress - http://govhack2015.readthedocs.org/en/latest/).
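For the promotion step from the internal to the external catalogue, CKAN's action API is enough. Here's a sketch under invented assumptions: the host names and API key are placeholders, and the convention that sensitive metadata lives in extras with a sensitive_ key prefix is hypothetical, not part of CKAN.

```python
import requests

INTERNAL = "https://ckan.internal.example.org"  # non-public catalogue (hypothetical)
EXTERNAL = "https://data.example.org"           # public catalogue (hypothetical)
API_KEY = "..."  # key of a user allowed to create datasets on EXTERNAL

def publish_sanitized(dataset_id: str) -> None:
    """Copy a dataset from the internal CKAN to the public one,
    dropping anything marked sensitive along the way."""
    pkg = requests.get(
        f"{INTERNAL}/api/3/action/package_show", params={"id": dataset_id}
    ).json()["result"]

    # Assumed convention: extras whose key starts with "sensitive_"
    # must never leave the internal catalogue.
    pkg["extras"] = [
        e for e in pkg.get("extras", []) if not e["key"].startswith("sensitive_")
    ]
    # Drop resources that point back into the internal site.
    pkg["resources"] = [
        r for r in pkg.get("resources", []) if not r["url"].startswith(INTERNAL)
    ]
    for field in ("id", "revision_id"):  # let the public site assign its own ids
        pkg.pop(field, None)

    requests.post(
        f"{EXTERNAL}/api/3/action/package_create",
        json=pkg,
        headers={"Authorization": API_KEY},
    ).raise_for_status()
```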

Without wanting to impinge on CKAN PaaS providers, our self-hosted "docker-at-home" approach aims to cover the middle ground between a work-intensive source install and hosted off-site-only solutions like Datashades' CKAN galvanize or the off-site/on-site Datacats.

I would appreciate any constructive feedback on our currently datacats-based setup and will share my experiences there.