What else needs to be done? What's missing?

waldoj commented 9 years ago

Did we get everything? Does anything else need to be done in order to accomplish our goal?

jqnatividad commented 9 years ago

Beyond making CKAN more cloud-ready, you may want to take this opportunity to address some items on the wishlist as well:

https://github.com/ckan/ideas-and-roadmap

A lot of these ideas already come from the CKAN global community, and these signals highlight the pain points and gaps from various CKAN implementations.

jqnatividad commented 9 years ago

The multisite project will go a long way towards democratizing Open Data, for sure. Along with having mechanisms to make sure that our Open Data future works like the web (data portals linked to one another), you may also want to think about the data layer, and what ODI can do in this project to encourage federation.

Some other items that come to mind:

data.json and Project Open Data compliance (we forked Joshua's HHS implementation, as it's HHS-specific: https://github.com/HHS/ckanext-datajson. Perhaps make it more generic and easier to configure with a web interface?)
schema.org integration, especially now that it supports http://schema.org/DataCatalog and http://schema.org/Dataset. People look for data using search engines.
Analytics/API management dashboard, with API tracking and quota/throttling capabilities. In opendata.city, we're doing this by extending existing Google Analytics support to recognize API use, as well as integrating 3scale API usage. We're even thinking of getting some metrics from GA and automatically exposing them as Open Data for each instance.
You may also want to look into extensions like https://github.com/open-data/ckanext-scheming, which promotes schema sharing/standardization amongst CKAN instances.
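
To make the schema.org item a little more concrete, here is a minimal sketch of turning one data.json dataset entry into schema.org/Dataset JSON-LD that a page could embed for search engines. The function name and the field mapping are assumptions for illustration, not something ckanext-datajson provides; the input keys (`title`, `description`, `keyword`, `landingPage`) follow the Project Open Data metadata schema.

```python
import json


def dataset_to_jsonld(entry):
    """Map a Project Open Data style dataset entry to schema.org/Dataset JSON-LD.

    Hypothetical helper: the mapping below is one reasonable choice, not a standard.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": entry["title"],
        "description": entry.get("description", ""),
        "keywords": entry.get("keyword", []),
    }
    # Only emit a URL when the catalog entry actually has a landing page.
    if "landingPage" in entry:
        doc["url"] = entry["landingPage"]
    return doc


# Made-up example entry on a placeholder domain.
example = {
    "title": "Street Trees",
    "description": "Inventory of city-maintained trees.",
    "keyword": ["trees", "parks"],
    "landingPage": "https://data.example.org/dataset/street-trees",
}
jsonld = dataset_to_jsonld(example)
print(json.dumps(jsonld, indent=2))
```

The printed JSON would go inside a `<script type="application/ld+json">` element on the dataset page.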

rgradeck commented 9 years ago

Would be great to see a blog post about what you learned from this process once you're a little further on your way.

waldoj commented 9 years ago

You bet!

jqnatividad commented 9 years ago

You may want to also consider some form of facilitated federation. That is, as each new CKAN multisite instance is spun up, the publisher can optionally be prompted to be included in a central registry.

This registry can show the publisher's information. The publisher can even ask that their catalog metadata be made available for harvesting.

Future iterations of the registry can even support federated search across catalogs.
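
As a rough sketch of the opt-in registry idea, assuming an in-memory store and made-up class and field names (nothing here is an existing CKAN API):

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    site_url: str
    publisher: str
    allow_harvest: bool = False  # harvesting is off unless the publisher opts in


@dataclass
class Registry:
    entries: dict = field(default_factory=dict)

    def register(self, entry):
        # Keyed by site URL, so registering the same site again updates it in place.
        self.entries[entry.site_url] = entry

    def harvestable(self):
        # Only sites whose publisher opted in are offered to harvesters.
        return sorted(url for url, e in self.entries.items() if e.allow_harvest)


# Placeholder sites for illustration.
registry = Registry()
registry.register(RegistryEntry("https://a.example.org", "City A", allow_harvest=True))
registry.register(RegistryEntry("https://b.example.org", "City B"))
harvest_targets = registry.harvestable()
print(harvest_targets)
```

A federated search could later iterate over the same list of opted-in sites.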

waldoj commented 9 years ago

Yes. Dataset registries are really important. This is quite likely a thing that is missing, for the intended purpose of this project. After all, the host of each CKAN Multisite server surely wants to keep up with all datasets hosted within the sub-sites. (I know that's not quite what you're describing, but it's the same mechanism.) I intend to bundle the ckanext-datajson extension with this, so of course the host could just poll each site's /data.json file, but a mechanism to (optionally!) ping one or more URLs when a dataset is added or updated seems potentially very useful.
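
A minimal sketch of how the two approaches could fit together, assuming each site's /data.json follows the Project Open Data layout (a top-level `dataset` list whose entries carry `identifier` and `modified`). The function names and the ping payload shape are hypothetical, and the actual HTTP calls are left out:

```python
def changed_datasets(previous, current):
    """Compare two data.json snapshots; return (identifier, event) pairs."""
    seen = {d["identifier"]: d.get("modified") for d in previous.get("dataset", [])}
    changes = []
    for d in current.get("dataset", []):
        ident = d["identifier"]
        if ident not in seen:
            changes.append((ident, "added"))
        elif seen[ident] != d.get("modified"):
            changes.append((ident, "updated"))
    return changes


def build_pings(site_url, changes, ping_urls):
    """Build one notification per (ping URL, change); the payload shape is an assumption."""
    return [
        {"target": target, "payload": {"site": site_url, "identifier": ident, "event": event}}
        for target in ping_urls
        for ident, event in changes
    ]


# Made-up snapshots of one sub-site's catalog.
old = {"dataset": [{"identifier": "a", "modified": "2015-01-01"}]}
new = {"dataset": [{"identifier": "a", "modified": "2015-02-01"},
                   {"identifier": "b", "modified": "2015-02-01"}]}
changes = changed_datasets(old, new)
pings = build_pings("https://a.example.org", changes, ["https://host.example.org/ping"])
print(changes)
```

With no ping URLs configured, `build_pings` returns an empty list, which keeps the mechanism optional.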

jqnatividad commented 9 years ago

If this project achieves its main goal, it's foreseeable that there will be thousands, if not millions, of CKAN data repositories, hopefully organically linked and federated, with each repo ideally closest to and populated by the data producer.

So registries are really a necessary part of the project as discoverability issues will naturally follow this explosion of data repos.

From the data consumer side, ODI may want to think about how to create facilitated discovery.

Beyond making sure that ckanext-datajson is bundled in, an effort should be made to make the automated creation of expressive catalog/dataset metadata the default, rather than just the barebones, manually entered metadata.

ODI may even want to go further than Project Open Data guidance and include additional metadata that can be automatically computed and used for discovery - like bounding boxes for geospatial data, and the date range of a dataset.

In our implementation, we try to do this by contextualizing datasets through time and place, along with the usual tags and good metadata publishing practices as espoused by Project Open Data.
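
The automatically computed metadata mentioned above (a bounding box and a date range) could be derived along these lines. This is only a sketch: the column names `lat`, `lon`, and `date` are assumptions about the dataset, and the naive min/max bounding box does not handle data that crosses the antimeridian.

```python
from datetime import date


def spatial_temporal_extent(rows, lat="lat", lon="lon", when="date"):
    """Compute a bounding box and an ISO 8601 date interval from tabular rows."""
    lats = [float(r[lat]) for r in rows]
    lons = [float(r[lon]) for r in rows]
    dates = [date.fromisoformat(r[when]) for r in rows]
    return {
        # Order: west, south, east, north.
        "bbox": [min(lons), min(lats), max(lons), max(lats)],
        # Start/end interval, the shape Project Open Data uses for "temporal".
        "temporal": f"{min(dates).isoformat()}/{max(dates).isoformat()}",
    }


# Made-up rows, with values as strings the way they would arrive from a CSV.
rows = [
    {"lat": "38.03", "lon": "-78.48", "date": "2015-03-01"},
    {"lat": "37.54", "lon": "-77.43", "date": "2014-11-15"},
]
extent = spatial_temporal_extent(rows)
print(extent)
```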

rgradeck commented 9 years ago

One other thing that comes to mind, though not necessarily for this first round, is the ability for administrators to scan datasets within the repository for PII or other sensitive info (ideally prior to going public). We can encourage good practices re. records management, but an additional layer of protection would be welcome.
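
A very rough sketch of what such a scan might look like, assuming tabular data and a handful of US-centric regular expressions. The patterns and names here are illustrative assumptions; a real scanner would need far broader coverage and would still produce false positives and misses.

```python
import re

# Illustrative patterns only; these are assumptions, not a vetted PII rule set.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_rows(rows):
    """Return (row_index, column, kind) for every cell that matches a pattern."""
    findings = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.append((i, column, kind))
    return findings


# Made-up rows for illustration.
sample = [
    {"name": "Jane", "contact": "jane@example.org"},
    {"name": "Bob", "contact": "555-123-4567"},
    {"name": "Ann", "contact": "123-45-6789"},
]
findings = scan_rows(sample)
print(findings)
```

Flagged cells would go to an administrator for review before the dataset is made public, rather than being blocked automatically.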

wardi commented 9 years ago

@jqnatividad some of this is related to https://github.com/ckan/ideas-and-roadmap/issues/48 and your ticket https://github.com/ckan/ideas-and-roadmap/issues/59

jqnatividad commented 9 years ago

Yes. Looking forward to https://github.com/boxkite/ckan-multisite.

Hopefully, it will lay the groundwork for these ideas, along with making admin easier in general (https://github.com/opendata/CKAN-Multisite/issues/8), and multisite admin can be extended to manage other CKAN ini settings as well.
