mheadd / communitydata.io

A centralized data portal bringing together data from community run data portals across the U.S.
http://communitydata.io

Add Harvester Extension #2

Closed mheadd closed 9 years ago

mheadd commented 9 years ago

Add the harvester extension to allow remote harvesting of metadata from community managed sites.

https://github.com/ckan/ckanext-harvest

deniszgonjanin commented 9 years ago

I can help with this, but this is mostly sysadmin work - not so much code. Would need server access

mheadd commented 9 years ago

Awesome! I've installed CKAN a few times on plain vanilla Ubuntu, but I've never gotten the harvester extension working properly.

The working instance I set up of CKAN is a few months old and there may be a more recent version we can use. Current details here.

It may make sense to get a fresh instance set up where multiple users can access the server to assist. Let me see about getting that set up. Back to you shortly.

scuerda commented 9 years ago

I wonder, is there a way to separate the development of the harvesters from the process of collecting the information required from each portal, as a way to let non-devs participate?

deniszgonjanin commented 9 years ago

@mheadd sweet. If you need any help, let me know. I must've set up countless CKANs by now and I help host and manage a number of big ones

mheadd commented 9 years ago

@scuerda We definitely need to do outreach to the managers of community portals. Pretty sure that they'll all need to provide a data.json file as well for the harvester to work.
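As a rough illustration of what "providing a data.json file" could mean in practice, here is a minimal sketch of a pre-flight check that a volunteer might run on a portal's file before anyone configures a harvester. The function name `check_data_json` and the short list of required fields are assumptions for illustration (the fields are borrowed from the Project Open Data metadata schema); nothing in this thread specifies them.

```python
import json

# Fields treated as required for each dataset entry. This short list is an
# assumption for illustration, borrowed from the Project Open Data schema.
REQUIRED_FIELDS = ("title", "description", "identifier")


def check_data_json(text):
    """Return a list of problems found in a data.json document (empty if none)."""
    try:
        catalog = json.loads(text)
    except ValueError as exc:
        return ["not valid JSON: %s" % exc]

    # Newer files wrap datasets in {"dataset": [...]}; older ones are a bare list.
    datasets = catalog.get("dataset") if isinstance(catalog, dict) else catalog
    if not isinstance(datasets, list):
        return ["no dataset list found"]

    problems = []
    for index, entry in enumerate(datasets):
        if not isinstance(entry, dict):
            problems.append("dataset %d is not an object" % index)
            continue
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append("dataset %d is missing %r" % (index, field))
    return problems
```

An empty result only means the file passes this simple check, not that any particular harvester will accept it.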

Non-devs can most definitely help with outreach and promotion.

Other ideas?

mheadd commented 9 years ago

@deniszgonjanin You've probably got way more experience installing + managing CKAN than I do. I know pretty much just enough to get into trouble. ;-)

Did you want to take the lead on getting an instance of CKAN 2.3 set up, so that we can begin to get the harvester installed and set up?

I have the communitydata.io domain and can point to a new instance if you have time to get one set up.

Thoughts?

scuerda commented 9 years ago

@mheadd I was thinking that an approach similar to that used by https://github.com/unitedstates/contact-congress using yamls to spec out the harvester inputs might work to separate the development of the harvester from the collection of information. It would also take the burden off the portal managers for pulling information together. @deniszgonjanin Would such an approach save much time?
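To make the YAML idea concrete, here is a minimal sketch of how a per-portal spec could be validated and turned into harvester settings. Everything in it is hypothetical: the file layout, the key names (`name`, `title`, `url`, `type`, `frequency`), the allowed types, and the function `spec_to_harvest_source` are illustrative and are not taken from contact-congress or from the harvester extension. It assumes the YAML file has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`).

```python
# Hypothetical spec a non-developer might write for one portal, for example
# in portals/example-city.yaml (key names are illustrative only):
#
#   name: example-city
#   title: Example City Open Data
#   url: http://data.example.org/
#   type: ckan
#
# The dict below is what a YAML parser would return for that file.
EXAMPLE_SPEC = {
    "name": "example-city",
    "title": "Example City Open Data",
    "url": "http://data.example.org/",
    "type": "ckan",
}

REQUIRED_KEYS = ("name", "title", "url", "type")
KNOWN_TYPES = ("ckan", "datajson")  # assumed set of portal types


def spec_to_harvest_source(spec):
    """Validate a parsed portal spec and return harvester settings as a dict."""
    missing = [key for key in REQUIRED_KEYS if not spec.get(key)]
    if missing:
        raise ValueError("spec is missing: " + ", ".join(missing))
    if spec["type"] not in KNOWN_TYPES:
        raise ValueError("unknown portal type: %s" % spec["type"])
    return {
        "name": spec["name"],
        "title": spec["title"],
        "url": spec["url"].rstrip("/"),  # normalize trailing slash
        "source_type": spec["type"],
        "frequency": spec.get("frequency", "manual"),  # default is an assumption
    }
```

With a layout like this, contributors would only edit small spec files, and the developers would maintain the one function that consumes them.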

deniszgonjanin commented 9 years ago

@scuerda it just might. Harvesting data from a bunch of portals is no doubt going to be a handful. Some of them are already CKAN, and since they're community sites, their schema is probably close to the default CKAN schema. These will be fairly easy.

I'd say we get set up first, run some tests and see how difficult the process is. Then we can figure out what's the best way to take the pain away. I agree we want to decouple the configuration of the harvesters from the developers as much as possible.
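For the portals that already run CKAN, a quick test could start from the standard CKAN action API before involving the harvester at all. The sketch below builds a `package_search` URL and summarizes a response body; the helper names and the example portal URL are hypothetical, and fetching the URL is left out so the snippet runs without network access.

```python
import json
from urllib.parse import urlencode


def package_search_url(portal_url, start=0, rows=100):
    """Build a CKAN action API URL that pages through a portal's datasets."""
    query = urlencode({"start": start, "rows": rows})
    return "%s/api/3/action/package_search?%s" % (portal_url.rstrip("/"), query)


def dataset_summaries(response_text):
    """Extract name, title, and resource count from a package_search response."""
    envelope = json.loads(response_text)
    if not envelope.get("success"):
        raise ValueError("CKAN API call failed")
    return [
        {
            "name": pkg["name"],
            "title": pkg.get("title", ""),
            "resources": len(pkg.get("resources", [])),
        }
        for pkg in envelope["result"]["results"]
    ]
```

Running this against a few portals would give a rough sense of how close each one is to the default schema.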

deniszgonjanin commented 9 years ago

@mheadd yes, I can set one up in no time. We have a couple of options:

I can set up an instance using datacats, which is my preferred way but I don't want to force it on anybody. We have customers using it, so it's ready for production, and it would allow multiple people to push-deploy and easily share projects, like on Heroku. It automatically backs up everything. It's also free.

The other option I like is to just set up a base CKAN instance on DigitalOcean, old school.

I'm really happy to do either.

mheadd commented 9 years ago

I vote for datacats. That's been on my radar for a bit to play with anyway - would be a great way to get started. Plus that will help support the collaborative approach to this.

Appreciate your help on this front. Beers will be owed.

waldoj commented 9 years ago

> I was thinking that an approach similar to that used by https://github.com/unitedstates/contact-congress using yamls to spec out the harvester inputs might work to separate the development of the harvester from the collection of information.

:+1: I have had excellent experiences with pulling metadata and even some logic out of software and putting it in YAML. It really drives down the MVP for contributions.

> I can set up an instance using datacats, which is my preferred way but I don't want to force it on anybody.

:+1: :)

deniszgonjanin commented 9 years ago

Site is running at http://community.datacats.io/. @mheadd if you create a CNAME/ALIAS for communitydata.io that points to community.datacats.io., it should be good to go.

deniszgonjanin commented 9 years ago

re: harvester, I'm at a hackathon on Saturday in Niagara, so now I have a project to work on. I'll see if I can get as far as harvesting some of the CKAN based community sites.

mheadd commented 9 years ago

@deniszgonjanin Boom - http://communitydata.io.

I can be online this weekend, so if I can help with the harvester, let me know.