timwis / jkan

A lightweight, backend-free open data portal, powered by Jekyll
https://jkan.io
MIT License
219 stars, 309 forks

Add contrib code for scraping #104

Closed patcon closed 8 years ago

patcon commented 8 years ago

A Scrapy pipeline could help people scrape organizations, datasets, and some of their metadata from city data portals.

Once I get this sorted out for myself a little bit better, happy to contribute it back.

This could also involve creating a custom storage backend that pushes scraped files directly to the GitHub Pages site. This could run regularly via Heroku Scheduler.

Scrapy also has an S3 storage backend, and it could make more sense to use that, but I'd hate to lose all the nifty gatekeeper stuff :)

Ref: https://github.com/CivicTechTO/scrapers-to-data-portal
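
For illustration, here is a rough sketch of what such a pipeline might look like. It only relies on Scrapy's item pipeline hooks (`open_spider` and `process_item`), so it needs no Scrapy import. The class name `DatasetMarkdownPipeline`, the `slugify` and `render_dataset` helpers, the `_datasets` output directory, and the front matter fields (`title`, `organization`, `notes`, `resources`) are all assumptions made for this example, not something confirmed by the linked repo or by JKAN's schema.

```python
import json
import re
from pathlib import Path


def slugify(text):
    """Lowercase the text and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def render_dataset(item):
    """Render one scraped item as a Markdown file body with YAML front matter.

    The field names here are assumed for illustration; adjust them to the
    portal's actual dataset schema.
    """
    # json.dumps produces double-quoted strings, which YAML also accepts,
    # so titles containing colons or quotes stay valid.
    lines = [
        "---",
        f"title: {json.dumps(item['title'])}",
        f"organization: {json.dumps(item.get('organization', ''))}",
        f"notes: {json.dumps(item.get('notes', ''))}",
        "resources:",
    ]
    for res in item.get("resources", []):
        lines.append(f"  - name: {json.dumps(res['name'])}")
        lines.append(f"    url: {json.dumps(res['url'])}")
        lines.append(f"    format: {json.dumps(res.get('format', ''))}")
    lines.append("---")
    return "\n".join(lines) + "\n"


class DatasetMarkdownPipeline:
    """Scrapy-style item pipeline that writes one .md file per scraped dataset."""

    def __init__(self, out_dir="_datasets"):
        # Hypothetical output directory for the generated dataset files.
        self.out_dir = Path(out_dir)

    def open_spider(self, spider):
        # Called once when the spider starts: make sure the directory exists.
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        # Called for every scraped item: write it out, then pass it along.
        path = self.out_dir / f"{slugify(item['title'])}.md"
        path.write_text(render_dataset(dict(item)), encoding="utf-8")
        return item
```

In a Scrapy project, a class like this would be enabled through the `ITEM_PIPELINES` setting. Pushing the generated files to the site would still be a separate step.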

timwis commented 8 years ago

Sounds like a cool project. I'm not completely clear, though, on what you mean by "add contrib code for scraping"? You could (and it looks like you have) write a scraper that creates the dataset .md files, or you could write one that creates a data.json file and use @JJediny's plugin to pull it in. Or are you talking about scraping the data itself?
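
To make the data.json option concrete, here is a minimal sketch that turns a list of scraped records into a catalog file. The output keys (`dataset`, `title`, `description`, `publisher`, `distribution`, `downloadURL`, `format`) follow the Project Open Data catalog layout. The input record shape and the `build_catalog` and `write_catalog` names are hypothetical, and which fields the plugin actually reads is an assumption worth checking against its docs.

```python
import json


def build_catalog(records):
    """Map scraped records (assumed shape) onto a data.json-style catalog dict."""
    entries = []
    for rec in records:
        entries.append({
            "title": rec["title"],
            "description": rec.get("notes", ""),
            "publisher": {"name": rec.get("organization", "")},
            "distribution": [
                {
                    "title": res["name"],
                    "downloadURL": res["url"],
                    "format": res.get("format", ""),
                }
                for res in rec.get("resources", [])
            ],
        })
    return {"dataset": entries}


def write_catalog(records, path="data.json"):
    """Serialize the catalog to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_catalog(records), fh, indent=2)
```

A scraper could collect its records in memory and call `write_catalog(records)` once at the end of the run.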

patcon commented 8 years ago

I'm only now finding time to look into @JJediny's plugin, and it looks pretty rad, especially if it allows multiple endpoints. Thanks for the pointer. I'll investigate more later.

timwis commented 8 years ago

Sounds good. I'll close this issue for now, but feel free to reopen if there's more to discuss.