Closed simonw closed 7 years ago
The florida-shelters.json file generated from http://www.floridadisaster.org/shelters/summary.aspx is a useful starting point... but the really interesting data comes from FEMA. It's a lot larger and potentially harder to work with, but if you're looking for an ambitious, impactful task try doing the above but for these JSON files instead:
Nit: should that be florida-shelters.json?
@simonw?
Typo fixed :)
That "commit that referenced this issue" above is junk - one of the shelters mentioned in there happened to have #1 in its name. Unfortunately I don't think I can delete rogue commits-that-referenced-this-issue.
I have a weak preference for getting this solved in Python because that would make it trivial for me to hook it into the rest of the bot scraping infrastructure I have set up already - but getting this solved quickly is much more important so feel free to use whatever technology you are most comfortable with.
perfect, Python and I are buds
I've built the first one of these, which compares the Florida shelters site with the shelters listed on irma-api. Here's the implementation: https://github.com/simonw/irma-scrapers/commit/113b22183b9926bbd83ac842b874b70110c20c80
This repo now contains the (revision-tracked) output of a number of different scrapers.
We have a team of people working to keep https://irma-api.herokuapp.com/ up-to-date - that's the dataset that powers https://www.irmashelters.org/
We have avoided automatically adding shelters to irma-api because the data sources are often unreliable, and we feel it is better to have a human being manually check each entry, avoid duplicates, clean up the data and maybe use other sources to confirm that each shelter exists and is open.
BUT... there's no reason we can't have automated tools make suggestions to our human editors.
Here's the task: write code which pulls the current list of shelters from https://irma-api.herokuapp.com/api/v1/shelters and then pulls one of the scraped JSON files from this repo - then compares the two and tries to detect shelters that are missing from irma-api.
A useful starting point would be this file, which is scraped from http://www.floridadisaster.org/shelters/summary.aspx - the most recent JSON can be loaded from here: https://raw.githubusercontent.com/simonw/irma-scraped-data/master/florida-shelters.json
A useful script would do the following:
The way my irma-dupe-detection script works may be a useful inspiration: https://github.com/simonw/irma-scrapers/blob/59cf9906d1005972ab2e172ec494336b7b8b8434/irma_shelters.py#L174-L217
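For anyone picking this up, here is a minimal sketch of the comparison task in Python. It assumes each source yields a list of dicts with a `name` field; that key, and the helper names `load_json`, `normalize` and `find_missing`, are hypothetical, so check the real field names in irma-api and florida-shelters.json before relying on it. It matches on normalized name only, so it can miss renamed shelters or flag near-duplicates. The result is meant as suggestions for the human editors, not automatic additions.

```python
import json
import re
import urllib.request

API_URL = "https://irma-api.herokuapp.com/api/v1/shelters"
SCRAPED_URL = (
    "https://raw.githubusercontent.com/simonw/irma-scraped-data/"
    "master/florida-shelters.json"
)


def load_json(url):
    """Fetch a URL and decode its body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def normalize(name):
    """Lowercase a name and collapse punctuation/whitespace runs to one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def find_missing(api_shelters, scraped_shelters,
                 api_key="name", scraped_key="name"):
    """Return scraped shelters whose normalized name is not in the API list.

    ASSUMPTION: both arguments are lists of dicts, and the key names
    default to "name". The real schemas may use different keys.
    """
    known = {
        normalize(shelter[api_key])
        for shelter in api_shelters
        if shelter.get(api_key)
    }
    return [
        shelter
        for shelter in scraped_shelters
        if shelter.get(scraped_key)
        and normalize(shelter[scraped_key]) not in known
    ]
```

To try it against live data, something like `find_missing(load_json(API_URL), load_json(SCRAPED_URL))` may work if both URLs return a plain list. If either response wraps its list in an object, unwrap it first and pass the right key names.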