simonw / disaster-data

Data scraped by https://github.com/simonw/disaster-scrapers

Write code to detect shelters that are missing from irma-api #1

Closed simonw closed 7 years ago

simonw commented 7 years ago

This repo now contains the (revision-tracked) output of a number of different scrapers.

We have a team of people working to keep https://irma-api.herokuapp.com/ up-to-date - that's the dataset that powers https://www.irmashelters.org/

We have avoided automatically adding shelters to irma-api because the data sources are often unreliable, and we feel it is better to have a human being manually check each entry, avoid duplicates, clean up the data and maybe use other sources to confirm that each shelter exists and is open.

BUT... there's no reason we can't have automated tools make suggestions to our human editors.

Here's the task: write code which pulls the current list of shelters from https://irma-api.herokuapp.com/api/v1/shelters and then pulls one of the scraped JSON files from this repo - then compares the two and tries to detect shelters that are missing from irma-api.

A useful starting point would be this file, which is scraped from http://www.floridadisaster.org/shelters/summary.aspx - the most recent JSON can be loaded from here: https://raw.githubusercontent.com/simonw/irma-scraped-data/master/florida-shelters.json

A useful script would do the following:

The way my irma-dupe-detection script works may be a useful inspiration: https://github.com/simonw/irma-scrapers/blob/59cf9906d1005972ab2e172ec494336b7b8b8434/irma_shelters.py#L174-L217
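A minimal sketch of the comparison described above, in Python. It is not the irma-dupe-detection logic linked above, just a plain name match after normalization. The field names `shelter` and `name`, the `{"shelters": [...]}` wrapper on the API response, and all of the function names are assumptions made for illustration; check them against the real JSON before relying on this.

```python
import json
import re
from urllib.request import urlopen

# URLs taken from this issue; they may no longer be live.
IRMA_API_URL = "https://irma-api.herokuapp.com/api/v1/shelters"
SCRAPED_URL = (
    "https://raw.githubusercontent.com/simonw/irma-scraped-data/"
    "master/florida-shelters.json"
)


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_shelters(payload):
    """Return a list of shelter dicts.

    Assumption: the payload is either a bare list or a dict that wraps
    the list under a "shelters" key.
    """
    if isinstance(payload, dict):
        return payload.get("shelters", [])
    return payload


def normalize(name):
    """Lowercase a name and collapse punctuation and whitespace runs."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", str(name or "").lower())
    return " ".join(cleaned.split())


def find_missing(api_shelters, scraped_shelters,
                 api_key="shelter", scraped_key="name"):
    """Return scraped shelters whose normalized name is not in irma-api.

    api_key and scraped_key are hypothetical field names.
    """
    known = {normalize(s.get(api_key)) for s in api_shelters}
    return [
        s for s in scraped_shelters
        if normalize(s.get(scraped_key)) not in known
    ]


def suggest_missing():
    """Fetch both datasets and return candidates for a human to review."""
    api_shelters = extract_shelters(fetch_json(IRMA_API_URL))
    scraped_shelters = extract_shelters(fetch_json(SCRAPED_URL))
    return find_missing(api_shelters, scraped_shelters)
```

Calling `suggest_missing()` returns the scraped entries with no name match in irma-api. These are only suggestions for the human editors: an exact match on normalized names will miss shelters listed under slightly different names, so comparing addresses or coordinates as well would likely cut down on false positives.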

simonw commented 7 years ago

The florida-shelters.json file generated from http://www.floridadisaster.org/shelters/summary.aspx is a useful starting point... but the really interesting data comes from FEMA. It's a lot larger and potentially harder to work with, but if you're looking for an ambitious, impactful task try doing the above but for these JSON files instead:

commadelimited commented 7 years ago

Nit, should that be florida-shelters.json @simonw?

simonw commented 7 years ago

Typo fixed :)

That "commit that referenced this issue" above is junk - one of the shelters mentioned in there happened to have #1 in its name. Unfortunately I don't think I can delete rogue commits-that-referenced-this-issue.

simonw commented 7 years ago

I have a weak preference for getting this solved in Python because that would make it trivial for me to hook it into the rest of the bot scraping infrastructure I have set up already. But getting this solved quickly is much more important, so feel free to use whatever technology you are most comfortable with.

mjturtora commented 7 years ago

Perfect, Python and I are buds.

simonw commented 7 years ago

I've built the first one of these, to compare the florida shelters site with the shelters listed on irma-api. Here's the implementation: https://github.com/simonw/irma-scrapers/commit/113b22183b9926bbd83ac842b874b70110c20c80