jasonasher / dc_doh_hackathon

Repository for the DC DOH Hackathon on September 23rd, 2017

Identify Duplicates and Erroneous Restaurant Inspection Reports #12

Open jasonasher opened 6 years ago

jasonasher commented 6 years ago

Start with the DC DOH Food Service Establishment Inspection report data in the /Data Sets/Restaurant Inspections/ folder in Dropbox.

This data was scraped using scripts that can be found on GitHub. This effort built upon previous scraping and processing work, including this and this.

To be complete, the scripts scraped every accessible file, but a visual inspection of the data indicates that some of these may not be valid reports, and that there are duplicates. For example, inspections with ids from 814384 to 819785 all seem to be duplicates of each other, though the first and last entries in that range have more data than the intermediate rows.
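A quick way to check a suspected duplicate block like this is to pull the id range and see which fields actually differ. The sketch below uses pandas; the filename and the inspection_id column name are hypothetical placeholders for whatever the file in the Dropbox folder actually uses.

```python
import pandas as pd

# Hypothetical filename and id column name -- substitute the actual file in
# /Data Sets/Restaurant Inspections/ and its real schema.
df = pd.read_csv("restaurant_inspections.csv")

# Grab the suspect id range and drop columns that are empty for every row in it.
block = df[df["inspection_id"].between(814384, 819785)]
block = block.dropna(axis=1, how="all")

# Columns with more than one unique value genuinely differ across the "duplicates";
# the rest are identical (or missing) throughout the range.
print(block.nunique(dropna=True))
```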

Some of these inspections have been actively linked from the main site here. These are presumed to be valid and are marked by a TRUE entry in the known_valid column. In the example above, 814384 is 'known valid' by this presumption, and the rest are not. However, we believe that many other rows are valid as well, and we would like to include them.

Goal
Determine which rows in this dataset should be included in or excluded from the set, preferably via an automated script. Duplicates should be identified and removed, with selections made to preserve as much data as possible. You may be able to find people from DOH at the hackathon to help answer questions about the data. Additionally, you can see an example report here.
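One possible automated approach, sketched below with pandas (not an endorsed method): group candidate duplicates, then keep the row that carries the most information, breaking ties in favor of known_valid. Apart from known_valid, the column names are guesses at the schema and would need to be adjusted to the actual file.

```python
import pandas as pd

def deduplicate(df: pd.DataFrame, key_cols: list) -> pd.DataFrame:
    """Keep one row per duplicate group, preferring known_valid rows and
    then the row with the most non-null fields."""
    df = df.copy()
    df["_filled"] = df.notna().sum(axis=1)  # how much data each row carries
    # Sort so the preferred row comes first within each group.
    df = df.sort_values(["known_valid", "_filled"], ascending=[False, False])
    deduped = df.drop_duplicates(subset=key_cols, keep="first")
    return deduped.drop(columns="_filled").sort_index()

# Usage -- the key columns here are only a guess at what identifies a distinct inspection:
# deduped = deduplicate(df, ["establishment_name", "address", "inspection_date"])
```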

When you are finished
Submit a pull request on GitHub (or upload your scripts)
Upload any files to Dropbox

Need more information?
Flag Mohammed, Astrid, or Jason, or ask your question in the comments below, and we'll respond as soon as we can!

brycecf commented 6 years ago

I'll start working on this if anyone wants to help out.

brycecf commented 6 years ago

@jasonasher Looking at the data set, I noticed that the data quality varies noticeably in some instances (for example, various businesses with just an apartment number, or just D.C., as the address). My impression is that we don't have enough information about how validity is assessed to be able to accurately evaluate the known_valid field and reclassify rows. The case you provided seems to be an example of that: 814384 looks to be the only one marked as valid (and it contains additional values like the inspector's name).

Can I narrow the scope to deduping rows that are known to be valid?