Define Dataset: Restaurant Inspections (Health)

emily878 commented 9 years ago

Define the essential substantive elements of the core Restaurant Inspections dataset. What are the components that it must minimally include? Do we have a dataset that we could hold up as a model?

waldoj commented 9 years ago

As a note to our future selves, we're going to want to look at HDScores' data census to start figuring out the state of things, since @MatthewEierman and team have already done the hard work on this.

MatthewEierman commented 9 years ago

Thank you Waldo, for your kind words on our hard work.

Please let me know how I or my team can help, or provide feedback based on our 3+ year experience in this space. For those who don't know me I'm also a trained chef, Johnson & Wales Alumni with almost a decade of restaurant experience. So I understand both food and data sides of this discussion. I've attached dropbox links to show our data architecture... Which will be available via our open api in Q2 2015.

https://www.dropbox.com/s/iw2jhdyc71s20pi/Establishment-Sample.txt?dl=0 https://www.dropbox.com/s/iyksvu5ueggtrm9/Inspection-Sample.txt?dl=0 https://www.dropbox.com/s/hz13bpwv2ref38i/Violation-Sample.txt?dl=0

This is a sample of what the data architecture looks like.

The first file is the Establishments entity. The columns of importance are the HDScoresRankingPercent which is what gives the score. The Establishment are the “post” geocoded data with the verified addresses, etc…

The second file is the Inspections are the source data and can be tied to the establishment by Inspection.establishment = Establishment.id.

The third file is the Violations are the violations per inspection, again Violation.inspection = Inspection.id

//

In my opinion, The issue at hand with inspection data is one of data variety and veracity (volume with the historical backlog). There are a few versions of federal food code, then state, local and municipal food code variants. Each Jurisdiction posts the inspection data in their own unique data format (paper or digital), technology, data policy/rights and language issues (on international scale).

The goal of jurisdictions should be to release data in a consistent format consistently, have clear data rights/policies/access. The data formats/technologies may vary, the information really doesn't. There are only 6-8 common data formats with variants, we have seen around the US so far.

The biggest issues is the nearly 1/3 of jurisdictions that post in paper only. Getting the data into a digital, and preferably machine readable format is the challenge. As paper has a higher costs, in storage on the square foot, electric, and analytics/processing costs.

Open visualizations (mobile/embed-able web apps) should be free to consumers and any jurisdictions that releases data. News/Media organizations and Tech Companies should be presenting this data with in those mobile/embed-able apps or api's on their own interface. Analytics is the goldmine, not the visualization... once all the data is in a single unified data set.

In my opinion, Yelp's LIVES standard was good start, but in my opinion CSV is a great output format but not a platform. Henry Ford said "If I had asked my customers what they wanted they would have said a faster horse". The LIVES Standard is a faster horse rather then a new innovation. Faster Horses are wasteful, while similar to the existing standard. Like real horses, Faster horses will stand idle three quarters of the time, waiting for direct orders and supervision to work, eating up lots of time and resources. Health department, Yelp and others got together and created what became the LIVES standard. Forcing government agencies, or code for america, or others to do all the work, then to give up their data standards to publish in the set LIVES standard. These are a few reasons for their slow adoption across jurisdictions.

Those networks have the audience of users (YELP, TripAdvisor, Grubhub, Opentable, Google/Bing (Maps), Factual, Foursquare, Urbanspoon-(Zomato), Mapquest), and should access the data via an API... To increase the value to their users not creating the standard... Developed by Experts who have worked on the hard problems of data aggregation, standardization, normalization, enhancing, scoring, storing and distributing.

Not a sales pitch just a quick backstory to what we have done.

HDScores has built a Data Aggregation, Management & Distribution Platform that takes diverse datasets with a common corpus, regardless of technology, language, data format (paper, images to any web technology) or variety issues on a international scale. HDScores makes data consumable & accessible for everyone! (via apps, apis, or interfaces). HDScores use inspection data to drive a fact based insights verse the currently used questionable consumer reviews. We have access to 15 countries inspection data. Our goal is a consumable format for all (easy to read, search and understand).

Currently HDScores has indexed/searchable in our database: 532,000+ Establishments of roughly 1,500,000+ Establishments in the US. 3,000,000+ Inspection Reports 9,000,000+ Line Item Violations Available in an iOS, Android (Currently) Web app and Open API (restful) in Q2 2015.

Matthew Eierman Matthew@HDScores.com

waldoj commented 9 years ago

Wow, @MatthewEierman, that's a heck of a detailed response. :+1: Thank you! Do you think LIVES' spec represents a good minimum viable data product? Your data is clearly a great deal more rich, but in trying to identify an MVP, seems like there's some sense to shooting lower.

MatthewEierman commented 9 years ago

@waldoj, I never shoot for mediocre... Per HDScores conversations with Yelp LIVES Product Manager in January. They don't have a lot of interest in expanding the program quickly. They are committed to LIVES but besides bring in Socrata Data, which is publicly available. They haven't done much to grow it. We offered to give our all data for free to YELP if it links back to us for detailed data. Their was little interest and wonder on how we acquired so much data. We shared the same documents with them.

As a MVP, (A real weak - Maybe) If your trying to use the basic data architecture, you will run into problems with a large number of jurisdictions, as you are missing to many rows. Which is limiting a potential nationwide program. LIVES just hasn't gone through enough jurisdictions. If you are attempting to use LIVES inspection data to grow the platform (No). As government jurisdictions update and pushes the data to them, the data is often outdated by one inspection and especially adding the re-inspections data in a timely manner. Data freshness, and accuracy is important in this data set.

waldoj commented 9 years ago

Well, we're just looking to ID the fields that need to be present, to have some minimum level of data published at which we can say "yes, that is, in fact, adequate to count as publishing restaurant inspection data." So it's not about the LIVES spec—I'm just wondering if we can use the fields within LIVES as a good baseline, if they include the bare minimum of data necessary to qualify as a government publishing restaurant inspection data.

MatthewEierman commented 9 years ago

Yes If you are just using the fields to justify the minimum level of data that would be a decent baseline for restaurant inspection data. The restaurant id would have to be a specific unique id to avoid id conflicts between jurisdictions...

waldoj commented 9 years ago

That's really great to know—thank you, @MatthewEierman!

waldoj commented 9 years ago

LIVES is so flexible that their minimum spec doesn't actually include any violations, or even the outcome of inspections. We really have to build on their minimum to add that as a field. So that leaves us with:

business ID
business name
business address
inspection date
inspection outcome (score, rating, action, etc.)

sunlightpolicy / State-Open-Data-Census

Define Dataset: Restaurant Inspections (Health) #29