open-austin / project-ideas

:bulb: A place to collect ideas for Open Austin projects

Analyze restaurant inspection data #24

Closed: tyarkoni closed this issue 5 years ago

tyarkoni commented 9 years ago

Last year, I did some analyses of the restaurant inspection scores released by the city:

https://github.com/tyarkoni/open-data-flights/blob/master/health-inspection/flight1.ipynb

Aside from updating to reflect scores from the last year, there are a bunch of other potentially interesting analyses that could be done with the data. I'm looking for collaborators interested in expanding on some of these analyses. It would also be nice if someone from the city could weigh in on the viability of obtaining other variables (e.g., deidentified codes for individual inspectors).

luqmaan commented 9 years ago

Maybe we can present the analysis as part of https://github.com/open-austin/atx-restaurant-scores

luqmaan commented 9 years ago

Also the analysis in your notebook is amazing! I wish I had seen this before.

tyarkoni commented 9 years ago

Thanks!

I think it would be pretty easy to export restaurant-level data in a form that can be easily served by the atx-restaurant-scores portal--though I'm not sure if anything in my analyses is useful in that respect...

luqmaan commented 9 years ago

https://github.com/Chicago/food-inspections-evaluation/blob/master/README.md

mateoclarke commented 8 years ago

Wondering if this type of analysis could be extended to food trucks...

This article goes into some new regulations for food truck vendors: http://www.mystatesman.com/news/news/local/new-regulations-taking-effect-for-food-trucks/npdMF/

Without spending any time looking, wondering if there is data available...

tyarkoni commented 8 years ago

Sure, all of the same analyses could just as easily be applied to food trucks (except maybe any spatial analysis, since trucks tend to move around). It does seem a bit weird that the food truck inspection data isn't released in the same file. Maybe they're worried about the 20% rate of inspection failure for food trucks contaminating the overall estimate. Doesn't explain why those data haven't been released separately, though.

mateoclarke commented 8 years ago

Could also be that, for some esoteric reason, food truck inspection data is collected by Travis County Health & Human Services instead of CoA? Food trucks aren't tied to a fixed location (like you said), so having the inspection at the county level could make sense. We'd need to do more research to figure out where that data lives (CoA vs TravCo) and then see whether it could be merged into the existing source data.

luqmaan commented 8 years ago

https://github.com/Chicago/food-inspections-evaluation

tyarkoni commented 8 years ago

Neat! I wanted to build a predictive model of inspection scores when I was working on my notebook, but for Austin, the data aren't really there to support it. The city doesn't release the nature of the violations that led to each score, which would be a huge factor. And as far as I know, DSHS doesn't make any data on food licenses publicly available (though I imagine one could obtain it via a request).

nvergos commented 8 years ago

@tyarkoni I created a series of predictive models (mostly classifiers) of inspection scores with varying success (none higher than 70%) as a capstone project for a data science bootcamp I took over the past 3 months. I used the most recent version of the dataset, spanning 3 years of scores ending December 2015. I am in the process of polishing my notebooks before pushing them to GitHub. Could I also use your exploratory analysis notebook to complement the project, with proper citation, of course?

tyarkoni commented 8 years ago

@nvergos, of course, please use the code for anything you like. I realize I forgot to add an explicit license to the project, so I just did that (MIT license).

Out of curiosity, what outcome were you trying to predict? And were you using within-restaurant features in the prediction (i.e., predicting a restaurant's future score in part based on knowledge of past scores), or in a completely out-of-sample way (i.e., you assume you only know the name of the restaurant, its address, maybe the kind of cuisine, etc.)? The latter seems more challenging given the nature of the present data, but if you can hit 70% for a meaningful outcome with these data (assuming reasonable class balance), that actually sounds better than I would have intuitively expected.
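To make the two evaluation setups in this question concrete, here is a minimal sketch of how the data could be split in each case. The records, restaurant IDs, and function names are hypothetical and do not come from the city's dataset or from anyone's notebook.

```python
from collections import defaultdict

# Hypothetical records: (restaurant_id, inspection_date, score).
records = [
    ("A", "2014-01-10", 95), ("A", "2014-07-02", 91), ("A", "2015-01-15", 93),
    ("B", "2014-02-20", 82), ("B", "2014-08-11", 88),
    ("C", "2014-03-05", 100), ("C", "2015-03-01", 97),
]

def within_restaurant_split(records):
    """Train on each restaurant's earlier inspections, test on its latest one."""
    by_restaurant = defaultdict(list)
    for rec in records:
        by_restaurant[rec[0]].append(rec)
    train, test = [], []
    for recs in by_restaurant.values():
        recs.sort(key=lambda r: r[1])  # ISO dates sort chronologically as strings
        train.extend(recs[:-1])
        test.append(recs[-1])
    return train, test

def out_of_sample_split(records, held_out_ids):
    """Hold out whole restaurants so the model never sees their score history."""
    train = [r for r in records if r[0] not in held_out_ids]
    test = [r for r in records if r[0] in held_out_ids]
    return train, test

within_train, within_test = within_restaurant_split(records)
oos_train, oos_test = out_of_sample_split(records, held_out_ids={"C"})
```

In the first split every test restaurant also appears in the training data, so past scores can be used as features. In the second, the model has only static attributes (name, address, and so on) for the held-out restaurants.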

nvergos commented 8 years ago

I started by trying to predict whether a restaurant would pass or fail inspection given its location. Unfortunately, there is a HUGE class imbalance problem: only 1% of the dataset rows in my analysis correspond to failing restaurants. So I reformulated my classifiers to decide whether a restaurant will achieve a "pristine" score (higher than 90) or not. I will be pushing the rest of my notebooks to GitHub within the next couple of days. The most interesting part (for me) is the Naive Bayes classifier on restaurant names and streets. It is rather disappointing that this dataset is fairly poor compared to the data sets other cities have posted.
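A small sketch of why the imbalance matters and what relabeling changes. The scores below are made up to mimic a 1% failure rate, and the failing cutoff of 70 is an assumption for illustration. Only the "higher than 90" threshold comes from the comment above.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

# Hypothetical scores: 99 passing inspections and 1 failing one.
scores = [95] * 60 + [85] * 39 + [65]

FAIL_BELOW = 70      # assumed failing cutoff, for illustration only
PRISTINE_ABOVE = 90  # "pristine" threshold described in the comment

failed = [s < FAIL_BELOW for s in scores]
pristine = [s > PRISTINE_ABOVE for s in scores]

# Always predicting "pass" is already 99% accurate, so accuracy says little.
fail_baseline = majority_baseline_accuracy(failed)

# The pristine label is far more balanced, so the baseline is easier to beat
# in a meaningful way.
pristine_baseline = majority_baseline_accuracy(pristine)

print(f"pass/fail baseline: {fail_baseline:.2f}")
print(f"pristine baseline:  {pristine_baseline:.2f}")
```

Any reported accuracy is best read against the matching baseline: 70% on a label with a 60% majority class means something quite different from 70% on a label with a 99% majority class.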

luqmaan commented 8 years ago

Let's ask for better data. https://github.com/open-austin/liberate-the-data

luqmaan commented 8 years ago

@nvergos @tyarkoni Do you guys want to pitch/share your ideas/what you're working on at the next hack night? This would be Monday March 1, and would be just 2-3 minutes of you talking. http://www.meetup.com/Open-Austin/events/228274917/

tyarkoni commented 8 years ago

Sounds good, but I don't think I'll be able to make it next week. Maybe the one after!

luqmaan commented 8 years ago

Tuesday April 1?

nvergos commented 8 years ago

@tyarkoni @luqmaan I will be happy to present. Both March 1st and April 1st work for me, and I'm sure I will be done with my analysis by then.

rhoadescw commented 8 years ago

I’m interested in translating and presenting data in a way that is valuable to consumers. Texas’ method of health inspection scoring is a step toward making data consumable, but as most cities and counties that publish scored health inspections say, a single score is not indicative of an establishment’s dedication to a healthy environment. I’ve worked on a project that pulls health inspection scores and visually presents an establishment’s score trend. The intent is, in part, to discover ways of making public data genuinely consumable and valuable to consumers. In this case, that means not just consumers of restaurants, but of hospitals, schools, and any place that includes a food service.

If you are interested, the work is at http://goodburp.com. For example, you can search for establishments that have the word ‘buffet’ in their name in Austin, TX. And yes, buffets are pretty bad (try ‘oyster’ too).

I’ve learned several lessons about consuming public data. Is this project still active, or have you all had additional thoughts on what to do next?

tyarkoni commented 8 years ago

I can't speak for other people, but this was just a small side project for me; I don't have any plans to do anything else with these data. That said, I would be apprehensive about attempting to quantify trends in the data at the single restaurant level absent some sense of the reliability of those trends. I think one runs a serious risk of detecting patterns in noise, in the sense that, e.g., a series of ratings that goes 97 --> 92 --> 94 --> 89 would be fit well by a decreasing linear trend, but is almost certainly within any reasonable margin of error. For example, your website lists an establishment like this as "good but slipping", and it seems odd to me to say that going from 100 to 99 is slipping--it's very unlikely that a change like that is at all a reliable indicator of any change in the restaurant's actual sanitation standards. To support that claim in a responsible way, I think one would need some meaningful external criterion to predict from these scores (e.g., how often people get sick from eating at an establishment). But I haven't seen any such data emerge so far (at least for Austin).
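The concern about fitting trends to noise can be checked with a rough simulation. The sketch below fits a least-squares line to the 97, 92, 94, 89 series, then asks how often a restaurant whose true score never changes would show a slope at least that steep. The noise level (`NOISE_SD`) and true score are hypothetical; no estimate of inspection-to-inspection measurement error is available in this thread, and the answer depends entirely on that assumption.

```python
import random

def ols_slope(ys):
    """Least-squares slope of ys against inspection index 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx

observed = [97, 92, 94, 89]
observed_slope = ols_slope(observed)  # -2.2 points per inspection

# ASSUMPTIONS (hypothetical): a constant true score and Gaussian noise.
TRUE_SCORE = 93.0
NOISE_SD = 3.0
rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_flat_slopes(n_sims, n_inspections):
    """Slopes fitted to series from a restaurant whose true score is constant."""
    return [
        ols_slope([rng.gauss(TRUE_SCORE, NOISE_SD) for _ in range(n_inspections)])
        for _ in range(n_sims)
    ]

slopes = simulate_flat_slopes(10_000, len(observed))
share_as_steep = sum(abs(s) >= abs(observed_slope) for s in slopes) / len(slopes)

print(f"observed slope: {observed_slope:.1f}")
print(f"share of flat restaurants with a slope at least as steep: {share_as_steep:.3f}")
```

Rerunning with different values of `NOISE_SD` shows how sensitive any "slipping" label is to the unknown reliability of the scores, which is the external information this comment says is missing.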

TheSecMaven commented 7 years ago

I am new to the group and just recently joined the Slack. A group I was in at the University of Louisville developed this project, based on the Chicago health inspection project linked above, during derbyhacks2. We had relatively good results, and the project is slated to be used by the city of Louisville in the future. https://github.com/PilgrimShadow/DerbyHacks17

Would love to help or take a lead role in this project where possible.

werdnanoslen commented 7 years ago

Hi @mkkeffeler, do you think @chip-rosenthal's https://github.com/open-austin/atx-restaurant-scores is something you'd like to contribute to, or would you go in a different direction?

TheSecMaven commented 7 years ago

@werdnanoslen It looks like @chip-rosenthal's project simply allows basic querying of the available health inspection score data. In my project, we combined that data with a number of other features (such as 311 call data) to attempt to predict future health inspection scores.

The idea was that if we could predict these scores somewhat reliably, as was proven possible in the Chicago project (https://github.com/Chicago/food-inspections-evaluation), we could give that list to city health inspectors for them to look into, versus the current strategy of randomly picking restaurants to drop in on.

The goal, obviously, was to keep foodborne illnesses from becoming a problem. Predicting a possible health inspection failure lets inspectors get to the restaurant, sometimes weeks before such a thing happens. We could also potentially inform restaurant owners in advance that we believe they are at risk for foodborne illnesses, so they could take extra preventative measures, resulting in improved conditions for local residents.
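As a rough illustration of the prioritization step only (not the Louisville or Chicago code), a model's predicted failure probabilities could be turned into an inspection queue like this. The restaurant names, probabilities, and the `inspection_queue` helper are all hypothetical.

```python
# Hypothetical model output: (restaurant, predicted probability of failing).
predictions = [
    ("Taqueria A", 0.04),
    ("Diner B", 0.31),
    ("Buffet C", 0.62),
    ("Cafe D", 0.08),
    ("Grill E", 0.19),
]

def inspection_queue(predictions, capacity):
    """Return the `capacity` restaurants with the highest predicted risk."""
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:capacity]]

# With capacity for two visits, the highest-risk restaurants come first.
queue = inspection_queue(predictions, capacity=2)
print(queue)
```

How much this helps over the current selection strategy depends on how reliable the predicted probabilities are, which is the open question for Austin's data raised earlier in the thread.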

werdnanoslen commented 7 years ago

Cool, @mkkeffeler could you please post this into a new project idea? That'd make it easier to help you find others who could work on this.

mscarey commented 5 years ago

Nice .ipynb.