stat231-f20 / Blog-LetsGetFiscal

Repository for PUG Blog Project – Let's Get Fiscal
https://stat231-f20.github.io/Blog-LetsGetFiscal/

Update 1 #1

Open bfiume opened 4 years ago

bfiume commented 4 years ago

Group Name: Let’s Get Fiscal
Group Members: Pedro Morais, Braedon Fiume, Matt Adams
October 30, 2020

Update #1

General Idea: The COVID-19 pandemic has forced many businesses to shut down; the restaurants that have arguably fared best are fast-food chains, whose minimal in-person contact and quick service fit well within COVID restrictions. Given this, it would be interesting to examine the patterns of the most successful fast-food chains using spatial data and text analysis. With spatial data, we can show population densities and the number of fast-food locations already operating in an area. With text analysis, we can gauge recent consumer sentiment toward the restaurants in those specific locations. From there, we can narrow down a list of plausible and beneficial locations for new stores to open, creating sustainable employment opportunities and stable cash flows for cities. Some of the questions we hope to address are:

- Which 5 fast-food restaurants have the most establishments in the United States?
- Which geographical locations have the most or fewest restaurants per capita?
- In specific locations, what proportion of consumers feel positive or negative about the fast-food restaurants (via Twitter)?
- Which cities most need income and employment opportunities?

Data:

This project requires a significant amount of information, so a variety of data sources are needed. In our past project, our team ran into many complications with data scraping and infinite loading; this project, however, relies on datasets that are easily accessible through Kaggle and through R packages. For the spatial visualizations, there are three critical Kaggle datasets that can be downloaded manually as CSV files. The first contains a list of 10,000 fast-food restaurants with their locations as longitude and latitude. The second contains a list of 754 cities across the United States and their corresponding populations as of 2016. Finally, there is a dataset of U.S. household income statistics for cities across the country, allowing us to identify cities in need of income. For text analysis, we will use the R package “rtweet”, which lets us request up to 18,000 tweets from the past 9 days matching certain terms, like “#McDonalds” or “#Wendys”. Combining “rtweet” with the “get_sentiments” function, we can compute the proportions of positive and negative tweets for the top 5 fast-food restaurants, as well as for specific locations! Together, these datasets will provide the insight needed to recommend a geographical location for a chain's next restaurant.
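In plain terms, the proportion calculation works like this (a minimal sketch in Python rather than R, with a tiny hypothetical lexicon standing in for the real sentiment lexicons; the tweets and word lists are made up for illustration):

```python
# Sketch: classify tweets as positive/negative with a tiny lexicon and
# compute the proportion of each per chain. The lexicon and tweets here
# are illustrative stand-ins for rtweet / get_sentiments output.

POSITIVE = {"love", "great", "delicious", "fast"}
NEGATIVE = {"hate", "slow", "gross", "cold"}

def tweet_sentiment(text):
    """Net count of positive minus negative lexicon words in a tweet."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_proportions(tweets):
    """Return (share positive, share negative) over a list of tweets."""
    scores = [tweet_sentiment(t) for t in tweets]
    n = len(scores)
    pos = sum(s > 0 for s in scores) / n
    neg = sum(s < 0 for s in scores) / n
    return pos, neg

tweets = [
    "I love the delicious fries at #McDonalds",
    "service was slow and the food was cold",
    "grabbed a quick lunch today",
]
pos, neg = sentiment_proportions(tweets)
# one tweet is positive, one negative, one neutral: pos = neg = 1/3
```

The same proportions can then be grouped by chain or by location before mapping.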

Resources:

- Fast-Food Restaurant Dataset: https://www.kaggle.com/datafiniti/fast-food-restaurants?select=FastFoodRestaurants.csv
- City Population/Densities: https://www.kaggle.com/mmcgurr/us-city-population-densities
- City Household Income Statistics: https://www.kaggle.com/goldenoakresearch/us-household-income-stats-geo-locations?select=kaggle_income.csv
- R Package “rtweet” Explanation: https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/use-twitter-api-r/

Final Product: This project will incorporate spatial data and text analysis to provide consultation for the top 5 fast-food restaurants in the United States. We will display spatial visualizations of the number of restaurants each chain currently has across the country, alongside the population densities and wealth of U.S. cities. In addition, we hope to layer the Twitter text analysis onto the spatial data, showing which locations express the most positive and negative sentiment toward each chain. Examining all of this information and these visualizations, we hope to recommend a suitable location for each of the top 5 chains' next restaurant. Our project will take the form of a Shiny app, where the user can interact with the plots to examine the relationship between a specific restaurant, its locations, and consumer sentiment across the country. General population density and city wealth will be available on a single reference tab.

Schedule:

Week 1 (November 1):
- Tuesday, November 3: Have datasets downloaded and consolidated; have initial sentiment analysis data gathered.
- Thursday, November 5: Finalize tweet scraping and sentiment analysis data; begin initial spatial visualizations of fast-food chain distribution.

Week 2 (November 8):
- Tuesday, November 10: Create visualizations for sentiment analysis.
- Thursday, November 12: Create app for recommending the next suitable location for the top 5 fast-food chains.
- Saturday, November 14: Begin uploading graphics and formatting the website.

Week 3 (November 15):
- Monday, November 16: Finalize write-up for the website and prepare for the presentation.
- Tuesday, November 17: Present and blow everyone’s mind!

katcorr commented 4 years ago

What an exciting plan, team! Ambitious with lots of questions and data sources -- but good to see you've already identified specific data sources with the data needed. Appropriate incorporation of spatial data and text analysis in a creative way.

I'm really looking forward to your blog post!

Update 1: 10/10

pbmorais commented 4 years ago

Update 2: Status Update

We have achieved most of the goals we set out to finish by this week: initial visualizations of the distribution of fast-food restaurants, and finalized tweet scraping and sentiment analysis. We began by gathering data on city-level income and population, as well as on 10,000 fast-food chain locations across the US. Using this data, we created choropleths displaying how income, population, and fast-food locations vary by state. We also created a fast-food density measure that shows the concentration of restaurants in a state relative to its population.

On the tweet scraping and sentiment analysis side, we were able to scrape recent tweets that include the names of major fast-food restaurants. However, we were not able to finalize the sentiment analysis because many tweets have no location, have fake “joke” locations, or have locations in the wrong format, which has hindered our progress. We plan to work over the weekend to catch up and have an initial per-state sentiment analysis for fast-food restaurants working.
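One simple way to express such a density measure (a Python sketch with made-up state figures, not our actual data or code) is restaurant locations per 100,000 residents:

```python
# Sketch: fast-food density as restaurant locations per 100,000 residents.
# The counts and populations below are illustrative placeholders.

restaurant_counts = {"MA": 820, "TX": 4100, "WY": 95}
populations = {"MA": 6_900_000, "TX": 29_000_000, "WY": 580_000}

def density_per_100k(counts, pops):
    """Restaurants per 100k residents, keyed by state."""
    return {s: counts[s] / pops[s] * 100_000 for s in counts}

density = density_per_100k(restaurant_counts, populations)
# with these placeholder numbers, sparsely populated Wyoming ends up
# with a higher density than Texas despite far fewer restaurants
```

Normalizing by population this way keeps large states from dominating the choropleth simply because they have more people.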

Next steps:
Apart from finalizing the sentiment analysis described above, we want to move some of our data and visualizations from the state level to the county level. Furthermore, we realized that covering the top 5 fast-food chains was too ambitious, so we plan to limit our analysis to the top 3. By integrating the sentiment analysis with the county-level data, we will be able to make visualizations that let us recommend locations for new fast-food restaurants.
pbmorais commented 4 years ago

Whoops, not sure why the next steps turned out like that. Here they are: Apart from finalizing the sentiment analysis described above, we want to move some of our data and visualizations from the state level to the county level. Furthermore, we realized that covering the top 5 fast-food chains was too ambitious, so we plan to limit our analysis to the top 3. By integrating the sentiment analysis with the county-level data, we will be able to make visualizations that let us recommend locations for new fast-food restaurants.

katcorr commented 3 years ago

Thorough update, and good plan moving forward.

Update 2: 5/5

mattadams23 commented 3 years ago

Update 3: Status Update

We had a little hiccup today! While we believed our dataset contained the locations and information of 10,000 fast-food restaurants across the US, we discovered that the dataset we downloaded is only a sample, containing too little information to be representative of the fast-food industry; the full dataset is unavailable, as it costs around $1,000. Realizing this, we took class time today to pivot and, luckily, found a dataset of all Starbucks locations around the world, including the US. Fortunately, we can keep the main structure of our project, but the pivot was a challenge that took time.

After adjusting to Starbucks, we created a thorough outline to reaffirm our goals and guide the visualizations. As of now, we have a map of all Starbucks locations around the world, a map of all Starbucks in the US, and a map indicating the concentration of Starbucks in certain locations. In addition, we finalized our Twitter sentiment data and displayed it on a map. This will be extremely helpful when transitioning to the county level, which is our next goal.

We also decided that we should have population statistics at the county level rather than the city level, so we found a new dataset with the population of each US county. Then we decided there should be another factor, in addition to income per capita and population density, for deciding where the next Starbucks should go; we agreed on using unemployment statistics to find which county would benefit most from a Starbucks. In summary, we have pivoted to finding the next Starbucks location only, have figured out Twitter sentiment analysis, and have added two new datasets (county unemployment and county population) for greater accuracy in our findings.
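As a rough sketch of how these factors could combine into a recommendation (Python, with illustrative county data and a hypothetical equal-ish weighting, not our final method):

```python
# Sketch: rank counties by "need" using unemployment (higher = more need)
# and income per capita (lower = more need), lightly penalizing counties
# already saturated with stores. All figures below are illustrative.

counties = [
    {"name": "County A", "unemployment": 0.09, "income_pc": 24_000, "starbucks": 1},
    {"name": "County B", "unemployment": 0.04, "income_pc": 55_000, "starbucks": 12},
    {"name": "County C", "unemployment": 0.07, "income_pc": 31_000, "starbucks": 0},
]

def need_score(c, max_income=60_000):
    """Crude composite score: unemployment plus income shortfall,
    minus a small penalty per existing store."""
    return c["unemployment"] + (1 - c["income_pc"] / max_income) - 0.01 * c["starbucks"]

ranked = sorted(counties, key=need_score, reverse=True)
# highest-need county first; the saturated, wealthy county ranks last
```

A real version would use standardized values rather than this ad-hoc `max_income` scale, but the ranking idea is the same.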

katcorr commented 3 years ago

@pbmorais @mattadams23 @bfiume

Oh my! So the analyses and visualizations you created earlier (mentioned in update 2), you weren't able to use? Or were you able to adjust that code slightly to update to the Starbucks analyses? In any case, it sounds like you are back on a (new) track . . . and it is still an interesting and exciting analysis! Good work with an efficient pivot.

To clarify about the additional county-level variables you're collecting, are you planning to use a k-means clustering approach to cluster counties by income per capita, population density, unemployment, and number of Starbucks currently located in a county? Or how are you planning to identify where a new Starbucks should go? Completely fine if you're not using clustering, I'm just wondering . . .
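For instance, a k-means pass over standardized county features might look like this (a pure-Python toy sketch, just to illustrate the idea; real analyses would use an existing implementation):

```python
# Sketch of the k-means idea: cluster counties on standardized features
# (income per capita, population density, unemployment, store count).
# The data and k below are toy values for illustration only.
import random

def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]

def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd's algorithm: assign to nearest center, recompute means."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[i].append(p)
        centers = [[sum(col) / len(col) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return [min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            for p in points]

# toy counties: [income_pc, pop_density, unemployment, starbucks_count]
data = standardize([
    [24_000, 120, 0.09, 1],
    [25_000, 140, 0.08, 0],
    [55_000, 900, 0.04, 12],
    [52_000, 850, 0.05, 10],
])
labels = kmeans(data, k=2)
# the two low-income counties land in one cluster, the two high-income in the other
```

Counties in the "high need, few stores" cluster would then be natural candidates for a new location.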

Update 3: 5/5