maxblee / golf-distance-analysis

This is a repository for analyzing changes in golf shots over time, for MSE 125 at Stanford
MIT License

Acquire Data #2

Closed maxblee closed 4 years ago

maxblee commented 4 years ago

Before we perform any analysis, we need to acquire data for our project. That may mean finding one or more sources with good golf data (ideally at the level of individual tournaments), or it may require scraping. Essentially, there are two parts to it:

Here's the data we need (as I understand it; please correct me if I'm wrong):

Agfritz commented 4 years ago

A few comments/questions:

  1. Is it better to use Data Golf's True Strokes Gained (regular strokes gained, adjusted for field strength), or to make our own field-strength adjustments?

  2. What are the most important covariates?

    • Would weather be a consideration?
    • Do we need to collect data on course length or other course characteristics as a covariate?
  3. I found a useful article on scraping data here. Instead of scraping the data directly in our code, it recommends downloading the HTML files. I am working on downloading the 2019 HTML files from PGA for each SG category and the money list for each tournament and storing them in the raw data folder. I'll start on code to scrape them after that.
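The download-first approach in item 3 can be sketched as follows. This is a minimal illustration, not the project's actual scraper: the function names, the User-Agent string, and the file layout are all assumptions. Each page is written to the raw data folder once, and later runs parse the saved file instead of hitting the site again.

```python
from pathlib import Path
from typing import Callable
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Download one page and return its HTML as text."""
    # The User-Agent value is a made-up placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "golf-distance-analysis (class project)"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def download_once(url: str, dest: Path, fetch: Callable[[str], str] = fetch_html) -> Path:
    """Save the page at `url` to `dest`, skipping the request if `dest` already exists."""
    if dest.exists():
        return dest  # already downloaded; do not request the page again
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(fetch(url), encoding="utf-8")
    return dest
```

Passing the fetch function in as a parameter keeps the caching logic testable without network access, and parsing code can then be rerun as often as needed against the files on disk.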

Agfritz commented 4 years ago

Option for bulk scraping by tournament: https://github.com/zachwill/golf/blob/master/pga.py

Another resource on scraping PGA with Beautiful Soup: https://brianchesley.wordpress.com/2014/11/18/dissecting-the-tiger-woods-effect-with-beautiful-soup-and-pandas-pt-1/

maxblee commented 4 years ago

I added the Python file to let us acquire the data. Could someone help me identify what stats (preferably by sending me links) we need to get + what years we need to get them for / can get them for?

Agfritz commented 4 years ago

> I added the Python file to let us acquire the data. Could someone help me identify what stats (preferably by sending me links) we need to get + what years we need to get them for / can get them for?

Thanks, Max! I think for our first pass we need the 2004-2019 official money list, SG: Putting, and SG: Off-the-Tee.
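One way to make that first-pass scope explicit is to enumerate every (year, stat) pair up front. The sketch below assumes hypothetical stat keys and a hypothetical `data/raw/<year>/<stat>.html` layout; the real scraper may organize files differently, and tournament-level pages would add a third dimension.

```python
from itertools import product
from pathlib import Path

# Hypothetical short keys for the three first-pass stats; not taken from the scraper.
STATS = ("money_list", "sg_putting", "sg_off_the_tee")
YEARS = range(2004, 2020)  # 2004 through 2019 inclusive


def planned_files(raw_dir: Path = Path("data/raw")) -> list[Path]:
    """Return one target path per (year, stat) pair, e.g. data/raw/2019/sg_putting.html."""
    return [raw_dir / str(year) / f"{stat}.html" for year, stat in product(YEARS, STATS)]
```

With 16 seasons and 3 stats, this yields 48 files to fetch, which gives a quick check that nothing was missed.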

Would it be helpful for me to make a table mapping all the tournament names to IDs? It looks like that's an input the script needs.

maxblee commented 4 years ago

No, the scraper already collects that information. I can adapt it slightly so it stores that information, too, if we wind up needing it, e.g. to join data. But thanks!
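If the name-to-ID information does end up being stored for joins, a small CSV lookup would probably be enough. The sketch below is only an assumption about how that could look: the column names `tournament_id` and `tournament_name` and all three function names are hypothetical, not taken from the scraper.

```python
import csv
from pathlib import Path


def save_tournament_lookup(tournaments: dict[str, str], path: Path) -> None:
    """Write a tournament_id -> tournament_name mapping to a two-column CSV."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tournament_id", "tournament_name"])
        writer.writerows(sorted(tournaments.items()))


def load_tournament_lookup(path: Path) -> dict[str, str]:
    """Read the CSV back into a tournament_id -> tournament_name dict."""
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["tournament_id"]: row["tournament_name"] for row in csv.DictReader(handle)}


def add_tournament_names(rows: list[dict], lookup: dict[str, str]) -> list[dict]:
    """Left join: every stat row is kept; unknown IDs get a name of None."""
    return [{**row, "tournament_name": lookup.get(row["tournament_id"])} for row in rows]
```

Joining on the ID rather than the name avoids mismatches if a tournament's name is spelled differently across pages.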