jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License
1.26k stars, 333 forks

Created a new function to retrieve box scores from baseball reference… #241

Open demilio76 opened 2 years ago

demilio76 commented 2 years ago

…. Quick example:

    from datetime import datetime

    # box_score is the function added in this PR
    datetime_object = datetime.strptime('May 05 2021', '%b %d %Y')
    visitor_batting_df, home_batting_df, visitor_pitching_df, home_pitching_df \
        = box_score('OAK', datetime_object, 0)
    print(f"{visitor_pitching_df.loc[0, 'Pitching']} vs {home_pitching_df.loc[0, 'Pitching']}")

demilio76 commented 2 years ago

I had to parse through HTML comments to get the data I wanted because of how bbref sets up its box score pages. Alternatively, I have a version that uses Selenium and a ChromeDriver, which works a little cleaner (the tables aren't in comments after the page loads), but for now I am submitting this version to avoid a new dependency.
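
To illustrate the comment-parsing approach described above, here is a minimal sketch using only the standard library. It is not the code from this PR: the names `CommentCollector` and `commented_tables` are hypothetical, and the sample markup is a made-up stand-in rather than real Baseball Reference HTML.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called once per <!-- ... --> block with the text between the markers.
        self.comments.append(data)


def commented_tables(page_html):
    """Return the comment bodies that contain a <table> element."""
    collector = CommentCollector()
    collector.feed(page_html)
    collector.close()
    return [c.strip() for c in collector.comments if "<table" in c]


# Hypothetical stand-in for a box score page (assumption, not real bbref markup).
SAMPLE_PAGE = """
<div id="all_batting">
<!--
<table id="batting"><tr><th>Batting</th></tr><tr><td>Player A</td></tr></table>
-->
</div>
<!-- an unrelated comment -->
"""

tables = commented_tables(SAMPLE_PAGE)
print(len(tables))
```

Each returned string is ordinary table markup, so it could then be handed to an HTML table parser such as `pandas.read_html`.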

schorrm commented 2 years ago

@TheCleric @bdilday if either of you can take a look?

bdilday commented 2 years ago

I think I'd rather have the cleaner Selenium version. It doesn't seem like a crazy dependency for a library whose job is largely to scrape the web.

@schorrm @TheCleric any thoughts?

TheCleric commented 2 years ago

@bdilday I'm not a fan of Selenium for this, since it doesn't need anything like JavaScript. As it is, we've done similar things with just the XPath parser, which can be used to parse into HTML comments.

EDIT: I found another PR where I provided some example code for something similar: https://github.com/jldbc/pybaseball/pull/137#discussion_r496769328
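
For readers who don't want to follow the link, here is a rough sketch of the XPath idea. It is not the code from that discussion: it assumes `lxml` is available, the helper name `tables_in_comments` is hypothetical, and the sample page is invented for illustration.

```python
import lxml.html


def tables_in_comments(page_html):
    """Find <table> elements that are hidden inside HTML comments."""
    doc = lxml.html.fromstring(page_html)
    tables = []
    # The XPath node test comment() selects comment nodes anywhere in the tree.
    for comment in doc.xpath("//comment()"):
        if comment.text is None or "<table" not in comment.text:
            continue
        # Re-parse the comment body as HTML, then pull the tables out of it.
        fragment = lxml.html.fromstring(comment.text)
        tables.extend(fragment.xpath("descendant-or-self::table"))
    return tables


# Hypothetical stand-in for a box score page (assumption, not real bbref markup).
SAMPLE_PAGE = """<html><body>
<div id="all_pitching">
<!--
<table id="pitching"><tr><th>Pitching</th></tr><tr><td>Pitcher A</td></tr></table>
-->
</div>
<!-- an unrelated comment -->
</body></html>"""

tables = tables_in_comments(SAMPLE_PAGE)
print([t.get("id") for t in tables])
```

No browser is involved: the comment text is already in the downloaded HTML, so a second parse of that text is enough to reach the tables.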

demilio76 commented 2 years ago

I was playing around with the Selenium version and have changed my mind: I now agree with not using it. The main reason for my change of heart is that I didn't fully realize how much slower Selenium was until I ran a batch of calls. For example, getting all 162 box scores for the Dodgers' games this past season took the non-Selenium version 62 seconds, but the Selenium version took around 15 minutes.
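
For scale, those numbers work out to roughly 0.4 seconds per game without Selenium (62 / 162) versus about 5.6 seconds per game with it (900 / 162). A batch comparison like this could be timed with something along these lines. This is only a sketch: `time_batch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whichever scraper call is being measured.

```python
import time


def time_batch(fetch, game_ids):
    """Call fetch on every game id and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [fetch(game_id) for game_id in game_ids]
    return results, time.perf_counter() - start


def fake_fetch(game_id):
    # Hypothetical stand-in for a real box score request.
    time.sleep(0.001)
    return {"game": game_id}


results, elapsed = time_batch(fake_fetch, range(5))
per_game = elapsed / len(results)
print(f"{len(results)} games in {elapsed:.3f}s ({per_game:.3f}s per game)")
```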

schorrm commented 2 years ago

At 15 minutes vs. 62 seconds, there's a pretty clear winner here, even if Selenium would be cleaner.

BrayanMnz commented 1 year ago

This has been open for a year. Are we merging this into the project or not? @schorrm @tjburch