kndunlap / baseball-web-scrape

Scraping for baseball data!

question #1

Open jacobakaye opened 9 months ago

jacobakaye commented 9 months ago

What do I need to download to do this? I tried and it didn't work.

kndunlap commented 9 months ago

Jacob,

What exactly are you trying to download? If you provide more detail on what you're looking for, I can try and help you out.

Some of this code is unfinished as well.

jacobakaye commented 9 months ago

The R files to web scrape, and to get the data that you have in the CSVs.

kndunlap commented 9 months ago

For the CSVs you can click on the file and then download them. For the R file you can either download the files or copy paste into RStudio.

jacobakaye commented 9 months ago

My bad... I wasn't clear. From the R files, how can I get the data from the CSVs without actually downloading the CSVs?

kndunlap commented 9 months ago

Oh sorry. I don't have a way to do that yet. If you want to use them, you'll have to download them directly.

jacobakaye commented 9 months ago

Got it. Once the season starts, how often will those be updated?

kndunlap commented 9 months ago

I don't know if I will update it. This was more of a one-off project to model some stats based on the 2023 season. Is there something specific you'd like to see? I'm open to more ideas to practice my R skills.

jacobakaye commented 9 months ago

I guess a way to show something regarding the 2024 season -- can be updated throughout the season.

kndunlap commented 9 months ago

One thing I guess I could do is change the url to link to live stats, but savant won't have posted those pages for the 2024 season yet. Once we get closer to the season I might work on that if I have time. Are you trying to do anything with the upcoming 2024 data?

jacobakaye commented 9 months ago

Yes, I'm trying to make a table/viz that can be replicated throughout the season. The dream is to show a bunch of team ranks and how they've changed throughout the past week.

kndunlap commented 9 months ago

I see. That might not be too difficult once I find out where good team stats are stored. For that, would you want team stats directly, or player stats that then get aggregated into team stats?

jacobakaye commented 9 months ago

Fangraphs is good https://billpetti.github.io/baseballr/reference/fangraphs.html

jacobakaye commented 9 months ago

I would want team stats

kndunlap commented 9 months ago

That fangraphs package looks good. It seems better than the code I have.

jacobakaye commented 9 months ago

Yes - it's great. How did you get the OPS+ xlsx document?

kndunlap commented 9 months ago

For some reason that document isn't working for me; downloading it just gives an empty Excel file. Do you see anything within the document?

jacobakaye commented 9 months ago

Nope. Empty too.

jacobakaye commented 6 months ago

Hi - is there a way to scrape OPS+ and ERA+ from baseball reference?

kndunlap commented 6 months ago

You can use the baseballr package. Are you looking for a specific team or the whole league?

jacobakaye commented 6 months ago

Specific team. I wasn't able to find the plus stats in the package. Maybe I'm missing something?

kndunlap commented 6 months ago

Try this code:

library(tidyverse)
library(rvest)

bbref_url <- "https://www.baseball-reference.com/teams/DET/2024.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[8]]
pitchers <- bbref_tables[[9]]

I'm a Tigers fan, so I used the Tigers as an example, but you can change that link. Let me know if it works.

jacobakaye commented 6 months ago

Yes, it worked. Thank you. How do I get it on a team level?

kndunlap commented 6 months ago

Do you mean specific teams? A couple examples of changing the link, which lets you go by year too.

PIT: "https://www.baseball-reference.com/teams/PIT/2024.shtml"
NYM: "https://www.baseball-reference.com/teams/NYM/2021.shtml"

BTW, can you please try out my app? You pick a team and it graphs players based on their slugging vs. expected slugging. https://kndunlap.shinyapps.io/testapp/

jacobakaye commented 6 months ago

Just want to be able to create a graph with ERA+ vs OPS+

kndunlap commented 6 months ago

I'm not sure how to get every team at once. You could import all 30 teams with different variable names and then use rbind or cbind to merge the dataframes together. Then you'd have 1 big dataset of all the players.
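The import-and-bind idea can be sketched without 30 separate variable names. This is only a sketch under assumptions: the `team_url()` helper and the three-team vector are made up for illustration, and `fake_scrape()` is a stand-in for the real scraping step so the example runs offline.

```r
# Hypothetical helper: build a team-page URL following the pattern of the
# Baseball Reference links used elsewhere in this thread.
team_url <- function(team, year) {
  sprintf("https://www.baseball-reference.com/teams/%s/%d.shtml", team, year)
}

teams <- c("DET", "PIT", "NYM")  # extend to all 30 abbreviations
urls <- team_url(teams, 2024)

# Stand-in for the scraping step (made-up values). In real use each piece
# would come from read_html(url) |> html_table(), then picking one table.
fake_scrape <- function(team) {
  data.frame(Tm = team, Player = c("A", "B"), OPS_plus = c(100, 110))
}

# One data frame per team, stacked into a single data set with rbind.
all_players <- do.call(rbind, lapply(teams, fake_scrape))
print(all_players)
```

When swapping in real requests, it is probably worth pausing between pages (for example with `Sys.sleep()`) so the site is not hit 30 times in a row.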

jacobakaye commented 6 months ago

Gotcha. Wouldn't combining these work?

OPS+ (https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml)
ERA+ (https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml)

kndunlap commented 6 months ago

I tried that but for some reason the web scraping package doesn't pull out the table of individual batters, only the first one on the page of teams.
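One possible cause, not verified for this particular page, is that the missing table sits inside an HTML comment in the page source, so `html_table()` never sees it. A minimal sketch of pulling tables out of comments, using a made-up inline page (via rvest's `minimal_html()`) instead of the live site:

```r
library(rvest)

# Made-up page: one visible table and one table wrapped in an HTML comment.
page <- minimal_html('
  <table><tr><th>Tm</th><th>HR</th></tr><tr><td>DET</td><td>60</td></tr></table>
  <!-- <table><tr><th>Player</th><th>HR</th></tr><tr><td>A</td><td>12</td></tr></table> -->
')

# Only the visible table is found by a plain html_table() call.
visible <- page |> html_table()

# Extract the text of every comment node, parse it as HTML again,
# and read the tables it contains.
hidden <- page |>
  html_elements(xpath = "//comment()") |>
  html_text() |>
  paste(collapse = "") |>
  read_html() |>
  html_table()

print(hidden[[1]])
```

If the live page turns out not to use comments, this will simply return no extra tables.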

jacobakaye commented 6 months ago

But if I just want teams on both pages then that should work right?

kndunlap commented 6 months ago

I think so - sub in the link and it probably will work great. Let me know how it works for you. Code below

library(tidyverse)
library(rvest)

bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[1]]

bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
pitchers <- bbref_tables[[1]]

jacobakaye commented 6 months ago

Yes, it worked. I noticed this. I'm not sure how to fix it, though I'm sure it's very simple to do: the header/first row repeats twice. How can I remove the second one?

[Screenshot 2024-05-28 at 4 40 05 PM]

kndunlap commented 6 months ago

Good question. You can do this, assuming your table is named pitchers, and the row you want to get rid of is the first one.

pitchers |> slice(-1)
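A small base-R sketch of the same idea on a made-up table, assuming the repeated header shows up as a data row whose `Tm` value is literally "Tm". Dropping by content instead of by position keeps working if the repeated row ever moves.

```r
# Toy table standing in for the scraped one: the header text appears
# again as the first data row. Values are made up.
pitchers <- data.frame(
  Tm = c("Tm", "ARI", "ATL"),
  ERA = c("ERA", "4.50", "3.60"),
  stringsAsFactors = FALSE
)

# Drop by position (same effect as slice(-1)).
by_position <- pitchers[-1, ]

# Drop by content: keep rows whose Tm value is not the header text.
by_content <- pitchers[pitchers$Tm != "Tm", ]

print(by_content)
```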

jacobakaye commented 6 months ago

So this code completely works, but for whatever reason in the 'view' page, it doesn't sort properly....

library(tidyverse)
library(rvest)
library(mlbplotR)
library(dplyr)
library(ggplot2)

bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[1]]

hitters <- hitters |>
  slice(-31:-33)

bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
pitchers <- bbref_tables[[1]]

pitchers <- pitchers |>
  slice(-31:-33)

hitters_selected <- hitters %>% select(Tm, OPS_plus = `OPS+`)
pitchers_selected <- pitchers %>% select(Tm, ERA_plus = `ERA+`)
data <- inner_join(hitters_selected, pitchers_selected, by = "Tm")
view(data)

In the image, OPS+ is sorted descending

[Screenshot 2024-05-28 at 5 38 51 PM]

kndunlap commented 6 months ago

R was treating your number columns as characters and sorting them "alphabetically". I added to the end of your code to change those columns to numerics.

library(tidyverse)
library(rvest)
library(mlbplotR)
library(dplyr)
library(ggplot2)

bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[1]]

hitters <- hitters |>
  slice(-31:-33)

bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
pitchers <- bbref_tables[[1]]

pitchers <- pitchers |>
  slice(-31:-33)

hitters_selected <- hitters %>% select(Tm, OPS_plus = `OPS+`)
pitchers_selected <- pitchers %>% select(Tm, ERA_plus = `ERA+`)
data <- inner_join(hitters_selected, pitchers_selected, by = "Tm")
view(data)

# Convert the stat columns from character to numeric so they sort by value
data1 <- data |>
  mutate(
    OPS_plus = as.numeric(OPS_plus),
    ERA_plus = as.numeric(ERA_plus)
  )
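To see why the character columns sorted oddly, here is a tiny standalone illustration with made-up values. Character strings are compared one character at a time from the left, so "110" lands after "87" in a descending sort because "1" comes before "8".

```r
ops_plus <- c("95", "110", "100", "87")

# Sorted as text: compared character by character, not by value.
as_text <- sort(ops_plus, decreasing = TRUE)
print(as_text)

# Sorted as numbers after conversion: ordered by value.
as_number <- sort(as.numeric(ops_plus), decreasing = TRUE)
print(as_number)
```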

jacobakaye commented 6 months ago

Thank you. Now, I need to link mlbplotR into this which (I think) is complicated b/c it needs team abbreviations.
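One way to bridge full team names and abbreviations is a small lookup table joined on `Tm`. This is a sketch with assumptions: `team_lookup` is a hypothetical hand-written table showing only three teams, the stat values are made up, and the abbreviations should be checked against whatever the plotting package expects.

```r
# Hypothetical lookup table: full names as they might appear in the Tm
# column, paired with abbreviations. Only three teams shown; fill in the rest.
team_lookup <- data.frame(
  Tm = c("Detroit Tigers", "Pittsburgh Pirates", "New York Mets"),
  team_abbr = c("DET", "PIT", "NYM")
)

# Toy stand-in for the joined OPS+/ERA+ table (made-up values).
data1 <- data.frame(
  Tm = c("Detroit Tigers", "New York Mets", "Pittsburgh Pirates"),
  OPS_plus = c(95, 105, 90),
  ERA_plus = c(108, 98, 101)
)

# all.x = TRUE keeps every team in data1; a team missing from the
# lookup shows up with NA in team_abbr, which makes gaps easy to spot.
data_abbr <- merge(data1, team_lookup, by = "Tm", all.x = TRUE)
print(data_abbr)
```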

jacobakaye commented 6 months ago

Thank you so much! I was able to make this: ops_vs_era_plot

kndunlap commented 6 months ago

Awesome! Good job. That looks great.

kndunlap commented 6 months ago

Let me know if I can help with anything else. I enjoy this kind of analysis but I'm not very creative with my ideas.

jacobakaye commented 6 months ago

Thank you! Things I'd like to do via web scrape or baseballr:

kndunlap commented 6 months ago

Game preview tables would be interesting and potentially not that challenging to do. What would be on this table - lineups, stats, park info, etc?

jacobakaye commented 6 months ago

Team stats and where they rank among the league - maybe how they're trending in the past 2 weeks? - stuff like that. Really anything cool and unique
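A rough sketch of the rank-and-trend idea, assuming two snapshots of the same stat saved two weeks apart. The data frames and values here are made up for illustration.

```r
# Made-up snapshots of one stat (OPS+) taken two weeks apart.
before <- data.frame(Tm = c("DET", "PIT", "NYM"), OPS_plus = c(95, 90, 105))
after  <- data.frame(Tm = c("DET", "PIT", "NYM"), OPS_plus = c(102, 88, 101))

# rank() gives 1 to the smallest value, so negate to rank the highest first.
before$rank_before <- rank(-before$OPS_plus, ties.method = "min")
after$rank_after   <- rank(-after$OPS_plus, ties.method = "min")

trend <- merge(before[, c("Tm", "rank_before")],
               after[, c("Tm", "rank_after")], by = "Tm")

# Positive change means the team moved up the rankings.
trend$change <- trend$rank_before - trend$rank_after
print(trend[order(trend$rank_after), ])
```

In practice the "before" snapshot would have to be saved at scrape time (for example as a dated CSV), since a season-to-date page only shows current totals.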


kndunlap commented 6 months ago

That seems pretty reasonable. I can start to cook up some code for that.

jacobakaye commented 6 months ago

Sounds good. I could format the table


kndunlap commented 6 months ago

So if you look in the GitHub repo, I have a file called "tables.R". The function at the bottom lets you plug in a stat and a dataset (in that case, pitchers or hitters), and it will rank teams by that stat.
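For readers without the repo open, a function of that general shape might look like the sketch below. `rank_teams()` is hypothetical and is not the function in "tables.R"; the data is made up.

```r
# Hypothetical function: rank teams by any column of a data set.
rank_teams <- function(data, stat, decreasing = TRUE) {
  stopifnot(stat %in% names(data))
  # Order rows by the chosen stat and keep only the team and that stat.
  ordered <- data[order(data[[stat]], decreasing = decreasing), c("Tm", stat)]
  ordered$rank <- seq_len(nrow(ordered))
  rownames(ordered) <- NULL
  ordered
}

# Toy data with made-up values.
hitters <- data.frame(Tm = c("DET", "PIT", "NYM"), HR = c(60, 55, 72))
ranked <- rank_teams(hitters, "HR")
print(ranked)
```

For stats where lower is better (ERA, for instance), pass `decreasing = FALSE`.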

kndunlap commented 6 months ago

I'm going on vacation though, so I won't work on this for a few days.

jacobakaye commented 6 months ago

all good. something like this would be sick --

[Screenshot 2024-06-05 at 9 14 35 PM]

kndunlap commented 5 months ago

So I have an early draft of some code that will scrape bbref and Baseball Savant stats. At the end is a function where you can pick two teams and 5 stats, and it will output a table like the one you shared in that picture. It's only for hitters right now, not pitchers. Let me know if you have any questions or what else I should implement.

Code is in "collab.R" in my github.

jacobakaye commented 5 months ago

I like it a lot. Maybe some more stats besides the basic ones? I know it could be difficult though.

Maybe for pitching, combine the hitting and pitching datasets together just so it's easier?

That may just be easier to make the table with gt

kndunlap commented 5 months ago

Sure. What other stats are you interested in? There’s around 65 on this table.

I could also work on pulling from fangraphs and adding to the table.

Combining the pitching and hitting datasets would be a challenge. I can just keep them separate. Should be easy, I just need to work on it.

jacobakaye commented 5 months ago

Um... not really sure. This code should help combine the datasets (or serve as an example)

library(tidyverse)  # dplyr verbs and view()
library(baseballr)  # fg_team_batter(), fg_team_pitcher()

team_hitting_leaders <- fg_team_batter(qual = "y", startseason = 2024, endseason = 2024)
view(team_hitting_leaders)
#glimpse(hitting_leaders)
#head(hitting_leaders)

team_pitching_leaders <- (fg_team_pitcher(qual = "y", startseason = 2024, endseason = 2024))
view(team_pitching_leaders)

# Convert to data frames
team_hitting_leaders <- as.data.frame(team_hitting_leaders)
team_pitching_leaders <- as.data.frame(team_pitching_leaders)

# Select relevant columns
team_hitting <- team_hitting_leaders %>%
  select(team_name, wRC_plus)

team_pitching <- team_pitching_leaders %>%
  select(team_name, 'ERA-')

# Merge the dataframes on team_name
data <- merge(team_hitting, team_pitching, by = "team_name")

# View the merged dataframe
print(data)

jacobakaye commented 5 months ago

Maybe more splits? Starters' ERA? Relievers' ERA? Stuff like that? What do you think?