Open jacobakaye opened 9 months ago
Jacob,
What exactly are you trying to download? If you provide more detail on what you're looking for, I can try and help you out.
Some of this code is unfinished as well.
The R files to web scrape. And get the data that you have in the csv's
For the CSVs you can click on the file and then download them. For the R file you can either download the files or copy paste into RStudio.
My bad... I wasn't clear. From the R files, how can I get the data from the CSVs without actually downloading the CSVs
Oh sorry. I don't have a way to do that yet. If you want to use them, you'll have to download them directly.
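One possible way to skip the manual download is to point `read_csv()` at a file's raw GitHub URL. This is only a minimal sketch, not something the repo provides: the `main` branch and the file name `example.csv` are assumptions, so swap in the real ones.

```r
library(readr)

# Build the raw-content URL for a file in a public GitHub repo.
# The default branch name "main" is an assumption.
raw_github_url <- function(path,
                           repo = "kndunlap/baseball-web-scrape",
                           branch = "main") {
  paste0("https://raw.githubusercontent.com/", repo, "/", branch, "/", path)
}

# Read a CSV straight from GitHub into a data frame (needs an internet connection).
read_repo_csv <- function(path, ...) {
  read_csv(raw_github_url(path, ...), show_col_types = FALSE)
}

# "example.csv" is a placeholder, not a real file name from the repo.
csv_url <- raw_github_url("example.csv")
```

Calling `read_repo_csv("example.csv")` with a real file name would then load the data without saving anything locally.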
Got it. Once the season starts, how often will those be updated?
I don't know if I will update it. This was more of a one-off project to model some stats based on the 2023 season. Is there something specific you'd like to see? I'm open to more ideas to practice my R skills.
I guess a way to show something regarding the 2024 season -- can be updated throughout the season.
One thing I guess I could do is change the url to link to live stats, but savant won't have posted those pages for the 2024 season yet. Once we get closer to the season I might work on that if I have time. Are you trying to do anything with the upcoming 2024 data?
Yes, I'm trying to make a table/viz that can be replicated throughout the season. The dream is to show a bunch of team ranks and how they've changed throughout the past week.
I see. That might not be too difficult once I find out where good team stats are stored. For that would you want team stats or player stats and then make team stats out of the player stats?
Fangraphs is good https://billpetti.github.io/baseballr/reference/fangraphs.html
I would want team stats
That fangraphs package looks good. It seems better than the code I have.
Yes - it's great. How did you get the OPS+ xlsx document?
For some reason that document isn't working for me; downloading it just gives an empty Excel file. Do you see anything within the document?
Nope. Empty too.
Hi - is there a way to scrape OPS+ and ERA+ from baseball reference?
You can use the baseballr package. Are you looking for a specific team or the whole league?
Specific team. I wasn't able to find the plus stats in the package. Maybe I'm missing something?
Try this code:
library(tidyverse)
library(rvest)
bbref_url <- "https://www.baseball-reference.com/teams/DET/2024.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[8]]
pitchers <- bbref_tables[[9]]
I'm a Tigers fan, so I used the Tigers as an example, but you can change that link. Let me know if it works.
Yes, it worked. Thank you. How do I get it on a team level?
Do you mean specific teams? Here are a couple of examples of changing the link, which lets you pick the year too.
PIT: "https://www.baseball-reference.com/teams/PIT/2024.shtml"
NYM: "https://www.baseball-reference.com/teams/NYM/2021.shtml"
BTW can you please try out my app. You pick a team and it graphs places based on their slugging vs expected slugging. https://kndunlap.shinyapps.io/testapp/
I just want to be able to create a graph with ERA+ vs OPS+.
I'm not sure how to get every team at once. You could import all 30 teams with different variable names and then use rbind or cbind to merge the dataframes together. Then you'd have 1 big dataset of all the players.
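The "import each team and bind them together" idea can be sketched with a loop instead of 30 separate variables. This is a rough sketch under assumptions: `team_url()`, `scrape_team_table()`, and `scrape_teams()` are made-up helper names, and `table_index` has to be checked against the page, since the position of the table you want may differ.

```r
library(rvest)
library(dplyr)
library(purrr)

# Build a team-page URL from a team abbreviation and a year.
team_url <- function(team, year = 2024) {
  paste0("https://www.baseball-reference.com/teams/", team, "/", year, ".shtml")
}

# Scrape one table from one team page and tag each row with the team.
# `table_index` is an assumption: check which table you actually want.
scrape_team_table <- function(team, year = 2024, table_index = 1) {
  Sys.sleep(3)  # pause between requests to be polite to the site
  tables <- read_html(team_url(team, year)) |> html_table()
  tables[[table_index]] |>
    # Make every column character so tables with mixed types still stack
    mutate(across(everything(), as.character), team_abbr = !!team)
}

# Stack several teams into one data frame (needs an internet connection).
scrape_teams <- function(teams, ...) {
  map(teams, scrape_team_table, ...) |> bind_rows()
}

# Example call: all_hitters <- scrape_teams(c("DET", "PIT", "NYM"))
```

`bind_rows()` plays the role of `rbind` here, and the added `team_abbr` column keeps track of which page each row came from.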
Gotcha. Wouldn't combining these work?
OPS+ (https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml)
ERA+ (https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml)
I tried that but for some reason the web scraping package doesn't pull out the table of individual batters, only the first one on the page of teams.
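One common cause of `html_table()` seeing only the first table is that the other tables sit inside HTML comments in the page source. I haven't verified that this is what happens on this particular page, so treat the following as a sketch of the general workaround, shown on a tiny made-up page rather than the real site.

```r
library(rvest)

# A tiny stand-in page: one visible table and one table wrapped in an HTML comment.
page <- read_html(
  "<html><body>
  <table><tr><th>Tm</th><th>OPS+</th></tr><tr><td>Team A</td><td>95</td></tr></table>
  <!--
  <table><tr><th>Name</th><th>OPS+</th></tr><tr><td>Player A</td><td>120</td></tr></table>
  -->
  </body></html>"
)

# html_table() only sees tables that are part of the live document.
visible_tables <- page |> html_table()

# Pull the text out of every comment, re-parse it as HTML, then extract tables.
hidden_tables <- page |>
  html_elements(xpath = "//comment()") |>
  html_text() |>
  paste(collapse = "") |>
  trimws() |>
  read_html() |>
  html_table()
```

Here `visible_tables` holds only the team table, while `hidden_tables` holds the commented-out player table.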
But if I just want teams on both pages then that should work right?
I think so - sub in the link and it probably will work great. Let me know how it works for you. Code below
library(tidyverse)
library(rvest)
bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[1]]
bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
pitchers <- bbref_tables[[1]]
Yes, it worked. I noticed one thing that I'm sure is very simple to fix, but I'm not sure how: the header/first row appears twice. How can I remove the second one?
Good question. You can do this, assuming your table is named pitchers, and the row you want to get rid of is the first one.
pitchers |> slice(-1)
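A small sketch of that idea on made-up data, in case it helps. Note that `slice(-1)` on its own only prints the result, so the output has to be assigned to keep it. The `filter()` line is an alternative I'm adding that drops the row by content instead of position; it assumes the repeated header row literally contains "Tm" in the `Tm` column.

```r
library(dplyr)

# Toy stand-in for a scraped table whose first row repeats the header
pitchers <- tibble(
  Tm = c("Tm", "Team A", "Team B"),
  `ERA+` = c("ERA+", "110", "95")
)

# Drop by position: remove row 1 and keep the result
by_position <- pitchers |> slice(-1)

# Drop by content: remove any row where the team column just says "Tm"
by_content <- pitchers |> filter(Tm != "Tm")
```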
So this code completely works, but for whatever reason in the 'view' page, it doesn't sort properly....
library(tidyverse)
library(rvest)
library(mlbplotR)
library(dplyr)
library(ggplot2)
bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[1]]
hitters <- hitters |>
  slice(-31:-33)
bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
pitchers <- bbref_tables[[1]]
pitchers <- pitchers |>
  slice(-31:-33)
hitters_selected <- hitters %>% select(Tm, OPS_plus = `OPS+`)
pitchers_selected <- pitchers %>% select(Tm, ERA_plus = `ERA+`)
data <- inner_join(hitters_selected, pitchers_selected, by = "Tm")
view(data)
In the image, OPS+ is sorted descending
R was treating your number columns as characters and sorting them "alphabetically". I added to the end of your code to change those columns to numerics.
library(tidyverse)
library(rvest)
library(mlbplotR)
library(dplyr)
library(ggplot2)
bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
hitters <- bbref_tables[[1]]
hitters <- hitters |>
  slice(-31:-33)
bbref_url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-pitching.shtml"
html_bbref <- read_html(bbref_url)
bbref_tables <- html_bbref |>
  html_table()
pitchers <- bbref_tables[[1]]
pitchers <- pitchers |>
  slice(-31:-33)
hitters_selected <- hitters %>% select(Tm, OPS_plus = `OPS+`)
pitchers_selected <- pitchers %>% select(Tm, ERA_plus = `ERA+`)
data <- inner_join(hitters_selected, pitchers_selected, by = "Tm")
view(data)
# Convert the stat columns from character to numeric so they sort correctly
data1 <- data |>
  mutate(
    OPS_plus = as.numeric(OPS_plus),
    ERA_plus = as.numeric(ERA_plus)
  )
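To see why the character columns sorted oddly, here is a tiny standalone illustration with made-up values. Text is compared character by character, so "95" ranks above "110" because "9" comes after "1".

```r
ops_chr <- c("95", "110", "100")

# As text, values are compared character by character
chr_sorted <- sort(ops_chr, decreasing = TRUE)
chr_sorted
# "95" "110" "100"

# As numbers, the order is what you'd expect
num_sorted <- sort(as.numeric(ops_chr), decreasing = TRUE)
num_sorted
# 110 100 95
```

One thing to watch: `as.numeric()` turns anything that isn't a number (for example a leftover header row) into `NA` with a warning, so it's worth dropping those rows first.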
Thank you. Now, I need to link mlbplotR into this which (I think) is complicated b/c it needs team abbreviations.
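One way to attach abbreviations is a lookup table joined on the team name. The sketch below uses a hand-made three-team lookup and made-up stat values, so it runs without mlbplotR. I believe `mlbplotR::load_mlb_teams()` returns a full table with team names and abbreviations that could replace the hand-made lookup, but check its column names before relying on that.

```r
library(dplyr)

# Toy version of the joined data; the stat values are made up for illustration
data1 <- tibble(
  Tm = c("Detroit Tigers", "Pittsburgh Pirates", "New York Mets"),
  OPS_plus = c(95, 98, 105),
  ERA_plus = c(110, 101, 99)
)

# Hand-made lookup from full team name to abbreviation (hypothetical helper table)
team_lookup <- tibble(
  Tm = c("Detroit Tigers", "Pittsburgh Pirates", "New York Mets"),
  team_abbr = c("DET", "PIT", "NYM")
)

# left_join keeps every team in data1; unmatched names would get NA
plot_data <- data1 |>
  left_join(team_lookup, by = "Tm")

# Quick check for names that failed to match
unmatched <- plot_data |> filter(is.na(team_abbr))
```

If `unmatched` has any rows, the team names in the two tables are spelled differently and need to be reconciled before plotting.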
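Thank you so much! I was able to make this: ops_vs_era_plot.png (https://github.com/kndunlap/baseball-web-scrape/assets/126515518/09aedaae-b67d-4243-a790-2e6e4c03cae0)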
Awesome! Good job. That looks great.
Let me know if I can help with anything else. I enjoy this kind of analysis but I'm not very creative with my ideas.
Thank you! Things I'd like to do via web scrape or baseballR:
Game preview tables would be interesting and potentially not that challenging to do. What would be on this table - lineups, stats, park info, etc?
Team stats and where they rank among the league - maybe how they're trending in the past 2 weeks? - stuff like that. Really anything cool and unique.
That seems pretty reasonable. I can start to cook up some code for that.
Sounds good. I could format the table
So if you look in the GitHub repo, I have a file called "tables.R". The function at the bottom lets you plug in a stat and a dataset (in that case, pitchers or hitters), and it will rank teams by that stat.
I'm going on vacation, though, so I won't work on this for a few days.
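For readers without the repo open, a function of the kind described above might look roughly like this. It is not the code from "tables.R": `rank_teams()` is a hypothetical name, and the data is made up.

```r
library(dplyr)

# Rank teams by any stat column. Higher is better by default;
# set descending = FALSE for stats where lower is better.
rank_teams <- function(data, stat, descending = TRUE) {
  data |>
    mutate(
      rank = if (descending) min_rank(desc({{ stat }})) else min_rank({{ stat }})
    ) |>
    arrange(rank)
}

# Made-up example data
hitters <- tibble(
  Tm = c("Team A", "Team B", "Team C"),
  OPS_plus = c(95, 110, 100)
)

ranked <- rank_teams(hitters, OPS_plus)
```

The `{{ stat }}` syntax lets the column name be passed unquoted, so the same function works for any stat in either dataset.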
all good. something like this would be sick --
So I have an early draft of some code that will scrape bbref and Baseball Savant stats. At the end is a function where you can pick two teams and 5 stats, and it will output a table like the one you shared in that picture. It's only for hitters right now, not pitchers. Let me know if you have any questions or what else I should implement.
Code is in "collab.R" in my github.
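As a rough picture of what a two-team comparison function could do (this is a sketch with made-up data, not the code in "collab.R", and `compare_teams()` is a hypothetical name):

```r
library(dplyr)
library(tidyr)

# Build a side-by-side table: one row per stat, one column per team.
compare_teams <- function(data, team1, team2, stats) {
  data |>
    filter(Tm %in% c(team1, team2)) |>
    select(Tm, all_of(stats)) |>
    pivot_longer(-Tm, names_to = "stat", values_to = "value") |>
    pivot_wider(names_from = Tm, values_from = value)
}

# Made-up example data
hitters <- tibble(
  Tm = c("Team A", "Team B", "Team C"),
  HR = c(40, 55, 48),
  `OPS+` = c(95, 110, 100)
)

preview <- compare_teams(hitters, "Team A", "Team B", c("HR", "OPS+"))
```

The result could then be handed to a table package such as gt for formatting.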
I like it a lot. Maybe some more stats besides the basic ones? I know it could be difficult though.
Maybe for pitching, combine the hitting and pitching datasets together just so it's easier?
That may just be easier to make the table with gt
Sure. What other stats are you interested in? There’s around 65 on this table.
I could also work on pulling from fangraphs and adding to the table.
Combining the pitching and hitting datasets would be a challenge. I can just keep them separate. Should be easy, I just need to work on it.
Um... not really sure. This code should help combine the datasets (or serve as an example)
library(tidyverse)  # provides %>%, select(), and view()
library(baseballr)  # provides fg_team_batter() and fg_team_pitcher()
# Pull 2024 team-level hitting and pitching leaderboards from FanGraphs
team_hitting_leaders <- fg_team_batter(qual = "y", startseason = 2024, endseason = 2024)
view(team_hitting_leaders)
#glimpse(team_hitting_leaders)
#head(team_hitting_leaders)
team_pitching_leaders <- fg_team_pitcher(qual = "y", startseason = 2024, endseason = 2024)
view(team_pitching_leaders)
# Convert to data frames
team_hitting_leaders <- as.data.frame(team_hitting_leaders)
team_pitching_leaders <- as.data.frame(team_pitching_leaders)
# Select relevant columns
team_hitting <- team_hitting_leaders %>%
  select(team_name, wRC_plus)
team_pitching <- team_pitching_leaders %>%
  select(team_name, `ERA-`)
# Merge the dataframes on team_name
data <- merge(team_hitting, team_pitching, by = "team_name")
# View the merged dataframe
print(data)
maybe like more splits? starters era? relievers era? stuff like that? what do you think?
What do I need to download to do this? I tried and it didn't work.