nflverse / nflverse-rosters

builds roster data for nflverse/nflverse-data

[FEATURE REQ] Coach Headshots #82

Open alecglen opened 3 months ago

alecglen commented 3 months ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like

Make coach headshots available, e.g. for use with nflplotR.

Here are the relevant links as of 2024-08-06: nfl_coaches.csv

Here's the script to pull them, which could be updated to also grab coordinators, etc.

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

coaches = []

all_teams_page = requests.get("https://www.nfl.com/teams/")
all_teams_page.raise_for_status()

all_teams_soup = BeautifulSoup(all_teams_page.text, "lxml")
main_content = all_teams_soup.find("main", {"id": "main-content"})

for linkbutton in main_content.find_all("a", string=re.compile(r"View Full Site")):
    team: str = linkbutton.find_previous("p").text.strip()
    site: str = linkbutton["href"]
    print(f"{team}: {site}")

    try:
        team_coaches_url = site.strip("/") + "/team/coaches-roster/"
        team_coaches_page = requests.get(team_coaches_url)
        team_coaches_page.raise_for_status()
    except requests.HTTPError:
        team_coaches_url = site.strip("/") + "/team/coaches/"
        team_coaches_page = requests.get(team_coaches_url)
        team_coaches_page.raise_for_status()

    team_coaches_soup = BeautifulSoup(team_coaches_page.text, "lxml")
    coaches_main = team_coaches_soup.find("main", {"id": "main-content"})

    hc_text = coaches_main.find("h5", string=re.compile(r"Head Coach"))

    try:
        hc_name = hc_text.find_previous("h3").text.strip()
        assert len(hc_name.split()) == 2
    except (AttributeError, AssertionError):
        # Some team sites place the name after the title instead of before it.
        hc_name = hc_text.find_next("h3").text.strip()
        assert len(hc_name.split()) == 2

    hc_headshot = hc_text.find_previous("img")
    headshot_url = hc_headshot.get("data-src") or hc_headshot["src"]
    assert headshot_url.startswith("https://static.clubs.nfl.com/image/")

    # Drop the "t_..." transformation segment to get the untransformed image URL.
    headshot_url = re.sub(r"t_[a-z_]*/", "", headshot_url)

    coaches.append({
        "team": team,
        "team_site_source": team_coaches_url,
        "head_coach": hc_name,
        "headshot_url": headshot_url
    })

    print(f"{hc_name} {headshot_url}")
    print()

pd.DataFrame(coaches).to_csv("nfl_coaches.csv", index=False)
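
The URL cleanup step in the script above could be pulled out into a small pure function so it can be checked without hitting any team site. This is only a sketch: the name strip_transform and the example URL are hypothetical, and it assumes the transformation segment is always a whole path component made of "t_" plus lowercase letters and underscores, which is what the script's regex implies.

```python
import re

# Matches one whole path segment such as "/t_headshot_desktop" when it is
# followed by another "/". Same character class as the script's regex
# (lowercase letters and underscores only); that is an assumption.
_TRANSFORM_SEGMENT = re.compile(r"/t_[a-z_]+(?=/)")

# Hypothetical URL for illustration only; not a real headshot.
EXAMPLE_URL = (
    "https://static.clubs.nfl.com/image/private/"
    "t_headshot_desktop/f_auto/example/abc123"
)


def strip_transform(url: str) -> str:
    """Return url with any "t_..." transformation segment removed."""
    return _TRANSFORM_SEGMENT.sub("", url)


cleaned = strip_transform(EXAMPLE_URL)
# The "t_headshot_desktop" segment is gone; the rest of the path is kept.
assert cleaned == "https://static.clubs.nfl.com/image/private/f_auto/example/abc123"
```

Anchoring the pattern on the surrounding slashes keeps it from matching a "t_" that happens to sit in the middle of another path component, which the unanchored version in the script could do.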

Describe alternatives you've considered

No response

Additional context

Per Discord discussion https://discord.com/channels/789805604076126219/924673653961003098/1270566142586523649

john-b-edwards commented 1 month ago

I suspect there's a /coaches/ endpoint hanging out somewhere around here that we might be able to hit more cleanly, but this is a great start. I'll try to find that endpoint, or otherwise incorporate this info, when I have an opportunity.