j-andrews7 / kenpompy

A simple yet comprehensive web scraper for kenpom.com.
https://kenpompy.readthedocs.io/en/latest/?badge=latest
GNU General Public License v3.0
70 stars 21 forks source link

Enhancement: Team Stats table #40

Closed nickostendorf closed 10 months ago

nickostendorf commented 1 year ago

The current team module only pulls valid teams and team schedules. This PR allows retrieving of a team's stats similar to pulling their schedule.

Kenpom's team pages uses javascript to fill the majority of the 'Scouting Report' table. Therefore, these values are not available to MechanicalSoup since MechanicalSoup is only an HTML parser.

I used two packages to help with the extraction:

Walkthrough

Open page as normal

browser.open(url)
team_page = browser.get_current_page()

Save page temporarily so that we can make a request to it using requests_html

# Save page to temporarily
with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as file:
    file.write(team_page.encode())

Open saved paged with requests_html session We use requests_file's FileAdapter to allow requests_html to open local files. The render() function opens the session in a headless chromium browser. Then it returns the html with javascript executed.

sess = HTMLSession()
sess.mount('file://', FileAdapter())
r = sess.get(f'file:///{file.name}')
r.html.render()

Create a new BeautifulSoup object from rendered html and find stats table

# Create soup object from rendered page
soup = BeautifulSoup(r.html.html, 'html.parser')

# Retrieve stats table and parse into DataFrame
statsTable = soup.findAll('table')[0]
stats_df = pd.read_html(str(statsTable))[0]

The first time render() is called on a system, it will download a chromium binary (~85mb) I didn't like the idea of making a user download a whole browser binary just for a table. However, this route is still quicker and lighter than going with a selenium approach. I also believe this requests_html approach can be used to retrieve more missing data around Kenpom.

I have added a couple tests in test_team.py. Please let me know if anyone has any questions.

Thanks!

j-andrews7 commented 1 year ago

This was rather where I left off with the team/player pages. I am generally fine with the approach, even with the download presuming we document it and make it explicitly clear.

Give me a shout when you feel it's in a reviewable state and I (and @esqew if he's available) will take a look.

nickostendorf commented 1 year ago

@j-andrews7 This is ready for review.

There are a few caveats to be mindful of:

esqew commented 1 year ago

Full disclosure here - I've only perused the code at a surface level thus far. Just one initial question about this point:

  • This cannot run within a notebook due do an eventloop error. We would need to use AsyncHTMLSession() instead of HTMLSession() from requests_html.

What, in your view, would be the lift to use AsyncHTMLSession() from what you've got now? I have to imagine we have quite a few Jupyter users (including myself). If it's not worth it right now, we could also shoot to upgrade in a future release.

nickostendorf commented 1 year ago

@esqew I've made some changes that enable use inside of Jupyter Notebooks.

To use get_stats() within a Jupyter Notebook, the user has to await the get_stats() function.

ex: await team.get_stats(browser, 'Purdue', _async=True)

AsyncHTMLSession creates coroutines when get() and arender() are called. This was a problem before because Juypter itself runs a constant event loop. I didn't have any prior experience with asyncio within a notebook so I didn't understand why the coroutines were failing.

Anyway, I have made some new commits that enable get_stats() to work within a notebook. Whenever you or @j-andrews7 have time please give it a test and relay some feedback,

esqew commented 1 year ago

Sorry for the radio silence! I mentioned in the other thread that work + moving has been kicking my ass, so I haven't had time to go through your latest commit.

If Jared has time to review he'll probably be able to get to it before me, but I should have time to review + approve a merge early in the week of 22-Jan.

In any event, I really appreciate you taking the time to architect this. This will be a great feature to have.

nickostendorf commented 1 year ago

@j-andrews7

One, it's probably worth retaining the rankings for each metric rather than stripping them.

I will strip these into a new column next to the respective stat.


More importantly, I can't get any results returned from this.

from kenpompy.utils import login
df = kp.get_stats(browser, "Louisville", season = "2013")
df

yields a <coroutine object get_stats at 0x7f7d5cb04ba0>. Use of _async = True yields the same result. I imagine this is what you referenced with unresolved eventloops. It could be a filepath issue, as I'm using Ubuntu 20.04 via WSL.

Make sure to call await before the function call with _async = True. I should have addressed this in the docstring.. will be added.

Also, I didn't like this part of the implementation because it seems a little messy. There may be a way to wrap this call in a function if _async == True. I will look into this.


Regardless, there should probably be a few changes. Use OS-agnostic paths if possible. And check that the final df is not a coroutine object before it's returned and throw an error if it is.

For OS-agnostic paths, are you referencing this?

if _async:
  sess = AsyncHTMLSession()
  sess.mount('file://', FileAdapter())
  r = await sess.get(f'file:///{filename}')
  await r.html.arender()
else:
  sess = HTMLSession()
  sess.mount('file://', FileAdapter())
  r = sess.get(f'file:///{filename}')
  r.html.render()

I may have a few other comments on output once I can get results. It'd be really nice to able to check the "Conference only" box to yield those stats, but that can also be added later. I can't remember if I ran into issues with that or not. At minimum, it should be mentioned in the docstring that this only returns the full season results at the moment.

I agree with Conference only stats. I missed that since I haven't been using that box before these last couple weeks. I will look into adding this functionality.

I will update the docstring according to everything that has been mentioned here.

Thanks for the feedback!

j-andrews7 commented 1 year ago

For OS-agnostic paths, are you referencing this?

Well, more this:

if sys.platform == 'win32':
        filename = file.name.replace('\\', '//')
    else:
        filename = file.name

But on closer look, I haven't used asyncio at all, so I don't know what nonsense is expected for the session mounting and all.

Make sure to call await before the function call with _async = True. I should have addressed this in the docstring.. will be added.

So do I have to wrap it all in an async function to get the info out given that await is only usable in an async environment? I thought asyncio.run(kp.get_stats(browser, "Louisville", season = "2013", _async = True)) would work, but it also returns an error about the browser closing unexpectedly. May be good to just add an example of a proper call that works for you to the docstring as well.

nickostendorf commented 1 year ago

Hey @j-andrews7, sorry for the silence. I have been really busy with work this past month. I am getting back into the revisions you've requested and will have them updated in the PR this weekend.

I will update the documentation to explain how to use the async function. I believe I have found an easier way to call the method than how I wrote it before. I will send you a note when everything is ready for review!