j-andrews7 / kenpompy

A simple yet comprehensive web scraper for kenpom.com.
https://kenpompy.readthedocs.io/en/latest/?badge=latest
GNU General Public License v3.0

Properly handle team-specific data availability for certain years #68

Closed esqew closed 9 months ago

esqew commented 10 months ago

Related to #64: some teams are missing data for years in which other teams have it. For example, most or all top-tier D1 programs have been in D1 for quite some time and thus have the complete dataset from 1999 through the present day.

However, more recent D1 entrants have limited historical availability. See Merrimack, whose data begins only in 2020.

This also means that any team that, between 1999 and the present, (a) was in D1, (b) did not participate in D1 for one or more years, and then (c) returned to D1 may also be missing some years in between.

A simple solution for this is a check to make sure the navigation to a particular team's season page in team.py is not redirected back to the main page (which seems to be how KenPom is designed to behave when requesting data that doesn't exist; see how https://kenpom.com/team.php?team=Merrimack&y=2019 behaves).
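A minimal sketch of that check, assuming the final URL after any redirects is available from whatever fetches the page (with the `requests` library, for example, that is `response.url`). The helper name `landed_on_requested_page` is hypothetical and not part of kenpompy, and the main-page URL in the usage lines is an assumption about where the redirect ends up:

```python
from urllib.parse import parse_qs, urlsplit


def landed_on_requested_page(requested_url: str, final_url: str) -> bool:
    """Return True if final_url has the same path and query parameters as requested_url.

    Hypothetical helper: a redirect to the main page changes the path and
    drops the team/year parameters, so the comparison fails.
    """
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    # Compare parsed query strings so parameter order does not matter.
    return (
        requested.path == final.path
        and parse_qs(requested.query) == parse_qs(final.query)
    )


requested = "https://kenpom.com/team.php?team=Merrimack&y=2019"
# Assumed redirect target for a season with no data.
redirected = landed_on_requested_page(requested, "https://kenpom.com/")
# No redirect: the final URL is the one that was requested.
not_redirected = landed_on_requested_page(requested, requested)
```

Comparing URLs rather than page content keeps the check independent of KenPom's markup, at the cost of relying on the redirect behavior staying the same.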

esqew commented 10 months ago

I drew this up quickly in my JavaScript console when I had a second today:

[...document.querySelector('#years-wrapper #years-container').innerText.matchAll(/(?<=\s)\d{2}(?=\s)/g)].map(m => parseInt(m[0], 10) > new Date().getFullYear() % 100 + 1 ? parseInt('19' + m[0], 10) : parseInt('20' + m[0], 10))

When run against Villanova's team page, it returns:

[1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]

This type of logic might be another way to determine whether a season parameter value is "valid". I'm going to do more digging to see what the best way might be - after giving it a bit more thought I'm of the belief that simply launching the request and checking if there was a redirect is ultimately more efficient.
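For reference, the same two-digit-year logic could be sketched in Python roughly as follows. This assumes the text of the years container has already been extracted as a string with whitespace-separated two-digit years; the function name `expand_two_digit_years` and the sample input are hypothetical, and nothing here reflects how kenpompy itself parses the page:

```python
import re
from datetime import date


def expand_two_digit_years(text, current_year=None):
    """Expand whitespace-delimited two-digit years in text to four-digit years."""
    if current_year is None:
        current_year = date.today().year
    # Anything later than next season's two-digit year is treated as 19xx.
    cutoff = current_year % 100 + 1
    years = []
    for token in re.findall(r"(?<=\s)\d{2}(?=\s)", text):
        yy = int(token)
        years.append(1900 + yy if yy > cutoff else 2000 + yy)
    return years


# Made-up sample input; the real layout of the container text is an assumption.
sample = expand_two_digit_years(" 99 00 01 24 ", current_year=2024)
# sample == [1999, 2000, 2001, 2024]
```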

WakeUpWaffles commented 10 months ago

> giving it a bit more thought I'm of the belief that simply launching the request and checking if there was a redirect is ultimately more efficient.

I would have to agree with this statement. Preventing a request when the page doesn't exist would be ideal, but I don't think it is worth the effort. An incorrect year costs only one page request; it is not as if multiple requests are needed.

However, if we wanted a check before requests, we could have a dict (or another structure) that holds all the teams and the years associated with each team. I still think a redirect check would be simpler and less error-prone, as the dict would need updating at the bare minimum every year. Probably not worth the overhead either.
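To make the trade-off concrete, here is a rough sketch of what such a lookup might look like. The names `VALID_SEASONS` and `is_valid_season` are hypothetical, and the table holds only the two teams mentioned in this thread; the 2024 upper bound for Merrimack is an assumption. A real table would need an entry for every team:

```python
# Illustrative data only: Villanova's range comes from the list above,
# Merrimack's start year from the issue description.
VALID_SEASONS = {
    "Villanova": range(1999, 2025),  # 1999 through 2024
    "Merrimack": range(2020, 2025),  # 2020 through 2024 (upper bound assumed)
}


def is_valid_season(team, season, table=VALID_SEASONS):
    """Return True if the table lists the season for the team."""
    # Unknown teams fall back to an empty tuple, so the check is False.
    return season in table.get(team, ())


merrimack_2019 = is_valid_season("Merrimack", 2019)  # False
merrimack_2020 = is_valid_season("Merrimack", 2020)  # True
```

Every range in the table has to be extended each season, which is the maintenance burden described above.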

Also, I can pick this issue up. I am planning to work on the redirect check unless there is opposition.

j-andrews7 commented 10 months ago

Go for it, PRs are welcome. We'll probably push a release here relatively soon.


esqew commented 9 months ago

As @WakeUpWaffles mentioned in the thread for #75, this is actually already handled by appropriate calls to get_valid_teams().