j-andrews7 / kenpompy

A simple yet comprehensive web scraper for kenpom.com.
https://kenpompy.readthedocs.io/en/latest/?badge=latest
GNU General Public License v3.0
70 stars 21 forks source link

FanMatch Conference Tourneys impacting team name parsing #47

Closed nickostendorf closed 1 year ago

nickostendorf commented 1 year ago

@j-andrews7 @esqew

With the conference tournaments going on this week, I noticed that some of the FanMatch output includes the conference tournament name.

(screenshot: FanMatch errors)

I ran the FanMatch module and stepped through the logic for 3/3. I noticed that if the home team is the PredictedLoser in a conference tourney game, the parsed team name will include the tournament abbreviation. If that scenario doesn't apply, the conference tourney label instead gets parsed into the possessions column.

Here is how the PredictedLoser is parsed in the above scenario. It retrieves everything after the team's ranking, which is why the tourney abbreviation gets included:

x[0] = " ".join(x[0].split()[1:])
x[1] = " ".join(x[1].split()[1:])
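A minimal illustration of why that join keeps the label, using a hypothetical loser cell (the exact FanMatch cell format may differ, but the trailing abbreviation is the point):

```python
# Hypothetical loser cell for a conference tourney game: rank, team name,
# then the tournament abbreviation. Dropping only the first token (the rank)
# and rejoining keeps everything else, including the trailing label.
cell = "45 Wake Forest ACC-T"
name = " ".join(cell.split()[1:])
print(name)  # → 'Wake Forest ACC-T'
```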

and here is the logic for how possessions are getting parsed:

pos = fm_df.Game.str.split(r" \[").str[1]
fm_df["Game"], fm_df["Possessions"] = fm_df.Game.str.split(r" \[").str[0], pos.astype("str")
fm_df.Possessions = fm_df.Possessions.str.strip(r"\]")
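Running that parsing logic against a hypothetical Game string shows the failure mode: splitting on `" \["` assumes the bracketed possessions figure is the last thing on the line, so a trailing tourney label ends up glued onto the Possessions value (the game strings below are illustrative, not actual KenPom output):

```python
import pandas as pd

# Two hypothetical FanMatch rows: a regular game and a conference tourney
# game with a trailing label after the bracketed possessions figure.
fm_df = pd.DataFrame({"Game": [
    "12 Duke 80, 45 Wake Forest 70 [68]",
    "12 Duke 80, 45 Wake Forest 70 [68] ACC-T",
]})

# Same logic as the library: split on " [", keep the tail as possessions,
# then strip "]" characters from the ends of the possessions string.
pos = fm_df.Game.str.split(r" \[").str[1]
fm_df["Game"], fm_df["Possessions"] = fm_df.Game.str.split(r" \[").str[0], pos.astype("str")
fm_df.Possessions = fm_df.Possessions.str.strip(r"\]")

# The "]" is only stripped from the string's ends, so the tourney game's
# possessions value keeps both the bracket and the label.
print(fm_df.Possessions.tolist())  # → ['68', '68] ACC-T']
```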

This is just a string splitting issue. I was wondering if either of you had any ideas on how to go about this. We could create a list of conference tourneys and remove them from the Game column as soon as we retrieve the FanMatch table. Obviously, we would need to collect all of KenPom's tourney abbreviations for this to work without error.

Thanks, let me know what you think.

esqew commented 1 year ago

Good call out. The ideal solution is to tighten up the pattern matching to account for the conference tournament label potentially being present around this time of year. That would be significantly preferable to hard-coding any conference labeling data, since it avoids the resulting maintenance overhead altogether.

This should be a somewhat easy fix for someone with the right level of RegExp knowledge. I'll have time to take a peek at this on Monday at the latest.

Without looking at them directly, this also smells like a gap in the tests that should surface this type of issue in the future. I would strongly encourage any PR for this to include changes to the existing test suite that would help catch this moving forward.

nickostendorf commented 1 year ago

Yeah, I think regex is probably the best way to alleviate errors in the long run.

I did just work up a quick fix for the time being, if anyone needs a solution: I put the tourney abbreviations in a self.conf_tourneys list and added this before any other parsing begins:

# self.conf_tourneys is a manually maintained list of KenPom tourney abbreviations
p = re.compile('|'.join(map(re.escape, self.conf_tourneys)))
fm_df['Game'] = [p.sub('', game) for game in fm_df['Game']]
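A self-contained sketch of that quick fix, with a hypothetical abbreviation list (the real list would need to cover every label KenPom actually uses) and a trailing-whitespace strip added since the substitution alone leaves a dangling space:

```python
import re
import pandas as pd

# Hypothetical abbreviation list; KenPom's actual label set would need
# to be collected for this to work without error.
conf_tourneys = ["ACC-T", "B10-T", "SEC-T"]

fm_df = pd.DataFrame({"Game": ["12 Duke 80, 45 Wake Forest 70 [68] ACC-T"]})

# Build one alternation pattern from the escaped abbreviations and
# delete any match from each Game string.
p = re.compile("|".join(map(re.escape, conf_tourneys)))
fm_df["Game"] = [p.sub("", game).strip() for game in fm_df["Game"]]

print(fm_df["Game"].tolist())  # → ['12 Duke 80, 45 Wake Forest 70 [68]']
```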

This is not a concrete solution, as labels can change over time like you said. I'll look into some regex and poke at it some more.

Let me know if you find a better solution!

j-andrews7 commented 1 year ago

Seems a pretty easy grep/regex with "-T". I won't have time to mess with this for at least a few weeks, but if somebody makes a PR and includes a test for stripping or including this info, I'd welcome it.
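If the "-T" suffix really is consistent across KenPom's tourney labels, that suggestion could be sketched as a single regex that strips a trailing "-T"-suffixed token; this is an assumption about the label format, not a verified rule:

```python
import re

def strip_tourney_label(game: str) -> str:
    """Remove a trailing conference tourney token like 'ACC-T', assuming
    all such labels end in '-T' and appear at the end of the Game string."""
    return re.sub(r"\s+\S+-T$", "", game)

# Tourney label is stripped; a regular game string is left untouched.
print(strip_tourney_label("12 Duke 80, 45 Wake Forest 70 [68] ACC-T"))
print(strip_tourney_label("12 Duke 80, 45 Wake Forest 70 [68]"))
```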
