uPeppe / fantabeto

Machine learning model for predicting Serie A players' performance in a match, in terms of Fantacalcio (Italian fantasy football) scores.
MIT License

GKs performance estimation? #1

Open DiTo97 opened 1 year ago

DiTo97 commented 1 year ago

Hi @uPeppe,

In the accompanying blog post you stated that the performance estimation model was good on outfield players, but subpar on goalkeepers (GKs). Have you investigated the root causes of this phenomenon any further?

uPeppe commented 1 year ago

Hello @DiTo97,

I haven't done further analysis on that, but the problem with goalkeepers is that there is only one per team (of 20), so the dataset is made up of 20 groups of entries, where the N entries in each group share the same "player stats" and "team stats" parts and differ only in the "opposing team stats" part.
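To make that group structure concrete, here is a toy sketch; the feature names and values are invented for illustration, not the repo's actual schema:

```python
# Toy sketch: three matches of the same goalkeeper form one "group".
# The player-stats and team-stats features are constant within the group,
# and only the opposing-team stats vary (all values invented).
group = [
    # (gk_save_pct, team_xg_against, opp_xg)
    (0.71, 1.1, 1.8),
    (0.71, 1.1, 0.9),
    (0.71, 1.1, 1.3),
]

player_team_part = {row[:2] for row in group}
opponent_part = {row[2] for row in group}

print(len(player_team_part))  # 1: no variation for the model to learn from
print(len(opponent_part))     # 3: the only source of variation
```

Within each group, only a third of the feature vector actually carries signal, which is much less information per row than the model sees for outfield players.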

My guess is that more attempts should be made at selecting a better network structure, and more effort should be put into selecting fewer, more relevant features for goalkeepers.

DiTo97 commented 1 year ago

It took me some time to understand everything, but after digging inside each notebook I think I have a clearer view of your work now.

IIUC, your suspicion is that overfitting is the root cause of the problem: the number of goalkeepers is much smaller than the number of outfield players, which makes for a dataset with many rows that are very similar, or even identical, for a good portion of the features (the player and team features you were mentioning).

This could absolutely be true, but I have a few more concerns after analysing the notebooks, culminating in the fact that the GK model seems to be the better-performing of the two.

I will list them as points so that you can address each of them individually, albeit without reporting the name of the corresponding notebook, as I honestly do not remember them all.

uPeppe commented 1 year ago

1) Exactly! Outfield players' and goalkeepers' fanta-votes have very different distributions due to the different types of bonus they have in the game. Goalkeepers' votes tend to be less variable with respect to their average, so it makes sense that the $R^2$ correlation is better than for outfield players. However, I still consider these predictions worse when I look at them from my Fantasy Football player point of view, especially when ranking the goalkeepers on a single matchday. In other words, looking at the predictions without knowing anything about the model, as a player I would rely on what I've seen for outfield players, but I can't say the same for goalkeepers. Of course, this is not (yet) supported by data. Also, the clean sheet probability prediction was not that good.

2) Initially, I absolutely needed to take past seasons into account for players who had not played many minutes in Serie A, since their current-season stats were not reliable. Going forward with the season, for most players, who had played more than 10 games, the weight of past seasons was basically nil. The coefficients used for averaging were not selected with any technique, but by rule of thumb. What technique would you suggest for a better selection? An additional model for estimating performance when a player has a low number of games in the current season would be an interesting problem on its own. Even now that the new season is starting, it would be needed for players who haven't played in Serie A yet and only have stats available from other leagues.

3/4) One thing that maybe is not clear, and it is a limitation of the features, is that for a given matchday the player and team stats are the ones averaged over the whole current season, not stats that take into account only the games up to that matchday. This is due to the kind of data I could scrape from FBRef. So there isn't really a temporal component in the features, and I agree it would be better if there were.

5) Partially answered in point 1. Also, one thing to add is that the sinh-arcsinh network learns a distribution, not a point regression prediction. It therefore has a different purpose than MLPRegressor, and the correct way to evaluate its performance is by its loss function, the negative log-likelihood. In the last notebook commit I skipped training, so that's why you don't see the negative log-likelihood performance. The calculated $R^2$ assumes a regression prediction equal to the mean of the probability distributions, and that was not really the task of the model. In fact, thanks to having a probability distribution as output, it is possible to evaluate the "potential" vote for a player (assumed, by rule of thumb, to be mean + variance). So it makes sense to me that a neural network regressor outperformed the probability-distribution one on that metric, but the regressor is not able, for example, to distinguish that an inconsistent attacker is more likely to get a higher fanta-vote than a consistent defender.
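For concreteness, the "potential vote" idea can be sketched by sampling from a sinh-arcsinh distribution directly, via the Jones-Pewsey transform of a standard normal; this is a hand-rolled sketch, not the repo's network, and all parameter values below are invented for illustration:

```python
import numpy as np

def sample_sinh_arcsinh(loc, scale, skew, tail, size, rng):
    """Sample X = loc + scale * sinh((arcsinh(Z) + skew) / tail), Z ~ N(0, 1).

    This is the sinh-arcsinh (Jones-Pewsey) transform of a standard normal:
    skew shifts the distribution asymmetrically, tail < 1 fattens the tails.
    """
    z = rng.standard_normal(size)
    return loc + scale * np.sinh((np.arcsinh(z) + skew) / tail)

rng = np.random.default_rng(0)

# invented parameters: an inconsistent attacker vs a consistent defender
attacker = sample_sinh_arcsinh(6.3, 1.6, skew=0.4, tail=0.8, size=100_000, rng=rng)
defender = sample_sinh_arcsinh(6.4, 0.6, skew=0.0, tail=1.0, size=100_000, rng=rng)

def potential(votes):
    # "potential" fanta-vote as mean + variance, per the rule of thumb above
    return votes.mean() + votes.var()

# the attacker's upside dominates despite the similar average vote
print(potential(attacker) > potential(defender))  # True
```

A point regressor collapses both players to their (similar) means, whereas the distributional output preserves exactly the variance term that separates them here.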

P.S.: I wrote you an email with additional questions.

DiTo97 commented 1 year ago

I will answer quoting your points:

  1. Exactly! Outfield players' and goalkeepers' fanta-votes have very different distributions due to the different types of bonus they have in the game. Goalkeepers' votes tend to be less variable with respect to their average, so it makes sense that the $R^2$ correlation is better than for outfield players. However, I still consider these predictions worse when I look at them from my Fantasy Football player point of view, especially when ranking the goalkeepers on a single matchday. In other words, looking at the predictions without knowing anything about the model, as a player I would rely on what I've seen for outfield players, but I can't say the same for goalkeepers. Of course, this is not (yet) supported by data. Also, the clean sheet probability prediction was not that good.

Then we should look for some metric other than $R^2$, one that more closely matches your on-field perspective and feel for the performance of the two estimation models, even something as simple as the mean absolute error (MAE) of the sampled votes and fanta-votes w.r.t. the true ones. Of course, we could try different metrics depending on the degree of skewness (e.g., RMSLE).
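A minimal NumPy sketch of the two metrics (fanta-votes are assumed non-negative, which RMSLE requires; the vote values below are invented):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.abs(np.asarray(y_true) - np.asarray(y_pred)).mean()

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumes non-negative values)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# invented fanta-votes for a single matchday
true_votes = [6.5, 7.0, 5.5, 9.0]
pred_votes = [6.0, 7.5, 6.0, 7.0]

print(mae(true_votes, pred_votes))    # 0.875
print(rmsle(true_votes, pred_votes))  # log-scale error, penalises ratios
```

For the matchday-ranking concern specifically, a rank metric such as Spearman correlation computed per matchday would be closer still to the "who do I line up" question.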

  2. Initially, I absolutely needed to take past seasons into account for players who had not played many minutes in Serie A, since their current-season stats were not reliable. Going forward with the season, for most players, who had played more than 10 games, the weight of past seasons was basically nil. The coefficients used for averaging were not selected with any technique, but by rule of thumb. What technique would you suggest for a better selection?

The rationale seems sound, as the performance in the current season becomes the more informative variable as the season goes on, while at the beginning of the season we have little to no information besides past performance. This is similar to the cold-start problem in recommender systems. To better model the current form of each player, you could define a decay function that models the temporal dependency and the effect that each past performance has on the form.

For instance, the following decay function gives maximum weight (no decay) to the most recent events and decays older ones to zero over a one-week window, following a raised-cosine (sinusoidal) curve:

import numpy as np

T_secs_hour = 60 * 60
T_secs_day = T_secs_hour * 24
T_secs_week = T_secs_day * 7

@np.vectorize
def timestamp_decay(secs: int) -> float:
    """The timestamp decay function, valued in [0, 1].

    It weighs two subsequent events depending on how close in time they happened.
    """
    if secs < 0:
        raise ValueError("The interval must be non-negative")

    if secs > T_secs_week:
        return 0.0

    # half a cosine period over one week, mapped linearly from [-1, 1] to [0, 1]
    freq = np.pi / T_secs_week
    return 0.5 * (1.0 + np.cos(secs * freq))

The time decay function could itself be parametric (e.g., a neural network), but such a formulation is a good enough baseline.
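As a usage sketch, such a decay can feed a weighted average of past fanta-votes to quantify current form. The raised-cosine decay below has the same shape as the one above; the one-week horizon is illustrative and would likely be lengthened for weekly fixtures:

```python
import numpy as np

T_secs_day = 60 * 60 * 24
T_secs_week = T_secs_day * 7

def timestamp_decay(secs):
    """Raised-cosine decay in [0, 1]: 1 at secs = 0, 0 from one week onwards."""
    secs = np.asarray(secs, dtype=float)
    weight = 0.5 * (1.0 + np.cos(np.pi * secs / T_secs_week))
    return np.where(secs > T_secs_week, 0.0, weight)

def form_score(votes, ages_secs):
    """Decay-weighted average of past fanta-votes (recent games weigh most)."""
    return float(np.average(np.asarray(votes, dtype=float),
                            weights=timestamp_decay(ages_secs)))

# invented example: three past performances, 1, 3 and 6 days old
votes = [7.5, 6.0, 5.5]
ages = [1 * T_secs_day, 3 * T_secs_day, 6 * T_secs_day]
print(round(form_score(votes, ages), 2))  # ~6.87, pulled towards recent games
```

The same machinery would cover the past-seasons averaging: replace the hand-picked coefficients with decay weights evaluated at each season's age.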

An additional model for estimating performance when a player has a low number of games in the current season would be an interesting problem on its own. Even now that the new season is starting, it would be needed for players who haven't played in Serie A yet and only have stats available from other leagues.

I agree, and there would be many different ways to approach it, starting from getting play data from other competitions when none is available in Serie A for the current season.

Personally, in the absence of data I would use the fantaindex estimated by the fantaGOAT service, as it would save you from navigating thousands of features to model the cold-start problem, since they have already done so.

3/4) One thing that maybe is not clear, and it is a limitation of the features, is that for a given matchday the player and team stats are the ones averaged over the whole current season, not stats that take into account only the games up to that matchday. This is due to the kind of data I could scrape from FBRef. So there isn't really a temporal component in the features, and I agree it would be better if there were.

I am not familiar with the FBRef service, but is it a matter of how they present the data or of how your scraping code is constructed? Anyhow, I know a few other sources that provide data at this granularity, which would greatly help with such a time-series problem.
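To illustrate what that granularity enables (hypothetical column names, not FBRef's actual fields): with per-match rows, a leak-free up-to-matchday average is a one-liner, an expanding mean shifted by one game so that each row only sees prior matches:

```python
import pandas as pd

# hypothetical per-match stats for one player (column names invented)
df = pd.DataFrame({
    "player": ["X", "X", "X", "X"],
    "matchday": [1, 2, 3, 4],
    "shots": [2, 4, 1, 3],
})

# Expanding mean of past games only: shift(1) excludes the current match,
# so the matchday-k feature uses matchdays 1..k-1 (no target leakage).
df["shots_avg_to_date"] = (
    df.groupby("player")["shots"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df["shots_avg_to_date"].tolist())  # [nan, 2.0, 3.0, 2.333...]
```

The first matchday has no history (NaN), which is exactly where the past-seasons fallback from point 2 would kick in.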