uPeppe / fantabeto

Machine learning model for predicting Serie A players' performance in a match, in terms of Fantacalcio (Italian fantasy football) scores.
MIT License

GKs performance estimation? #1

Open DiTo97 opened 1 year ago

DiTo97 commented 1 year ago

Hi @uPeppe,

In the accompanying blog post you stated that the performance estimation model was good on outfield players, but subpar on goalkeepers (GKs). Have you investigated the root causes of this phenomenon any further?

uPeppe commented 1 year ago

Hello @DiTo97,

I haven't done further analysis on that, but the problem with goalkeepers is that there is only one per team (of 20), so the dataset is made up of 20 groups of entries, where the N entries in each group share the same "player stats" and "team stats" parts and differ only in the "opposing team stats" part.
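To make that group structure concrete, here is a toy sketch; the feature names and values are invented for illustration, not the repo's actual schema:

```python
# Toy sketch: three matches of the same goalkeeper form one "group".
# The player-stats and team-stats features are constant within the group,
# and only the opposing-team stats vary (all values invented).
group = [
    # (gk_save_pct, team_xg_against, opp_xg)
    (0.71, 1.1, 1.8),
    (0.71, 1.1, 0.9),
    (0.71, 1.1, 1.3),
]

player_team_part = {row[:2] for row in group}
opponent_part = {row[2] for row in group}

print(len(player_team_part))  # 1: no variation for the model to learn from
print(len(opponent_part))     # 3: the only source of variation
```

Within each group, only a third of the feature vector actually carries signal, which is much less information per row than the model sees for outfield players.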

My guess is that more attempts should be made at selecting a better network structure, and more effort should be put into selecting fewer, more relevant features for goalkeepers.

DiTo97 commented 1 year ago

It took me some time to understand everything, but after digging inside each notebook I think I have a clearer view of your work now.

IIUC, your suspicion is that overfitting is the root cause of the problem: the number of goalkeepers is much smaller than the number of outfield players, which makes for a dataset with many rows that are very similar, or even identical, for a good portion of the features (the player and team features you were mentioning).

This could absolutely be true, but I have a few more concerns after analysing the notebooks, culminating in the fact that the GK model seems to be the better-performing of the two.

I will list them as points so that you can address each of them individually, albeit without reporting the name of the corresponding notebook, as I honestly do not remember them all.

uPeppe commented 1 year ago

1) Exactly! Outfield players' and goalkeepers' fanta-votes have very different distributions due to the different types of bonus they have in the game. Goalkeepers' votes tend to be less variable with respect to their average, so it makes sense that the $R^2$ correlation is better than for outfield players. However, I still consider these predictions worse when I look at them from my Fantasy Football player point of view, especially when ranking the goalkeepers on a single matchday. In other words, looking at the predictions without knowing anything about the model, as a player I would rely on what I've seen for outfield players, but I can't say the same for goalkeepers. Of course, this is not (yet) supported by data. Also, the clean sheet probability prediction was not that good.

2) Initially, I absolutely needed to take past seasons into account for players who had not played many minutes in Serie A, since their current-season stats were not reliable. Going forward with the season, for most players, who had played more than 10 games, the weight of past seasons was basically nil. The coefficients used for averaging were not selected with any technique, but by rule of thumb. What technique would you suggest for a better selection? An additional model for estimating performance when a player has a low number of games in the current season would be an interesting problem on its own. Even now that the new season is starting, it would be needed for players who haven't played in Serie A yet and only have stats available from other leagues.

3/4) One thing that maybe is not clear, and it is a limitation of the features, is that for a given matchday the player and team stats are the ones averaged over the whole current season, not stats that take into account only the games up to that matchday. This is due to the kind of data I could scrape from FBRef. So there isn't really a temporal component in the features, and I agree it would be better if there were.

5) Partially answered in point 1. Also, one thing to add is that the sinh-arcsinh network learns a distribution, not a point regression prediction. It therefore has a different purpose than MLPRegressor, and the correct way to evaluate its performance is by its loss function, the negative log-likelihood. In the last notebook commit I skipped training, so that's why you don't see the negative log-likelihood performance. The calculated $R^2$ assumes a regression prediction equal to the mean of the probability distributions, and that was not really the task of the model. In fact, thanks to having a probability distribution as output, it is possible to evaluate the "potential" vote for a player (assumed, by rule of thumb, to be mean + variance). So it makes sense to me that a neural network regressor outperformed the probability-distribution one on that metric, but the regressor is not able, for example, to distinguish that an inconsistent attacker is more likely to get a higher fanta-vote than a consistent defender.
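For concreteness, the "potential vote" idea can be sketched by sampling from a sinh-arcsinh distribution directly, via the Jones-Pewsey transform of a standard normal; this is a hand-rolled sketch, not the repo's network, and all parameter values below are invented for illustration:

```python
import numpy as np

def sample_sinh_arcsinh(loc, scale, skew, tail, size, rng):
    """Sample X = loc + scale * sinh((arcsinh(Z) + skew) / tail), Z ~ N(0, 1).

    This is the sinh-arcsinh (Jones-Pewsey) transform of a standard normal:
    skew shifts the distribution asymmetrically, tail < 1 fattens the tails.
    """
    z = rng.standard_normal(size)
    return loc + scale * np.sinh((np.arcsinh(z) + skew) / tail)

rng = np.random.default_rng(0)

# invented parameters: an inconsistent attacker vs a consistent defender
attacker = sample_sinh_arcsinh(6.3, 1.6, skew=0.4, tail=0.8, size=100_000, rng=rng)
defender = sample_sinh_arcsinh(6.4, 0.6, skew=0.0, tail=1.0, size=100_000, rng=rng)

def potential(votes):
    # "potential" fanta-vote as mean + variance, per the rule of thumb above
    return votes.mean() + votes.var()

# the attacker's upside dominates despite the similar average vote
print(potential(attacker) > potential(defender))  # True
```

A point regressor collapses both players to their (similar) means, whereas the distributional output preserves exactly the variance term that separates them here.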

P.S.: I wrote you an email with additional questions.

DiTo97 commented 1 year ago

I will answer quoting your points:

  1. Exactly! Outfield players' and goalkeepers' fanta-votes have very different distributions due to the different types of bonus they have in the game. Goalkeepers' votes tend to be less variable with respect to their average, so it makes sense that the $R^2$ correlation is better than for outfield players. However, I still consider these predictions worse when I look at them from my Fantasy Football player point of view, especially when ranking the goalkeepers on a single matchday. In other words, looking at the predictions without knowing anything about the model, as a player I would rely on what I've seen for outfield players, but I can't say the same for goalkeepers. Of course, this is not (yet) supported by data. Also, the clean sheet probability prediction was not that good.

Then we should look for some metric other than $R^2$, one that more closely matches your on-field perspective and feel for the performance of the two estimation models, even something as simple as the mean absolute error (MAE) of the sampled votes and fanta-votes w.r.t. the true ones. Of course, we could try different metrics depending on the degree of skewness (e.g., RMSLE).
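A minimal NumPy sketch of the two metrics (fanta-votes are assumed non-negative, which RMSLE requires; the vote values below are invented):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.abs(np.asarray(y_true) - np.asarray(y_pred)).mean()

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumes non-negative values)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# invented fanta-votes for a single matchday
true_votes = [6.5, 7.0, 5.5, 9.0]
pred_votes = [6.0, 7.5, 6.0, 7.0]

print(mae(true_votes, pred_votes))    # 0.875
print(rmsle(true_votes, pred_votes))  # log-scale error, penalises ratios
```

For the matchday-ranking concern specifically, a rank metric such as Spearman correlation computed per matchday would be closer still to the "who do I line up" question.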

  2. Initially, I absolutely needed to take past seasons into account for players who had not played many minutes in Serie A, since their current-season stats were not reliable. Going forward with the season, for most players, who had played more than 10 games, the weight of past seasons was basically nil. The coefficients used for averaging were not selected with any technique, but by rule of thumb. What technique would you suggest for a better selection?

The rationale seems sound, as the performance in the current season becomes the more informative variable as the season goes on, while at the beginning of the season we have little to no information besides past performance. This is similar to the cold-start problem in recommender systems. To better model the current form of each player, you could define a decay function that models the temporal dependency and the effect that each past performance has on the form.

For instance, the following decay function gives maximum weight (no decay) to the most recent events and decays older ones to zero over a one-week window, following a raised-cosine (sinusoidal) curve:

import numpy as np

T_secs_hour = 60 * 60
T_secs_day = T_secs_hour * 24
T_secs_week = T_secs_day * 7

@np.vectorize
def timestamp_decay(secs: int) -> float:
    """The timestamp decay function, valued in [0, 1].

    It weighs two subsequent events depending on how close in time they happened.
    """
    if secs < 0:
        raise ValueError("The interval must be non-negative")

    if secs > T_secs_week:
        return 0.0

    # half a cosine period over one week, mapped linearly from [-1, 1] to [0, 1]
    freq = np.pi / T_secs_week
    return 0.5 * (1.0 + np.cos(secs * freq))

The time decay function could itself be parametric (e.g., a neural network), but such a formulation is a good enough baseline.
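As a usage sketch, such a decay can feed a weighted average of past fanta-votes to quantify current form. The raised-cosine decay below has the same shape as the one above; the one-week horizon is illustrative and would likely be lengthened for weekly fixtures:

```python
import numpy as np

T_secs_day = 60 * 60 * 24
T_secs_week = T_secs_day * 7

def timestamp_decay(secs):
    """Raised-cosine decay in [0, 1]: 1 at secs = 0, 0 from one week onwards."""
    secs = np.asarray(secs, dtype=float)
    weight = 0.5 * (1.0 + np.cos(np.pi * secs / T_secs_week))
    return np.where(secs > T_secs_week, 0.0, weight)

def form_score(votes, ages_secs):
    """Decay-weighted average of past fanta-votes (recent games weigh most)."""
    return float(np.average(np.asarray(votes, dtype=float),
                            weights=timestamp_decay(ages_secs)))

# invented example: three past performances, 1, 3 and 6 days old
votes = [7.5, 6.0, 5.5]
ages = [1 * T_secs_day, 3 * T_secs_day, 6 * T_secs_day]
print(round(form_score(votes, ages), 2))  # ~6.87, pulled towards recent games
```

The same machinery would cover the past-seasons averaging: replace the hand-picked coefficients with decay weights evaluated at each season's age.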

An additional model for estimating performance when a player has a low number of games in the current season would be an interesting problem on its own. Even now that the new season is starting, it would be needed for players who haven't played in Serie A yet and only have stats available from other leagues.

I agree, and there would be many different ways to approach it, starting from getting play data from other competitions when none is available in Serie A for the current season.

Personally, in the absence of data I would use the fantaindex estimated by the fantaGOAT service, as it would save you from navigating thousands of features to model the cold-start problem, since they have already done so.

3/4) One thing that maybe is not clear, and it is a limitation of the features, is that for a given matchday the player and team stats are the ones averaged over the whole current season, not stats that take into account only the games up to that matchday. This is due to the kind of data I could scrape from FBRef. So there isn't really a temporal component in the features, and I agree it would be better if there were.

I am not familiar with the FBRef service, but is it a matter of how they present the data or of how your scraping code is constructed? Anyhow, I know a few other sources that provide data at this granularity, which would greatly help with such a time-series problem.
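To illustrate what that granularity enables (hypothetical column names, not FBRef's actual fields): with per-match rows, a leak-free up-to-matchday average is a one-liner, an expanding mean shifted by one game so that each row only sees prior matches:

```python
import pandas as pd

# hypothetical per-match stats for one player (column names invented)
df = pd.DataFrame({
    "player": ["X", "X", "X", "X"],
    "matchday": [1, 2, 3, 4],
    "shots": [2, 4, 1, 3],
})

# Expanding mean of past games only: shift(1) excludes the current match,
# so the matchday-k feature uses matchdays 1..k-1 (no target leakage).
df["shots_avg_to_date"] = (
    df.groupby("player")["shots"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df["shots_avg_to_date"].tolist())  # [nan, 2.0, 3.0, 2.333...]
```

The first matchday has no history (NaN), which is exactly where the past-seasons fallback from point 2 would kick in.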