martineastwood / penaltyblog

Library from http://pena.lt/y/blog for modelling and working with football (soccer) data
http://pena.lt/y/blog
MIT License

Weird results when training Dixon model #7

Open monokizsolt opened 1 year ago

monokizsolt commented 1 year ago

Hi,

I have noticed a dramatic difference in prediction results when training the Dixon-Coles model on almost the same amount of data. Training with the first 99 rows outputs this:

Home Win: 0.4901944888036056
Draw: 0.4236429709276788
Away Win: 0.08616254025982717

But training with the first 100 rows outputs this, which even includes a negative probability:

Home Win: 0.37407906289002624
Draw: 0.6979058936975158
Away Win: -0.07198495669064632

I have prepared a small script to demonstrate this:

```python
import penaltyblog as pb

fb = pb.scrapers.FootballData("GRC Super League", "2022-2023")

# Train with 99
df = fb.get_fixtures().iloc[:99]
print(df)
weight = pb.models.dixon_coles_weights(df["date"], 0.001)
clf = pb.models.DixonColesGoalModel(
    df["goals_home"], df["goals_away"], df["team_home"], df["team_away"], weight
)
clf.fit()

print(clf)
print(clf.predict("Olympiakos", "Asteras Tripolis"))

# Train with 100
df = fb.get_fixtures().iloc[:100]
print(df)
weight = pb.models.dixon_coles_weights(df["date"], 0.001)
clf = pb.models.DixonColesGoalModel(
    df["goals_home"], df["goals_away"], df["team_home"], df["team_away"], weight
)
clf.fit()

print(clf)
print(clf.predict("Olympiakos", "Asteras Tripolis"))
```

I could not work out why this happens. Could you take a look? Thanks, Zsolt

martineastwood commented 1 year ago

Thanks Zsolt. It looks like the optimiser is settling on a value for rho that breaks Dixon and Coles' adjustment factor, which is where the negative probability comes from. I suspect this is because you're using quite a small amount of data, so the model is not converging well and the optimiser's output is volatile.
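To illustrate how a badly fitted rho can produce a negative probability, here is a minimal sketch of the Dixon and Coles low-score adjustment in plain Python. It is not penaltyblog's implementation: the function names, the goal rates (`lam` for the home side, `mu` for the away side) and the rho values are all assumptions chosen for illustration.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def tau(home_goals, away_goals, lam, mu, rho):
    """Dixon and Coles adjustment factor for the four low-scoring results."""
    if home_goals == 0 and away_goals == 0:
        return 1 - lam * mu * rho
    if home_goals == 0 and away_goals == 1:
        return 1 + lam * rho
    if home_goals == 1 and away_goals == 0:
        return 1 + mu * rho
    if home_goals == 1 and away_goals == 1:
        return 1 - rho
    return 1.0


def outcome_probs(lam, mu, rho, max_goals=15):
    """Sum adjusted scoreline probabilities into (home win, draw, away win)."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = tau(h, a, lam, mu, rho) * poisson_pmf(h, lam) * poisson_pmf(a, mu)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Illustrative (made-up) goal rates: a strong home side against a weak away side.
print(outcome_probs(2.0, 0.5, rho=-0.1))  # a modest rho keeps every probability non-negative
print(outcome_probs(2.0, 0.5, rho=-3.0))  # an extreme rho makes tau(0, 1) negative
```

In the second call the 0-1 scoreline gets a negative weight, so the away-win probability goes below zero while the three outcomes still sum to roughly 1, which matches the pattern in the numbers reported above.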

Adding in the previous season's data as well helps the model converge better.

```python
import pandas as pd
import penaltyblog as pb

# Two seasons combined, dropping the last two fixtures
df = pd.concat(
    [
        pb.scrapers.FootballData("GRC Super League", "2021-2022").get_fixtures(),
        pb.scrapers.FootballData("GRC Super League", "2022-2023").get_fixtures(),
    ]
)[:-2]

weight = pb.models.dixon_coles_weights(df["date"], 0.001)
clf = pb.models.DixonColesGoalModel(
    df["goals_home"], df["goals_away"], df["team_home"], df["team_away"], weight
)
clf.fit()

print(clf)
print(clf.predict("Olympiakos", "Asteras Tripolis"))

# Same again with one more fixture, dropping only the last one
df = pd.concat(
    [
        pb.scrapers.FootballData("GRC Super League", "2021-2022").get_fixtures(),
        pb.scrapers.FootballData("GRC Super League", "2022-2023").get_fixtures(),
    ]
)[:-1]

weight = pb.models.dixon_coles_weights(df["date"], 0.001)
clf = pb.models.DixonColesGoalModel(
    df["goals_home"], df["goals_away"], df["team_home"], df["team_away"], weight
)
clf.fit()

print(clf)
print(clf.predict("Olympiakos", "Asteras Tripolis"))
```

I'll look into adding constraints on the values rho is allowed to take, to help minimise this in the future.
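As a rough idea of what such a constraint could look like, the sketch below computes the interval of rho values for which every Dixon and Coles adjustment factor stays non-negative, then clips a fitted rho into it. This is only an assumption about one possible approach, not what penaltyblog does: the helper names `admissible_rho_interval` and `clip_rho` are hypothetical and the goal rates are made up.

```python
def admissible_rho_interval(rates):
    """Return (lower, upper) bounds on rho that keep every adjustment factor >= 0.

    `rates` is a list of (lam, mu) pairs, one per match, where lam is the
    expected home goals and mu the expected away goals.
    """
    # 1 + lam * rho >= 0 and 1 + mu * rho >= 0 give the lower bound.
    lower = max(max(-1 / lam, -1 / mu) for lam, mu in rates)
    # 1 - lam * mu * rho >= 0 and 1 - rho >= 0 give the upper bound.
    upper = min(min(1 / (lam * mu), 1.0) for lam, mu in rates)
    return lower, upper


def clip_rho(rho, rates):
    """Force rho into the admissible interval for the given matches."""
    lower, upper = admissible_rho_interval(rates)
    return min(max(rho, lower), upper)


# Made-up goal rates for three matches, purely for illustration.
rates = [(2.0, 0.5), (1.2, 1.1), (0.8, 1.6)]
print(admissible_rho_interval(rates))
print(clip_rho(-3.0, rates))  # an out-of-range value is pulled back to the lower bound
print(clip_rho(0.1, rates))   # an in-range value is left unchanged
```

In a real fit the goal rates change as the optimiser updates the team parameters, so a fixed conservative range passed to the optimiser (for example through the `bounds` argument of `scipy.optimize.minimize`) may be a simpler option than clipping after the fact.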