statsbomb / open-data

Free football data from StatsBomb
https://statsbomb.com/resource-centre/

Documentation mismatch #3

Closed Justice4Joffrey closed 5 years ago

Justice4Joffrey commented 6 years ago

Some fields use hyphens instead of underscores for variable names and certain fields (e.g. 'off_camera') aren't described at all.

JoGall commented 5 years ago

I'd also really like to know what variables like density, density.incone, AngleDeviation, Shot5, Shot6, etc... mean, and whether variables like DistToGoal and DistToKeeper are given in metres or arbitrary pitch units.

ElSaico commented 5 years ago

Those fields described by @JoGall are generated by the data cleaning functions of https://github.com/statsbomb/StatsBombR.

I got a decent understanding of them because I'm finishing porting them all to Python (check out https://github.com/ElSaico/pyStatsBomb in the next few days; I'll owe you all the API functionality because I lack the necessary $resources$ to access it).

Shots

Shot5, Shot6, etc. seem to be earlier glitches from importing that already got fixed: https://github.com/statsbomb/StatsBombR/commit/2e386478ac7397265a568370b7639aaf02330856

All distance variables use the same unit as the positions, i.e. they're scaled to a 120x80 pitch. DistToGoal is exactly what it implies, but DistToKeeper refers, counter-intuitively, to the distance between keeper and goal (!). The distance between shot and keeper is in DistSGK.
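A minimal Python sketch of how the three distances relate, not the StatsBombR code itself. It assumes the goal centre sits at (120, 40) on the 120x80 pitch and reads DistSGK as the shooter-to-goalkeeper distance; the helper names `distance` and `shot_distances` are made up for illustration.

```python
import math

# Assumption: centre of the attacked goal on a 120x80 pitch.
GOAL_CENTRE = (120.0, 40.0)


def distance(a, b):
    """Euclidean distance between two (x, y) points, in pitch units."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def shot_distances(shot_xy, keeper_xy):
    """Return the three distance variables for one shot (hypothetical helper)."""
    return {
        # shot location to the centre of the goal
        "DistToGoal": distance(shot_xy, GOAL_CENTRE),
        # goalkeeper to the centre of the goal (not shooter to keeper!)
        "DistToKeeper": distance(keeper_xy, GOAL_CENTRE),
        # assumed reading: shooter to goalkeeper
        "DistSGK": distance(shot_xy, keeper_xy),
    }


# Shot from the penalty spot area with the keeper two units off the line.
d = shot_distances((108.0, 40.0), (118.0, 40.0))
print(d)
```

With the keeper standing close to the line, DistToKeeper comes out small regardless of where the shot is taken from, which is why it can look lower than expected.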

All angular variables are in degrees. AngleToGoal and AngleToKeeper are the opening angles formed by DistToGoal and DistToKeeper, respectively, while AngleDeviation is the opening angle between both.

Freeze frames

density and density.incone are both described in the README:

  • Density is calculated as the aggregated inverse distance for each defender behind the ball.
  • Density in the cone is the density filtered for only defenders who are in the cone between the shooter, and each goal post.
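One way to read those two definitions, as a rough Python sketch rather than the actual StatsBombR implementation. The assumptions are mine: "behind the ball" means a larger x than the shooter, the inverse distance is measured from the shooter, the goal posts are at (120, 36) and (120, 44), and the goalkeeper has already been removed from the list of defenders. The function names are hypothetical.

```python
import math

# Assumption: goal post coordinates on a 120x80 pitch.
GOAL_POSTS = ((120.0, 36.0), (120.0, 44.0))


def _cross(o, a, b):
    """z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_cone(shooter, point):
    """True if point lies inside the triangle shooter / post / post."""
    p1, p2 = GOAL_POSTS
    signs = (_cross(shooter, p1, point),
             _cross(p1, p2, point),
             _cross(p2, shooter, point))
    has_neg = any(s < 0 for s in signs)
    has_pos = any(s > 0 for s in signs)
    # Inside (or on an edge) when all signs agree.
    return not (has_neg and has_pos)


def density(shooter, defenders, cone_only=False):
    """Sum of 1/distance over defenders behind the ball (assumed meaning)."""
    total = 0.0
    for d in defenders:
        if d[0] <= shooter[0]:
            continue  # not between the ball and the goal line
        if cone_only and not in_cone(shooter, d):
            continue  # outside the shooter-to-posts cone
        total += 1.0 / math.hypot(d[0] - shooter[0], d[1] - shooter[1])
    return total


shooter = (100.0, 40.0)
defenders = [(110.0, 40.0), (105.0, 60.0), (90.0, 40.0)]  # keeper excluded
dens = density(shooter, defenders)
dens_cone = density(shooter, defenders, cone_only=True)
print(dens, dens_cone)
```

In this example only the first defender is in the cone, the second is behind the ball but wide of it, and the third is ignored because he is not behind the ball.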

The other variables are:

All variables exclude the defending goalkeeper, except obviously for InCone.GK

Time

All extra time-related variables are in milliseconds and seem to have pretty descriptive names.

JoGall commented 5 years ago

Thanks for taking the time for such a detailed reply @ElSaico!

I thought DistToKeeper was much lower than expected, so I wondered if it was given in an unexpected unit of measurement. That makes more sense! For anyone else reading, DistToKeeper is the distance from the GK to the centre of the goal (not the nearest part of the goal line).

I didn't notice density and density.incone in the documentation when I first looked -- seems they'd be very useful for xG models. I haven't seen several of the other variables (e.g. DistSGK, AttackersBehindBall, DefArea) as I don't think they're available in the free data but good to know.

Good luck with pyStatsBomb and making the data accessible to more people!

deepxg commented 5 years ago

At some point we'll tidy up StatsBombR and document the inner workings of @YamStats's brain, but for the most part it's provided as is to give people a bit of a leg up using the data. Happy to see issues raised in the other repo for any other improvements. In the meantime, the docs have been updated today, so there shouldn't be anything in the raw data that's not covered now.