martj42 / international_results

https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017
Creative Commons Zero v1.0 Universal

Start thinking about adding goalscorers #11

Closed martj42 closed 1 year ago

martj42 commented 3 years ago

Perhaps in a separate file. Perhaps creating some sort of a match id. Perhaps also a player id.

There are around 125k goals to account for, so this might take some time. Will need to look into existing databases that are usable. Probably Wikipedia; RSSSF might have most of the data but doesn't have consistent formatting. Neither does Wikipedia, to be fair. I wonder if I can plug into any commercial feeds for at least the data from recent years...

m-rossini commented 3 years ago

Ideally it would be a timeline (Player A scored at 45 min, and so on) as a different file, linked with a match id. I think there could be a player id, but I do not think it is needed unless there is a desire to start keeping player data as well.

martj42 commented 3 years ago

Yeah, maybe. Just thinking there must be some repeat names across countries and within countries as well. So if a John Smith scored a goal for Barbados, England in the 1920s, and England in the 1990s, I wonder what the best way to make it clear that it's 3 different people would be.

AlunHewinson commented 3 years ago

You're going to want to create a goalscorer table with an arbitrary ID for each player you add there. Then add a match_id to the match table. Then do a linking table called something like "match_scorers".

The reason you should do it this way is because it's going to get messy with that small subset of players who have scored for more than one country. For example, Saman Ghoddos for Sweden and Iran (for both countries in the same calendar year!)

You'll also need to think about how you represent which country benefitted from the goal. E.g. 2021-07-02 Switzerland v Spain. Denis Zakaria was playing for Switzerland but his OG counts for Spain. Do you store the player's team, the team that benefitted from the goal, or both? Clearly an OG flag would be desirable whichever way you do it.

A penalty flag is also desirable.

Lastly, a "period" indicator might be a useful thing (1st half, 2nd half, 1st half ET, 2nd half ET, penalty shootout?)
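A minimal sketch of the three-table layout described above, using Python's built-in sqlite3 module. Apart from match_id and match_scorers, every table and column name here (players, matches, player_team, credited_to, and so on) is an assumption made for illustration, not something the repository defines. It stores both the player's team and the team that benefitted, plus the OG, penalty and period flags.

```python
import sqlite3

# Hypothetical schema: names other than match_id and match_scorers are assumptions.
SCHEMA = """
CREATE TABLE players (
    player_id   INTEGER PRIMARY KEY,   -- arbitrary ID, one per person
    player_name TEXT NOT NULL
);
CREATE TABLE matches (
    match_id  INTEGER PRIMARY KEY,
    date      TEXT NOT NULL,
    home_team TEXT NOT NULL,
    away_team TEXT NOT NULL
);
CREATE TABLE match_scorers (
    match_id    INTEGER NOT NULL REFERENCES matches(match_id),
    player_id   INTEGER NOT NULL REFERENCES players(player_id),
    player_team TEXT NOT NULL,              -- team the player was playing for
    credited_to TEXT NOT NULL,              -- team that benefitted from the goal
    own_goal    INTEGER NOT NULL DEFAULT 0,
    penalty     INTEGER NOT NULL DEFAULT 0,
    period      INTEGER                     -- 1 = 1st half, 2 = 2nd half, 3/4 = extra time
);
"""


def build_example_db():
    """Create an in-memory database holding the own-goal example from this comment."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO players VALUES (1, 'Denis Zakaria')")
    conn.execute(
        "INSERT INTO matches VALUES (1, '2021-07-02', 'Switzerland', 'Spain')"
    )
    # Zakaria played for Switzerland, but the own goal counts for Spain.
    conn.execute(
        "INSERT INTO match_scorers VALUES (1, 1, 'Switzerland', 'Spain', 1, 0, 1)"
    )
    conn.commit()
    return conn


def goals_credited_to(conn, team):
    """Return (player_name, player_team, own_goal) for goals counting for `team`."""
    return conn.execute(
        "SELECT p.player_name, s.player_team, s.own_goal "
        "FROM match_scorers s JOIN players p USING (player_id) "
        "WHERE s.credited_to = ?",
        (team,),
    ).fetchall()
```

Because a player row carries no team, the same player_id can appear in match_scorers with different player_team values, which would cover players who have scored for more than one country.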

I will be able to help with this when things have calmed down for me at work, but I've just joined a new client so for now this is a brain dump of ideas.

martj42 commented 3 years ago

If I were to get the ball rolling right now, I think I'd have the following columns:

match_id
minute
team
player_id (or just player_name)
own_goal
penalty (Not sure about penalties - ultimately who cares if a goal was a penalty or not)

Penalty shootouts I'd ignore like they're ignored in the results file (and also ignored in official goal counting). Though maybe the results file should have something added to show who won a penalty shootout. Right now, for example, it's not clear who won the 94 World Cup.

Halves wouldn't be needed if we have minutes. If two goals were scored in the same minute I suppose the order in the file could show which was first (don't feel too good about having data in row order but whatevs). Goals scored in stoppage time could also just be 45 and 90 instead of, say, 45+3 to keep the column numeric.

I think the player_id for Saman Ghoddos should be the same one for Sweden and Iran regardless. Because it is the same person, not two people with the same name.

Anyway, in general, the data should be as minimalistic as possible to keep collection as easy as possible, which is why I'm leaning towards no penalty flag or added minutes.

So, probably 3 files: results.csv, ids.csv and goals.csv.
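As a rough sketch of how a goals.csv with those columns could be read, here is one way to do it with Python's csv module. The sample rows are made up for illustration (they are not real matches), and the comma delimiter and TRUE/FALSE spelling are assumptions. The sort is stable, so two goals in the same minute keep their order from the file, which is the "data in row order" caveat mentioned above.

```python
import csv
import io

# Hypothetical sample rows: team names, IDs and minutes are invented for illustration.
GOALS_CSV = """match_id,minute,team,player_id,own_goal,penalty
1,45,Team A,10,FALSE,FALSE
1,12,Team B,20,FALSE,FALSE
1,45,Team B,21,FALSE,TRUE
"""


def load_goals(text):
    """Parse goals.csv-style text into dicts with typed fields."""
    goals = []
    for row in csv.DictReader(io.StringIO(text)):
        goals.append(
            {
                "match_id": int(row["match_id"]),
                "minute": int(row["minute"]),
                "team": row["team"],
                "player_id": int(row["player_id"]),
                "own_goal": row["own_goal"] == "TRUE",
                "penalty": row["penalty"] == "TRUE",
            }
        )
    return goals


def in_match_order(goals):
    """Sort by match and minute; sorted() is stable, so same-minute
    goals stay in the order they appear in the file."""
    return sorted(goals, key=lambda g: (g["match_id"], g["minute"]))


ordered = in_match_order(load_goals(GOALS_CSV))
print([g["player_id"] for g in ordered])
```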

m-rossini commented 3 years ago

I think there is no need for separate own_goal and penalty columns (there is no way a penalty is an OG); maybe a single type column instead? Also, the shootout score would be interesting as additional columns in the main file.

Minutes are kind of tricky... Imagine a goal was scored at 2 min of the 2nd half. How would it be recorded? 47 minutes? But what if a goal was also scored in the same game at minute 49 of the first half (due to stoppage time)? I think it would be more precise to have quarters (1, 2, 3, 4 for 1st half, 2nd half, 1st half of overtime, 2nd half of overtime), plus minutes. This would allow a timeline to be easily built.

Is there any case of two players with the same name scoring in the same match for opposing teams? Like Joe Doe for Team A and Joe Doe for Team B, both scoring in the same match?

I understand keeping things lean and simple, but if goals are added (great addition BTW) I think it should be possible to clearly indicate the scorer and team for each goal and to build a timeline. Of course it depends on the reason goals would be introduced, which makes me ask: what are the benefits of adding goals and scorers?

martj42 commented 3 years ago

The 'industry standard' is to call everything in the first-half stoppage time minute 45. Or, 45+4 so there isn't an overlap with the second half. Going with just 45 is easier and there probably isn't that great historic record-keeping with the added minutes.
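A small sketch of the "just 45" convention, assuming source data that writes stoppage-time goals as strings like 45+4. The function name clamp_minute is hypothetical. It simply drops the added minutes so the column stays numeric.

```python
def clamp_minute(raw):
    """Turn '45+4' into 45 and '67' into 67 by discarding added time.

    Hypothetical helper: assumes the input is either a plain minute
    or 'base+added' as commonly written in match reports.
    """
    base = raw.strip().split("+", 1)[0]
    return int(base)


print(clamp_minute("45+4"), clamp_minute("90+2"), clamp_minute("67"))
```

The trade-off is that the order of several stoppage-time goals in the same half is no longer recoverable from the minute alone.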

AlunHewinson commented 3 years ago

> Right now, for example, it's not clear who won the 94 World Cup.

Baggio's penalty still hasn't landed. A winner will be declared once it does ;)

m-rossini commented 3 years ago

> The 'industry standard' is to call everything in the first-half stoppage time minute 45. Or, 45+4 so there isn't an overlap with the second half. Going with just 45 is easier and there probably isn't that great historic record-keeping with the added minutes.

I contest this 'industry standard'. This is in Europe. In South America no one says a goal was scored at 70 minutes. We say it was scored at 25 minutes of the second half. Maybe it is a bias, but I find splitting the match into halves way easier to understand and to build a timeline from.
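To make the two conventions concrete, here is a hedged sketch converting between a running minute (70) and a period plus minute-in-period (2nd half, 25). The period numbering 1 to 4 follows the quarters idea from earlier in the thread; the function names and the 45/45/15/15 period lengths are assumptions for illustration.

```python
# Running minute at which each period starts (assumes 45/45/15/15 periods).
PERIOD_STARTS = {1: 0, 2: 45, 3: 90, 4: 105}


def to_period_minute(running_minute):
    """Convert a running minute (1-120) to (period, minute within period)."""
    if not 1 <= running_minute <= 120:
        raise ValueError("running minute must be between 1 and 120")
    for period in (4, 3, 2, 1):
        if running_minute > PERIOD_STARTS[period]:
            return period, running_minute - PERIOD_STARTS[period]


def to_running_minute(period, minute):
    """Convert (period, minute within period) back to a running minute."""
    return PERIOD_STARTS[period] + minute


print(to_period_minute(70))      # 2nd half, 25 minutes
print(to_running_minute(1, 49))  # 1st-half stoppage time
print(to_running_minute(2, 4))   # early 2nd half, same running minute
```

The last two calls give the same running minute, which is the overlap problem: the (period, minute) pair keeps a first-half stoppage-time goal distinct from an early second-half goal, while a single running-minute column does not.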

m-rossini commented 2 years ago

We can start with something like this, using information from the timeline tab at: https://www.fifa.com/tournaments/mens/worldcup/2018russia/match-center/300331552

World Cup 2018 Final France vs Croatia

time line file
match id; period; minute; player id; penalty; own goal
N1; 1; 18; p1; false; true
N1; 1; 28; p2; false; false
N1; 1; 38; p3; false; false
N1; 2; 14; p4; false; false
N1; 2; 20; p5; false; false
N1; 2; 24; p1; false; false

match id is a new field we need to add to matches file

player file
player id; player name; team
P1; MANDŽUKIĆ; Croatia
P2; PERIŠIĆ; Croatia
P3; GRIEZMANN; France
P4; POGBA; France
P5; MBAPPE; France
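A sketch of how these two files could be joined into a timeline with a running score, assuming the semicolon-separated layout shown in this comment. The function names are hypothetical. Two assumptions are worth flagging: player ids are compared case-insensitively (the timeline uses p1, the player file P1), and an own goal is credited to the other team in the match.

```python
import csv
import io

TIMELINE = """match id; period; minute; player id; penalty; own goal
N1; 1; 18; p1; false; true
N1; 1; 28; p2; false; false
N1; 1; 38; p3; false; false
N1; 2; 14; p4; false; false
N1; 2; 20; p5; false; false
N1; 2; 24; p1; false; false
"""

PLAYERS = """player id; player name; team
P1; MANDŽUKIĆ; Croatia
P2; PERIŠIĆ; Croatia
P3; GRIEZMANN; France
P4; POGBA; France
P5; MBAPPE; France
"""


def read_semicolon_table(text):
    """Parse 'a; b; c' rows into dicts keyed by the header row."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    rows = [[cell.strip() for cell in row] for row in reader if row]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def build_timeline(timeline_text, players_text, teams):
    """Return (period, minute, scorer, credited team, score so far) per goal."""
    # Assumption: ids match case-insensitively (p1 in one file, P1 in the other).
    players = {
        p["player id"].upper(): p for p in read_semicolon_table(players_text)
    }
    score = {team: 0 for team in teams}
    events = []
    for goal in read_semicolon_table(timeline_text):
        player = players[goal["player id"].upper()]
        if goal["own goal"] == "true":
            # Assumption: an own goal counts for the opposing team.
            credited = next(t for t in teams if t != player["team"])
        else:
            credited = player["team"]
        score[credited] += 1
        events.append(
            (
                int(goal["period"]),
                int(goal["minute"]),
                player["player name"],
                credited,
                dict(score),
            )
        )
    return events


for event in build_timeline(TIMELINE, PLAYERS, ("France", "Croatia")):
    print(event)
```

Storing the team in the player file works for a single match, but it would not cover a player who has scored for more than one country, so the team probably belongs on the goal row instead.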

martj42 commented 1 year ago

As of https://github.com/martj42/international_results/commit/9341294f7428d7aa35a44b30970aaef1eecdab03, I have added goalscorers.csv. So far the goalscorers are for the World Cup, Euros, and Copa America.