Closed owendavies93 closed 3 years ago
We should also add a unique constraint to understat and football-data IDs in the players
table.
There's an issue with non-ascii characters incorrectly matching as well, it seems:
mandyville=# select first_name, last_name from players where understat_id = 71;
first_name | last_name
------------+--------------
Berkay | Özcan
Nicolás | González
Santiago | Ascacíbar
Emiliano | Insúa
Timo | Baumgartl
Philipp | Förster
Mario | Gómez García
Timo Baumgartl is the correct player here.
mandyville=# select count(understat_id) from players where understat_id is not null;
count
-------
3522
(1 row)
mandyville=# select count(distinct understat_id) from players where understat_id is not null;
count
-------
3452
(1 row)
This has improved somewhat after fixing this bug
Once this is fixed, we should add a unique constraint to the understat_id column to prevent this from reoccurring.
I've now manually corrected all the duplicate IDs.
Around 100 understat IDs (3%) are duplicated in the data set. e.g. is https://understat.com/player/1521 Gabriel Boschilia or Zé Gabriel? Or neither?! (it's actually Gabriel Barbosa who's not even in the dataset).
The understat ID code needs refining to avoid this, and we need a method for fixing the incorrect ones.