Correct duplicate understat IDs

owendavies93 / mandyville-data

Data fetching, external API interaction and data storage for mandyville.

MIT License

1 stars 0 forks source link

Correct duplicate understat IDs #6

Closed owendavies93 closed 3 years ago

owendavies93 commented 3 years ago

mandyville=# select count(understat_id) from players where understat_id is not null;
 count 
-------
  3164
(1 row)

mandyville=# select count(distinct understat_id) from players where understat_id is not null;
 count 
-------
  3063
(1 row)

Around 100 understat IDs (3%) are duplicated in the data set. e.g. is https://understat.com/player/1521 Gabriel Boschilia or Zé Gabriel? Or neither?! (it's actually Gabriel Barbosa who's not even in the dataset).

The understat ID code needs refining to avoid this, and we need a method for fixing the incorrect ones.

owendavies93 commented 3 years ago

We should also add a unique constraint to understat and football-data IDs in the players table.

owendavies93 commented 3 years ago

There's an issue with non-ascii characters incorrectly matching as well, it seems:

mandyville=# select first_name, last_name from players where understat_id = 71;
 first_name |  last_name   
------------+--------------
 Berkay     | Özcan
 Nicolás    | González
 Santiago   | Ascacíbar
 Emiliano   | Insúa
 Timo       | Baumgartl
 Philipp    | Förster
 Mario      | Gómez García

Timo Baumgartl is the correct player here.

owendavies93 commented 3 years ago

mandyville=# select count(understat_id) from players where understat_id is not null;
 count 
-------
  3522
(1 row)

mandyville=# select count(distinct understat_id) from players where understat_id is not null;
 count 
-------
  3452
(1 row)

This has improved somewhat after fixing this bug

owendavies93 commented 3 years ago

Once this is fixed, we should add a unique constraint to the understat_id column to prevent this from reoccurring.

owendavies93 commented 3 years ago

I've now manually corrected all the duplicate IDs.