jimmyday12 / fitzRoy

A set of functions to easily access AFL data
https://jimmyday12.github.io/fitzRoy

AFLTables Extract Has Fewer Unique IDs Than Debutants On AFLTables #72

Open TonyCorke opened 5 years ago

TonyCorke commented 5 years ago


According to AFLTables as at the end of R3 2019, there have been 12,710 debutants. There are only 12,703 unique IDs in the AFLTables extract.

``` r
library(fitzRoy)

stats <- get_afltables_stats(start_date = "1897-01-01", end_date = "2019-06-01")
#> Returning data from 1897-01-01 to 2019-06-01
#> Finished getting afltables data
length(unique(stats$ID))
#> [1] 12703
setdiff(1:12710, unique(stats$ID))
#> [1] 12581 12677 12678 12679 12680 12681 12682 12683
setdiff(unique(stats$ID), 1:12710)
#> [1] 0
```

Created on 2019-04-14 by the reprex package (v0.2.1)

afableco commented 5 years ago

The differences are due to mis-codings. The following accounts for the 7-player discrepancy that @TonyCorke highlighted:

  1. Arthur Davidson: there have been two. The first played for Fitzroy in 1897; his games were recorded under his contemporary Alex Davidson (ID 4350). The second (ID 2755) played for Hawthorn in 1939.
  2. George McLeod: there have been two; both played for St Kilda at the turn of the century. The one who played in 1903 has been recorded as the other George McLeod.
  3. Archie Richardson (ID 4528): this is an interesting one. Archie played in the VFA for Richmond, and it was believed that he also played for St Kilda, although this now seems to be discredited. The games have been credited to three separate Richardsons: 1898 Mr Richardson, 1900 William Richardson, and 1901 Alfred Richardson.
  4. Jim Dorgan (ID 2796), who played for South Melbourne, has been coded in place of Jack Dorgan, who played for Melbourne.
  5. Alex Johnston (ID 5395) played only once for Richmond in 1908, but you have him as playing twice. You do not have Walter Johnston, who played in Round 8.

The following are inconsistencies between the names used in the fitzRoy package and AFL Tables. They don't affect the statistics, but may cause problems if the data sitting behind the player IDs is ever rescraped.

  1. Kelly Robinson (ID 4375) should be James Robinson
  2. Alf McDougall (ID 4577) should be Abe McDougall
  3. Alex Barningham (ID 5760) should be Alick Barningham
  4. Phonse Hayes (ID 7212) this matches up with Australianfootball.com, but AFL Tables has Alf Hayes
  5. Allan Rogers (ID 3613) this matches up with Australianfootball.com, but AFL Tables has Allen Rogers
  6. Andy McDonnell (ID 5240) should be McDonell
  7. Arch Middleton (ID 4264) is listed on AFL Tables twice - once as Arch Middleton and once as Arthur Middleton. Australianfootball.com has him as the latter. This is not double counted because all of the AFL Tables entries refer to Arch.
  8. Garry Lowe (ID 10792) this matches up with Australianfootball.com, but AFL Tables has Gary Lowe.
  9. Harrison Himmelberg (ID 12462) AFL Tables uses Harry.
  10. Jack Matthews (ID 8376) this matches up with Australianfootball.com, but AFL Tables has Mathews.
  11. Jay Kennedy-Harris (ID 12245) AFL Tables does not hyphenate the Kennedy Harris.
  12. Bob Hooper (ID 4695) AFL Tables uses John rather than the shortened form of 'Bobadil'.
  13. Matthew de Boer (ID 11746) AFL Tables uses Matt.
  14. Patrick Ryder (ID 4144) AFL Tables uses Paddy.
  15. Ernie Blencowe (ID 5234) is listed on AFL Tables twice - once as Percy Blencowe and once as Ernie Blencowe. Australianfootball.com has him as the latter. This is not double counted because all of the AFL Tables entries refer to Percy.
  16. Jim Darcy (ID 4318) should be Tom Darcy.
  17. Pos Watson (ID 4570) AFL Tables has this as Unknown Watson, but it has Pos's date of birth etc.
  18. Terry De Konning (ID 11103) should be De Koning.
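
For the entries above that are plain "should be" corrections, a small lookup keyed on player ID is one way to apply them. This is a minimal sketch on a toy data frame, assuming the `ID`, `First.name` and `Surname` columns used elsewhere in this thread; `name_fixes` is a hypothetical table, and the row with ID 99999 is a made-up placeholder included only to show that unmatched rows are left alone.

``` r
# Hypothetical lookup built from four of the "should be" items in the list above.
name_fixes <- data.frame(
  ID = c(4577, 5760, 5240, 11103),
  First.name = c("Abe", "Alick", "Andy", "Terry"),
  Surname = c("McDougall", "Barningham", "McDonell", "De Koning"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the afltables extract (the last row is a made-up placeholder).
toy <- data.frame(
  ID = c(4577, 5760, 5240, 11103, 99999),
  First.name = c("Alf", "Alex", "Andy", "Terry", "Some"),
  Surname = c("McDougall", "Barningham", "McDonnell", "De Konning", "Player"),
  stringsAsFactors = FALSE
)

# match() returns NA for IDs with no correction, so only matched rows are touched.
hit <- match(toy$ID, name_fixes$ID)
found <- !is.na(hit)
toy$First.name[found] <- name_fixes$First.name[hit[found]]
toy$Surname[found] <- name_fixes$Surname[hit[found]]

toy
```

Keying on `ID` rather than on the name means the correction does not depend on how the name is currently spelled in the extract.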

TonyCorke commented 5 years ago

This is fabulous @afableco. Thank you.

Am I right that we're still one short of the seven we need though, as we get from the changes:

So that's +7 and -1 for a net gain of 6.

Or, have I misinterpreted your explanation?
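
The tally can be checked with a few lines of R. The per-player counts below are one reading of the five numbered mis-codings earlier in the thread, not figures stated by either commenter, so treat them as an assumption.

``` r
# New IDs implied by each mis-coding (an interpretation of the list above).
new_ids <- c(
  "Arthur Davidson" = 1,
  "George McLeod" = 1,
  "Richardson (Mr, William, Alfred)" = 3,
  "Jack Dorgan" = 1,
  "Walter Johnston" = 1
)

# Archie Richardson's ID disappears once his games are reassigned.
removed_ids <- c("Archie Richardson" = 1)

net_gain <- sum(new_ids) - sum(removed_ids)

# Shortfall reported in the issue: 12,710 debutants vs 12,703 unique IDs.
shortfall <- 12710 - 12703
still_missing <- shortfall - net_gain

c(net_gain = net_gain, shortfall = shortfall, still_missing = still_missing)
```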

afableco commented 5 years ago

Sadly, you are correct. I forgot to net off Archie Richardson. I will try and get back to this on the weekend to see if I can work out who else is missing.

TonyCorke commented 5 years ago

No rush at all - and thank you for looking at the issue I raised so quickly!

afableco commented 5 years ago

The answer is Tom Darcy. In my original note, I had that Jim Darcy (ID 4318) should have been Tom Darcy, but it seems they are two separate people. Tom played for South Melbourne and had his first game on 1904-09-03, and Jim played for Essendon and had his first game on 1897-05-08.

There are other issues with the data (e.g. Cam Rayner is recorded as Heber Quinton in 2018).

TonyCorke commented 5 years ago

Perfect! Thanks again.

Below is some code that can be used to patch the data:

``` r
library(fitzRoy)

dat <- get_afltables_stats(start_date = "1897-05-01", end_date = "2019-05-21")

# Each fix computes its row index once, before any column is changed.
# Otherwise a filter such as `dat$ID == 4350` matches nothing after the ID
# has been reassigned, and the name columns are never updated.

# Fix Arthur Davidson (recorded as Alex Davidson)
idx <- dat$ID == 4350 & dat$Playing.for == "Fitzroy" & dat$Season == 1898 & dat$Round %in% c(7, 10)
dat$ID[idx] <- 15000
dat$First.name[idx] <- "Arthur"
dat$Surname[idx] <- "Davidson"

# Fix George McLeod (there were two)
idx <- dat$First.name == "George" & dat$Surname == "McLeod" & dat$Playing.for == "St Kilda" & dat$Season == 1903
dat$ID[idx] <- 15001

# Fix Archie Richardson (three different guys)
archie <- dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda"

idx <- archie & dat$Season == 1898
dat$ID[idx] <- 15002
dat$First.name[idx] <- "Mr"

idx <- archie & dat$Season == 1900
dat$ID[idx] <- 15003
dat$First.name[idx] <- "William"

idx <- archie & dat$Season == 1901
dat$ID[idx] <- 15004
dat$First.name[idx] <- "Alfred"

# Fix Jack Dorgan (recorded as Jim Dorgan)
idx <- dat$First.name == "Jim" & dat$Surname == "Dorgan" & dat$Season == 1949
dat$ID[idx] <- 15005
dat$First.name[idx] <- "Jack"

# Fix Walter Johnston (recorded as Alex Johnston)
idx <- dat$First.name == "Alex" & dat$Surname == "Johnston" & dat$Playing.for == "Richmond" & dat$Season == 1908 & dat$Round == 8
dat$ID[idx] <- 15006
dat$First.name[idx] <- "Walter"

# Fix Tom Darcy (recorded as Jim)
idx <- dat$First.name == "Jim" & dat$Surname == "Darcy" & dat$Playing.for == "Sydney" & dat$Season == 1904 & dat$Round == 17
dat$ID[idx] <- 15007
dat$First.name[idx] <- "Tom"
```
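
One thing to watch in patches like this: when a fix changes a column it also filters on (as the Arthur Davidson fix does with `ID`), the order of the statements matters. The toy example below is a sketch on made-up rows, not real afltables data; it compares re-evaluating the filter after the ID change with computing the index once up front.

``` r
# Toy rows: three games recorded under ID 4350, two of them in 1898.
toy <- data.frame(
  ID = c(4350, 4350, 4350),
  First.name = "Alex",
  Surname = "Davidson",
  Season = c(1898, 1898, 1899),
  stringsAsFactors = FALSE
)

# Approach A: change the ID first, then filter on the old ID again.
a <- toy
a$ID[a$ID == 4350 & a$Season == 1898] <- 15000
a$First.name[a$ID == 4350 & a$Season == 1898] <- "Arthur"  # matches no rows now

# Approach B: compute the index once and reuse it for every column.
b <- toy
idx <- b$ID == 4350 & b$Season == 1898
b$ID[idx] <- 15000
b$First.name[idx] <- "Arthur"

a$First.name
b$First.name
```

In approach A the 1898 rows get the new ID but keep the first name "Alex"; in approach B both columns are updated on the same rows.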

jimmyday12 commented 5 years ago

Thanks heaps for all this, guys. I'm going to try to block out some time to focus on some of these in the coming weeks.

I will need to work out which issues are to do with fitzRoy, versus which are to do with the underlying data on afltables.com. My general philosophy is to leave things as they appear on afltables.com and try to get Paul, who runs the website, to fix it there. But some helper functions to clean the data may also be useful - will have to think about it!
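
As a rough idea of what such a helper could look like, the sketch below applies a table of patches to a data frame. `apply_player_patches()` and the `patches` layout are hypothetical and not part of fitzRoy; the column names and the 15002-15004 IDs are taken from the patch code earlier in this thread, and the toy rows are made up.

``` r
# Hypothetical helper: reassign ID and first name for rows matching each patch.
apply_player_patches <- function(dat, patches) {
  for (i in seq_len(nrow(patches))) {
    p <- patches[i, ]
    # Compute the row index once, before any column is modified.
    idx <- dat$First.name == p$first_name &
      dat$Surname == p$surname &
      dat$Playing.for == p$team &
      dat$Season == p$season
    dat$ID[idx] <- p$new_id
    dat$First.name[idx] <- p$new_first_name
  }
  dat
}

# Patches for the three Richardsons, one row per season.
patches <- data.frame(
  first_name = "Archie",
  surname = "Richardson",
  team = "St Kilda",
  season = c(1898, 1900, 1901),
  new_id = c(15002, 15003, 15004),
  new_first_name = c("Mr", "William", "Alfred"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the extract; the 1902 row is made up and should be untouched.
toy <- data.frame(
  ID = 4528,
  First.name = "Archie",
  Surname = "Richardson",
  Playing.for = "St Kilda",
  Season = c(1898, 1900, 1901, 1902),
  stringsAsFactors = FALSE
)

patched <- apply_player_patches(toy, patches)
patched
```

Keeping the patches in a table rather than in code would make it easier to drop individual rows once the corresponding record is corrected on afltables.com.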

Thanks for all the work so far identifying them!

peteowen1 commented 1 month ago

fixed by https://github.com/jimmyday12/fitzRoy/pull/235