jimmyday12 / fitzRoy

A set of functions to easily access AFL data
https://jimmyday12.github.io/fitzRoy

get_afltables_stats() does not reconcile with AFLTables data #120

Closed TonyCorke closed 3 years ago

TonyCorke commented 4 years ago

get_afltables_stats(start_date = "1897-05-01", end_date = "2020-05-21") returns data that does not match that in AFLTables

As at 26 May 2020, the required changes are as follows (note that none of them corrects Jumper Number data). These changes perfectly align the career totals produced for all players with those on the AFLTables Big Lists page.

I have provided a high level description of the issue and then the R code required to fix it

(0) Arthur Davidson is recorded as Alex Davidson
# ID is assigned last because the row filter tests dat$ID == 4350
dat$First.name[dat$ID == 4350 & dat$Playing.for == "Fitzroy" & dat$Season == 1898 & dat$Round %in% c(7,10)] = "Arthur"
dat$Surname[dat$ID == 4350 & dat$Playing.for == "Fitzroy" & dat$Season == 1898 & dat$Round %in% c(7,10)] = "Davidson"
dat$ID[dat$ID == 4350 & dat$Playing.for == "Fitzroy" & dat$Season == 1898 & dat$Round %in% c(7,10)] = 15000


(1) Only one of the two George McLeods is recognised
dat$ID[dat$First.name == "George" & dat$Surname == "McLeod" & dat$Playing.for == "St Kilda" & dat$Season == 1903] = 15001


(2) Only one of the three Archie Richardsons is recognised
dat$ID[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1898] = 15002
dat$First.name[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1898] = "Mr"
dat$Surname[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1898] = "Richardson"

dat$ID[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1900] = 15003
dat$First.name[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1900] = "William"
dat$Surname[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1900] = "Richardson"

dat$ID[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1901] = 15004
dat$First.name[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1901] = "Alfred"
dat$Surname[dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda" & dat$Season == 1901] = "Richardson"


(3) Jack Dorgan is recorded as Jim Dorgan
dat$ID[dat$First.name == "Jim" & dat$Surname == "Dorgan" & dat$Season == 1949] = 15005
dat$First.name[dat$First.name == "Jim" & dat$Surname == "Dorgan" & dat$Season == 1949] = "Jack"
dat$Surname[dat$First.name == "Jim" & dat$Surname == "Dorgan" & dat$Season == 1949] = "Dorgan"


(4) Walter Johnston is recorded as Alex Johnston
dat$ID[dat$First.name == "Alex" & dat$Surname == "Johnston" & dat$Playing.for == "Richmond" & dat$Season == 1908 & dat$Round == 8] = 15006
dat$First.name[dat$First.name == "Alex" & dat$Surname == "Johnston" & dat$Playing.for == "Richmond" & dat$Season == 1908 & dat$Round == 8] = "Walter"
dat$Surname[dat$First.name == "Alex" & dat$Surname == "Johnston" & dat$Playing.for == "Richmond" & dat$Season == 1908 & dat$Round == 8] = "Johnston"


(5) Jim Darcy is recorded as Tom Darcy
dat$ID[dat$First.name == "Jim" & dat$Surname == "Darcy" & dat$Playing.for == "Sydney" & dat$Season == 1904 & dat$Round == 17] = 15007
dat$First.name[dat$First.name == "Jim" & dat$Surname == "Darcy" & dat$Playing.for == "Sydney" & dat$Season == 1904 & dat$Round == 17] = "Tom"
dat$Surname[dat$First.name == "Jim" & dat$Surname == "Darcy" & dat$Playing.for == "Sydney" & dat$Season == 1904 & dat$Round == 17] = "Darcy"


(6) Heber Quinton is erroneously attached to the games for four other players
Rows_to_Fix_BL = which(dat$Surname == "Quinton" & dat$Date > as.Date("1908-01-01", format = "%Y-%m-%d") & dat$Playing.for == "Brisbane Lions")
Rows_to_Fix_Freo = which(dat$Surname == "Quinton" & dat$Date > as.Date("1908-01-01", format = "%Y-%m-%d") & dat$Playing.for == "Fremantle")
Rows_to_Fix_Belcher = which(dat$Surname == "Quinton" & dat$Season == 1907 & dat$Round %in% c(1,2, 3,6,7,9,12) & dat$Playing.for == "Essendon")
Rows_to_Fix_Robinson = which(dat$Surname == "Quinton" & dat$Season == 1901 & dat$Round == 3 & dat$Playing.for == "Fitzroy")

dat$First.name[Rows_to_Fix_BL] = "Cam"
dat$Surname[Rows_to_Fix_BL] = "Rayner"
dat$ID[Rows_to_Fix_BL] = 12592

dat$First.name[Rows_to_Fix_Freo] = "Cameron"
dat$Surname[Rows_to_Fix_Freo] = "Sutcliffe"
dat$ID[Rows_to_Fix_Freo] = 12104

dat$First.name[Rows_to_Fix_Belcher] = "Alan"
dat$Surname[Rows_to_Fix_Belcher] = "Belcher"
dat$ID[Rows_to_Fix_Belcher] = 5187

dat$First.name[Rows_to_Fix_Robinson] = "James"
dat$Surname[Rows_to_Fix_Robinson] = "Robinson"
dat$ID[Rows_to_Fix_Robinson] = 4375


(7) Les Hughson appears instead of Mick Hughson in a number of games
Rows_to_Fix_Hughson = which(dat$ID == 3098 & dat$Season == 1937 & dat$Round %in% c(10,12) & dat$Playing.for == "Fitzroy")

dat$First.name[Rows_to_Fix_Hughson] = "Les"
dat$ID[Rows_to_Fix_Hughson] = 3096


(8) Bert Barling appears instead of Cecil Sandford in a number of games
Rows_to_Fix_Sandford = which(dat$ID == 4410 & dat$Season == 1898 & dat$Round == 4 & dat$Playing.for == "Geelong")

dat$First.name[Rows_to_Fix_Sandford] = "Bert"
dat$Surname[Rows_to_Fix_Sandford] = "Barling"
dat$ID[Rows_to_Fix_Sandford] = 4382


(9) George Hastings appears instead of Harry Wright in a number of games
Rows_to_Fix_Wright = which(dat$ID == 4343 & dat$Season == 1901 & dat$Round == 2 & dat$Playing.for == "Essendon")

dat$First.name[Rows_to_Fix_Wright] = "George"
dat$Surname[Rows_to_Fix_Wright] = "Hastings"
dat$ID[Rows_to_Fix_Wright] = 4324


(10) Clyde Smith appears instead of Basil Smith in a number of games
Rows_to_Fix_Smith = which(dat$ID == 6805 & dat$Season == 1921 & dat$Round == 18 & dat$Playing.for == "Collingwood")

dat$First.name[Rows_to_Fix_Smith] = "Clyde"
dat$ID[Rows_to_Fix_Smith] = 6877


(11) Bob King appears instead of George King in a number of games
Rows_to_Fix_King = which(dat$ID == 6183 & dat$Season == 1916 & dat$Round %in% c(4,5) & dat$Playing.for == "Fitzroy")

dat$First.name[Rows_to_Fix_King] = "Bob"
dat$ID[Rows_to_Fix_King] = 6405


(12) Michael OGorman, George Sutherland and Fred Warry appear for one another in a number of games

No 1. Switch George Sutherland for Michael OGorman in Round 10

Rows_to_Fix_OSW_1 = which(dat$ID == 4295 & dat$Season == 1900 & dat$Round == 10 & dat$Playing.for == "St Kilda")

dat$First.name[Rows_to_Fix_OSW_1] = "George"
dat$Surname[Rows_to_Fix_OSW_1] = "Sutherland"
dat$ID[Rows_to_Fix_OSW_1] = 4735

No 2. Switch Fred Warry for Michael OGorman in Round 5

Rows_to_Fix_OSW_2 = which(dat$ID == 4295 & dat$Season == 1900 & dat$Round == 5 & dat$Playing.for == "St Kilda")

dat$First.name[Rows_to_Fix_OSW_2] = "Fred"
dat$Surname[Rows_to_Fix_OSW_2] = "Warry"
dat$ID[Rows_to_Fix_OSW_2] = 4736

No 3. Switch Michael OGorman for Fred Warry in Round 4

Rows_to_Fix_OSW_3 = which(dat$ID == 4736 & dat$Season == 1900 & dat$Round == 4 & dat$Playing.for == "St Kilda")

dat$First.name[Rows_to_Fix_OSW_3] = "Michael"
dat$Surname[Rows_to_Fix_OSW_3] = "OGorman"
dat$ID[Rows_to_Fix_OSW_3] = 4295


(13) Bob McCaskill, Donald Don and Ralph Empey appear for one another in a number of games

No 1. Switch Ralph Empey for Bob McCaskill in Round 14

Rows_to_Fix_MDE_1 = which(dat$ID == 3286 & dat$Season == 1925 & dat$Round == 14 & dat$Playing.for == "Richmond")

dat$First.name[Rows_to_Fix_MDE_1] = "Ralph"
dat$Surname[Rows_to_Fix_MDE_1] = "Empey"
dat$ID[Rows_to_Fix_MDE_1] = 6962

No 2. Switch Donald Don for Ralph Empey in Round 10

Rows_to_Fix_MDE_2 = which(dat$ID == 6962 & dat$Season == 1925 & dat$Round == 10 & dat$Playing.for == "Richmond")

dat$First.name[Rows_to_Fix_MDE_2] = "Donald"
dat$Surname[Rows_to_Fix_MDE_2] = "Don"
dat$ID[Rows_to_Fix_MDE_2] = 2792


(14) Clarrie Dall appears instead of Charlie McMillan in a number of games
Rows_to_Fix_McMillan = which(dat$ID == 5991 & dat$Season == 1911 & dat$Round == 17 & dat$Playing.for == "Fitzroy")

dat$First.name[Rows_to_Fix_McMillan] = "Clarrie"
dat$Surname[Rows_to_Fix_McMillan] = "Dall"
dat$ID[Rows_to_Fix_McMillan] = 5983


(15) Robert and George White appear in the wrong games

No.1

Rows_to_Fix_White_1 = which(dat$ID == 6416 & dat$Season == 1916 & dat$Round == 5 & dat$Playing.for == "Carlton")

dat$First.name[Rows_to_Fix_White_1] = "Robert"
dat$ID[Rows_to_Fix_White_1] = 6417

No.2

Rows_to_Fix_White_2 = which(dat$ID == 6417 & dat$Season == 1916 & dat$Round == 8 & dat$Playing.for == "Carlton")

dat$First.name[Rows_to_Fix_White_2] = "George"
dat$ID[Rows_to_Fix_White_2] = 6416


(16) A number of people with the same names as other players appear in the squads for games they did not play

library(dplyr) # provides %>% and filter()

Remove extra George Shaw in Fitzroy R5 1912

dat = dat %>% filter(!(ID == 4580 & Season == 1912 & Round == 5))

Remove extra Peter Stephens in Geelong R1 1907

dat = dat %>% filter(!(ID == 10665 & Season == 1907 & Round == 1))

Remove extra Jim Stewart in St Kilda R3 1907

dat = dat %>% filter(!(ID == 5685 & Season == 1907 & Round == 3))

Remove extra Albert Pannam in Collingwood R5 1907

dat = dat %>% filter(!(ID == 3500 & Season == 1907 & Round == 5))

Remove extra Albert Pannam in Collingwood R12 1907

dat = dat %>% filter(!(ID == 3500 & Season == 1907 & Round == 12))


(17) The dates for replayed Finals are incorrect
dat$Date[dat$Season == 1928 & dat$Round == "SF" & dat$Attendance == 42175] = as.Date("1928-09-22", format = "%Y-%m-%d")
dat$Date[dat$Season == 1946 & dat$Round == "SF" & dat$Attendance == 64400] = as.Date("1946-09-21", format = "%Y-%m-%d")
dat$Date[dat$Season == 1948 & dat$Round == "GF" & dat$Attendance == 52226] = as.Date("1948-10-09", format = "%Y-%m-%d")
dat$Date[dat$Season == 1962 & dat$Round == "PF" & dat$Attendance == 99203] = as.Date("1962-09-22", format = "%Y-%m-%d")
dat$Date[dat$Season == 1972 & dat$Round == "SF" & dat$Attendance == 92670] = as.Date("1972-09-23", format = "%Y-%m-%d")
dat$Date[dat$Season == 1977 & dat$Round == "GF" & dat$Attendance == 98491] = as.Date("1977-10-01", format = "%Y-%m-%d")
dat$Date[dat$Season == 1990 & dat$Round == "GF" & dat$Attendance == 53520] = as.Date("1990-09-15", format = "%Y-%m-%d")
dat$Date[dat$Season == 2010 & dat$Round == "GF" & dat$Attendance == 93853] = as.Date("2010-10-02", format = "%Y-%m-%d")


(18) Rows for a number of Finals are duplicated
dat = dat[!duplicated(dat[c("ID","First.name", "Surname", "Date", "Playing.for")]),]
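
Most of the fixes above share one pattern: select the affected rows, then overwrite the name and ID. A minimal sketch of a helper for that pattern follows, in base R. `reassign_player` is a hypothetical name and not a fitzRoy function, and the toy data frame only imitates the columns used above. Because the rows are selected once, before anything is overwritten, the order of the assignments no longer matters.

```r
# Hypothetical helper, not part of fitzRoy: overwrite identity columns
# for a set of rows selected up front.
reassign_player <- function(dat, rows, id, first_name = NULL, surname = NULL) {
  # `rows` is an integer index vector, e.g. the result of which()
  dat$ID[rows] <- id
  if (!is.null(first_name)) dat$First.name[rows] <- first_name
  if (!is.null(surname)) dat$Surname[rows] <- surname
  dat
}

# Toy data standing in for the real player-level data (values are illustrative)
toy <- data.frame(
  ID = c(4350, 4350, 4350),
  First.name = c("Alex", "Alex", "Alex"),
  Surname = c("Davidson", "Davidson", "Davidson"),
  Playing.for = c("Fitzroy", "Fitzroy", "Fitzroy"),
  Season = c(1898, 1898, 1898),
  Round = c("6", "7", "10"),
  stringsAsFactors = FALSE
)

# Select the rows once, then apply the change in a single call
rows <- which(toy$ID == 4350 & toy$Playing.for == "Fitzroy" &
                toy$Season == 1898 & toy$Round %in% c(7, 10))
toy <- reassign_player(toy, rows, id = 15000, first_name = "Arthur")
```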

liam-crow commented 4 years ago

@TonyCorke was this ever incorporated? Also, do you have a standalone script that corrects all of these?

I ran into a problem recently with Albert Pannam: apparently he had 27 years between games (obviously incorrect). Is there any known reason why these would slip through despite being (mostly) accurate on AFLTables?
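
An anomaly like this one (a very long gap between two games under the same ID) can be screened for mechanically. A minimal sketch in base R, assuming a data frame with `ID` and `Date` columns like those used in the code earlier in this thread. The toy data, the helper name `max_gap_days` and the 10-year threshold are illustrative assumptions, not part of fitzRoy.

```r
# Toy data: hypothetical IDs and dates, for illustration only
toy <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  Date = as.Date(c("1907-05-04", "1907-05-11", "1934-06-02",
                   "1907-05-04", "1908-05-02")),
  stringsAsFactors = FALSE
)

# Largest gap, in days, between consecutive games in a vector of dates
max_gap_days <- function(dates) {
  dates <- sort(dates)
  if (length(dates) < 2) return(0)
  max(as.numeric(diff(dates)))
}

# One value per ID, named by ID
gaps <- sapply(split(toy$Date, toy$ID), max_gap_days)

# Flag IDs whose longest gap exceeds roughly 10 years (arbitrary threshold)
suspicious <- names(gaps)[gaps > 10 * 365]
```

A flagged ID is only a candidate for manual checking against AFLTables; a long gap on its own does not prove two players were merged.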

TonyCorke commented 4 years ago

Not sure, mate, but happy to share my R script that adjusts for them if that would be helpful. Shoot me an e-mail if so.

jimmyday12 commented 4 years ago

@liam-crow This hasn't been fixed yet, but it's on the ever-growing to-do list! Always open to a PR with Tony's fixes on the raw data if you are keen, but if not, I'll aim to get something done in the next few weeks.

RE how it happened: Paul from AFLTables supplied me with a CSV dump of the historical data, which I host on GitHub rather than scraping historical games directly from the site (scraping these manually would take a fairly long time and put heaps of stress on his servers). My best guess is that there were some inconsistencies in that file, but they could also have cropped up in some initial processing I did to get the file into a nice format for the package.

Either way - @TonyCorke's code above will fix it, I just need to implement the fix on the file that the package uses.

TonyCorke commented 4 years ago

Completely understandable, James. It must be a nightmare having to wrestle with so many different interfaces and formats.

TC

jimmyday12 commented 3 years ago

@TonyCorke while this is fresh - do you have a script handy that fixes this? If not, I can probably just pull out your code from the comment above

jimmyday12 commented 3 years ago

@TonyCorke I think I've got something working on a branch. Would be good to confirm but as far as I can tell, it's fixed all the issues. Want to write some tests before merging into the main branch but if you could test it out that would be really helpful!

To install from the branch:
devtools::install_github("jimmyday12/fitzRoy", ref = "tony-data-fix")

When you are done testing, you should re-install the package from CRAN (or the development version):

# CRAN
install.packages("fitzRoy")

# Development
devtools::install_github("jimmyday12/fitzRoy")

As a bit of a side note, since the data lives in a separate repo (https://github.com/jimmyday12/fitzroy_data), once I've confirmed it's working, we won't need an actual release or change to the package. At the moment, I'm just keeping all the changes in a separate branch on both repos.

jimmyday12 commented 3 years ago

As far as I can tell - I fixed this over in jimmyday12/fitzRoy_data@855f817cdce67fc0987ee182a403d27f3c8d583d.

Happy to revisit if needed (and would love some replicating test cases if it's still wrong)
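
One possible shape for such test cases, sketched in base R. It assumes the column names used in the code earlier in this thread (`ID`, `First.name`, `Surname`, `Date`, `Playing.for`). `check_player_data` is a hypothetical helper, not a fitzRoy function, and the toy data is invented for illustration. It covers two of the symptoms reported above: duplicated rows and a single ID carrying more than one name.

```r
# Hypothetical sanity checks, not part of fitzRoy
check_player_data <- function(dat) {
  # A player should appear at most once per date and team
  n_dupes <- sum(duplicated(dat[c("ID", "Date", "Playing.for")]))

  # Count the distinct full names attached to each ID
  full_name <- paste(dat$First.name, dat$Surname)
  names_per_id <- sapply(split(full_name, dat$ID),
                         function(x) length(unique(x)))

  list(
    duplicate_rows = n_dupes,
    ids_with_multiple_names = names(names_per_id)[names_per_id > 1]
  )
}

# Toy data containing one duplicated row and one ID with two names
toy <- data.frame(
  ID = c(10, 10, 11, 11),
  First.name = c("A", "A", "B", "C"),
  Surname = c("X", "X", "Y", "Y"),
  Date = as.Date(c("1928-09-22", "1928-09-22", "1928-09-22", "1928-09-29")),
  Playing.for = c("T", "T", "T", "T"),
  stringsAsFactors = FALSE
)

res <- check_player_data(toy)
```

On real data, both `res$duplicate_rows == 0` and an empty `res$ids_with_multiple_names` would be the expected result after the fixes, though legitimate name variations for one player would need to be allowed for.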