k5cents / whatr

Read Jeopardy game data in R
https://kiernann.github.io/whatr/
GNU General Public License v3.0

`whatr_data()` cannot parse games with tiebreaker questions #12

Open john-b-edwards opened 2 years ago

john-b-edwards commented 2 years ago

Whenever a game goes to a tiebreaker (five such instances since 2018, so fairly rare), whatr_data() cannot parse information about the board/game state. This is admittedly a niche edge case, but it still breaks the existing code.

regular_game <- 5961
tiebreaker_game <- 5922
regular_html <- whatr::whatr_html(regular_game)
tiebreaker_html <- whatr::whatr_html(tiebreaker_game)

# expected result
regular_html |>
  whatr::whatr_data()
#> $info
#> # A tibble: 1 × 3
#>    game  show date      
#>   <int> <int> <date>    
#> 1  5961  7744 2018-04-19
#> 
#> $summary
#> # A tibble: 3 × 5
#>   name    final coryat right wrong
#>   <chr>   <int>  <int> <int> <int>
#> 1 William     0  12200    17     0
#> 2 Hannah   7600  15600    15     0
#> 3 Dhruv    9000  14800    19     2
#> 
#> $players
#> # A tibble: 3 × 4
#>   first   last  occupation                                     from             
#>   <chr>   <chr> <chr>                                          <chr>            
#> 1 Dhruv   Gaur  Freshman At Brown University                   Gainesville, Geo…
#> 2 Hannah  Sage  Sophomore At The University Of Central Florida Sarasota, Florida
#> 3 William Scott Freshman At Tufts University                   Los Altos, Calif…
#> 
#> $scores
#> # A tibble: 56 × 5
#>    round     i name    score double
#>    <int> <int> <chr>   <int> <lgl> 
#>  1     1     1 Dhruv     200 FALSE 
#>  2     1     2 William   800 FALSE 
#>  3     1     3 William   400 FALSE 
#>  4     1     3 Dhruv    -400 FALSE 
#>  5     1     4 William   200 FALSE 
#>  6     1     5 Dhruv     400 FALSE 
#>  7     1     6 Dhruv     800 FALSE 
#>  8     1     8 William   600 FALSE 
#>  9     1     9 Dhruv     600 FALSE 
#> 10     1    10 William  1000 FALSE 
#> # … with 46 more rows
#> 
#> $board
#> # A tibble: 59 × 7
#>    round   col   row     i category                         clue          answer
#>    <int> <int> <int> <int> <chr>                            <chr>         <chr> 
#>  1     1     1     1     4 You Kids & Your Music These Days 'What About … Pink  
#>  2     1     1     2     5 You Kids & Your Music These Days He's The Mul… Bruno…
#>  3     1     1     3     9 You Kids & Your Music These Days This Band Ha… Portu…
#>  4     1     1     4    12 You Kids & Your Music These Days This 2018 Gr… Aless…
#>  5     1     1     5    11 You Kids & Your Music These Days After Being … Shawn…
#>  6     1     2     1    13 Myth-Pourri                      This Top Nor… Odin  
#>  7     1     2     2    14 Myth-Pourri                      To Scientist… A Coy…
#>  8     1     2     3    16 Myth-Pourri                      This Corpora… Coca-…
#>  9     1     2     4     6 Myth-Pourri                      Named For A … An Ai…
#> 10     1     2     5    10 Myth-Pourri                      Athena, The … Miner…
#> # … with 49 more rows

# what happens with tiebreakers
tiebreaker_html |>
  whatr::whatr_data()
#> Error in UseMethod("html_table"): no applicable method for 'html_table' applied to an object of class "xml_missing"

Tiebreaker games seem to affect the following functions in {whatr} in addition to whatr_data():

Other whatr_* functions seem unaffected by this edge case.

Based on the J! Archive, I believe this is the list of all tiebreaker games:

k5cents commented 2 years ago

Nice find! whatr_data() is really just a list of all the other functions put together.

list(
  info = whatr_airdate(showgame),
  summary = whatr_synopsis(showgame),
  players = whatr_players(showgame),
  scores = whatr_scores(showscores),
  board = whatr_board(showgame)
)

So it must be one/all of those that is breaking.

Given the rarity of this bug, I can't say I will solve it myself very soon.

But anybody should feel free to dig into the code and submit a PR with any fixes!
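
One way to narrow this down is to call each component parser separately and record which ones raise an error. The following is only a minimal sketch in base R: `find_failing()` is a hypothetical helper, not part of {whatr}, and the toy parsers stand in for the real `whatr_*` functions listed above.

```r
# Hypothetical helper (not part of {whatr}): call each parser on its own
# input and return the error message of every parser that fails.
find_failing <- function(parsers, inputs) {
  stopifnot(length(parsers) == length(inputs))
  msgs <- vapply(
    seq_along(parsers),
    function(i) {
      tryCatch(
        {
          parsers[[i]](inputs[[i]])
          NA_character_ # parser succeeded, nothing to report
        },
        error = function(e) conditionMessage(e)
      )
    },
    character(1)
  )
  names(msgs) <- names(parsers)
  msgs[!is.na(msgs)]
}

# Toy stand-ins for the real parsers, for illustration only.
toy_parsers <- list(
  info = function(x) x,
  board = function(x) stop("no table found")
)
find_failing(toy_parsers, list("showgame", "showgame"))
```

With the real package, the same helper could presumably be given the component functions and the matching HTML documents for a tiebreaker game, which would show whether one or several of them break.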

sebastianernstroth commented 2 years ago

Hi both,

I recently ran into parsing issues with whatr_data() as well, so I took a deeper look at what was causing them and at how many games are actually affected. I found 14 games with tiebreakers (as of 25 November). Beyond tiebreakers, I also came across three other main causes of parsing failures. The following list flags, for each game that fails to parse, which of the four issues applies:

> missing_games %>%
+     tibble::as_tibble()
# A tibble: 190 × 5
    missing_game tiebreaker_round cumulative_scores triple_round noncharacter_scores
           <int> <lgl>            <lgl>             <lgl>        <lgl>              
  1           20 FALSE            TRUE              FALSE        FALSE              
  2           57 FALSE            TRUE              FALSE        FALSE              
  3          115 FALSE            TRUE              FALSE        FALSE              
  4          124 FALSE            TRUE              FALSE        FALSE              
  5          141 FALSE            TRUE              FALSE        FALSE              
  6          156 FALSE            TRUE              FALSE        FALSE              
  7          182 FALSE            TRUE              FALSE        FALSE              
  8          195 FALSE            TRUE              FALSE        FALSE              
  9          223 FALSE            TRUE              FALSE        FALSE              
 10          315 FALSE            TRUE              FALSE        FALSE              
 11          317 FALSE            TRUE              FALSE        FALSE              
 12          322 FALSE            TRUE              FALSE        FALSE              
 13          324 FALSE            TRUE              FALSE        FALSE              
 14          326 FALSE            TRUE              FALSE        FALSE              
 15          329 FALSE            TRUE              FALSE        FALSE              
 16          330 FALSE            TRUE              FALSE        FALSE              
 17          420 FALSE            TRUE              FALSE        FALSE              
 18          485 FALSE            TRUE              FALSE        FALSE              
 19          500 FALSE            TRUE              FALSE        FALSE              
 20          566 FALSE            TRUE              FALSE        FALSE              
 21          616 FALSE            TRUE              FALSE        FALSE              
 22          629 FALSE            TRUE              FALSE        FALSE              
 23          673 FALSE            TRUE              FALSE        FALSE              
 24          685 FALSE            TRUE              FALSE        FALSE              
 25          737 FALSE            TRUE              FALSE        FALSE              
 26          787 FALSE            TRUE              FALSE        FALSE              
 27          788 FALSE            TRUE              FALSE        FALSE              
 28          902 FALSE            TRUE              FALSE        FALSE              
 29          971 FALSE            TRUE              FALSE        FALSE              
 30         1022 FALSE            FALSE             FALSE        TRUE               
 31         1037 FALSE            TRUE              FALSE        FALSE              
 32         1129 FALSE            TRUE              FALSE        FALSE              
 33         1132 NA               NA                NA           NA                 
 34         1133 NA               NA                NA           NA                 
 35         1134 NA               NA                NA           NA                 
 36         1135 NA               NA                NA           NA                 
 37         1152 TRUE             TRUE              FALSE        FALSE              
 38         1153 NA               NA                NA           NA                 
 39         1271 FALSE            TRUE              FALSE        FALSE              
 40         1305 FALSE            TRUE              FALSE        FALSE              
 41         1348 FALSE            FALSE             FALSE        TRUE               
 42         1361 FALSE            TRUE              FALSE        FALSE              
 43         1421 FALSE            TRUE              FALSE        FALSE              
 44         1430 FALSE            TRUE              FALSE        FALSE              
 45         1440 FALSE            TRUE              FALSE        FALSE              
 46         1473 TRUE             FALSE             FALSE        FALSE              
 47         1477 FALSE            TRUE              FALSE        FALSE              
 48         1538 FALSE            TRUE              FALSE        FALSE              
 49         1592 FALSE            TRUE              FALSE        FALSE              
 50         1689 FALSE            TRUE              FALSE        FALSE              
 51         1755 FALSE            TRUE              FALSE        FALSE              
 52         1851 FALSE            TRUE              FALSE        FALSE              
 53         1989 FALSE            TRUE              FALSE        FALSE              
 54         1999 FALSE            TRUE              FALSE        FALSE              
 55         2140 FALSE            TRUE              FALSE        FALSE              
 56         2172 TRUE             FALSE             FALSE        FALSE              
 57         2175 FALSE            TRUE              FALSE        FALSE              
 58         2298 FALSE            TRUE              FALSE        FALSE              
 59         2347 FALSE            TRUE              FALSE        FALSE              
 60         2389 FALSE            TRUE              FALSE        FALSE              
 61         2464 FALSE            TRUE              FALSE        FALSE              
 62         2471 FALSE            TRUE              FALSE        FALSE              
 63         2481 FALSE            TRUE              FALSE        FALSE              
 64         2536 FALSE            TRUE              FALSE        FALSE              
 65         2585 FALSE            TRUE              FALSE        FALSE              
 66         2792 FALSE            TRUE              FALSE        FALSE              
 67         2927 FALSE            TRUE              FALSE        FALSE              
 68         2950 FALSE            TRUE              FALSE        FALSE              
 69         2966 FALSE            TRUE              FALSE        FALSE              
 70         3008 FALSE            TRUE              FALSE        FALSE              
 71         3081 TRUE             FALSE             FALSE        FALSE              
 72         3213 FALSE            TRUE              FALSE        FALSE              
 73         3314 FALSE            TRUE              FALSE        FALSE              
 74         3386 FALSE            TRUE              FALSE        FALSE              
 75         3396 FALSE            TRUE              FALSE        FALSE              
 76         3508 FALSE            TRUE              FALSE        FALSE              
 77         3575 NA               NA                NA           NA                 
 78         3576 FALSE            FALSE             FALSE        FALSE              
 79         3577 FALSE            TRUE              FALSE        FALSE              
 80         3588 FALSE            TRUE              FALSE        FALSE              
 81         3644 FALSE            TRUE              FALSE        FALSE              
 82         3760 FALSE            TRUE              FALSE        FALSE              
 83         3828 FALSE            TRUE              FALSE        FALSE              
 84         3838 FALSE            TRUE              FALSE        FALSE              
 85         3889 TRUE             FALSE             FALSE        FALSE              
 86         3893 FALSE            TRUE              FALSE        FALSE              
 87         4017 FALSE            TRUE              FALSE        FALSE              
 88         4077 FALSE            TRUE              FALSE        FALSE              
 89         4092 FALSE            TRUE              FALSE        FALSE              
 90         4099 FALSE            TRUE              FALSE        FALSE              
 91         4115 FALSE            TRUE              FALSE        FALSE              
 92         4183 FALSE            TRUE              FALSE        FALSE              
 93         4186 FALSE            TRUE              FALSE        FALSE              
 94         4189 FALSE            TRUE              FALSE        FALSE              
 95         4256 NA               NA                NA           NA                 
 96         4264 NA               NA                NA           NA                 
 97         4271 NA               NA                NA           NA                 
 98         4273 NA               NA                NA           NA                 
 99         4284 NA               NA                NA           NA                 
100         4357 FALSE            TRUE              FALSE        FALSE              
101         4432 FALSE            TRUE              FALSE        FALSE              
102         4506 FALSE            TRUE              FALSE        FALSE              
103         4579 FALSE            TRUE              FALSE        FALSE              
104         4591 TRUE             TRUE              FALSE        FALSE              
105         4608 FALSE            TRUE              FALSE        FALSE              
106         4731 FALSE            TRUE              FALSE        FALSE              
107         4760 FALSE            TRUE              FALSE        FALSE              
108         4813 FALSE            TRUE              FALSE        FALSE              
109         4960 NA               NA                NA           NA                 
110         4970 FALSE            TRUE              FALSE        FALSE              
111         4983 NA               NA                NA           NA                 
112         5104 FALSE            TRUE              FALSE        FALSE              
113         5163 FALSE            TRUE              FALSE        FALSE              
114         5193 FALSE            TRUE              FALSE        FALSE              
115         5283 FALSE            TRUE              FALSE        FALSE              
116         5323 FALSE            TRUE              FALSE        FALSE              
117         5342 FALSE            TRUE              FALSE        FALSE              
118         5361 NA               NA                NA           NA                 
119         5416 TRUE             FALSE             FALSE        FALSE              
120         5436 FALSE            TRUE              FALSE        FALSE              
121         5463 FALSE            TRUE              FALSE        FALSE              
122         5533 FALSE            TRUE              FALSE        FALSE              
123         5646 FALSE            TRUE              FALSE        FALSE              
124         5773 NA               NA                NA           NA                 
125         5835 FALSE            TRUE              FALSE        FALSE              
126         5922 TRUE             FALSE             FALSE        FALSE              
127         5962 FALSE            TRUE              FALSE        FALSE              
128         5983 FALSE            TRUE              FALSE        FALSE              
129         6054 NA               NA                NA           NA                 
130         6056 NA               NA                NA           NA                 
131         6061 NA               NA                NA           NA                 
132         6064 NA               NA                NA           NA                 
133         6089 NA               NA                NA           NA                 
134         6151 FALSE            TRUE              FALSE        FALSE              
135         6223 NA               NA                NA           NA                 
136         6224 FALSE            FALSE             FALSE        FALSE              
137         6225 FALSE            TRUE              FALSE        FALSE              
138         6226 NA               NA                NA           NA                 
139         6227 FALSE            FALSE             FALSE        FALSE              
140         6228 FALSE            TRUE              FALSE        FALSE              
141         6230 FALSE            TRUE              FALSE        FALSE              
142         6232 FALSE            TRUE              FALSE        FALSE              
143         6288 FALSE            TRUE              FALSE        FALSE              
144         6312 FALSE            TRUE              FALSE        FALSE              
145         6317 FALSE            TRUE              FALSE        FALSE              
146         6339 TRUE             FALSE             FALSE        FALSE              
147         6344 FALSE            TRUE              FALSE        FALSE              
148         6378 TRUE             FALSE             FALSE        FALSE              
149         6402 FALSE            TRUE              FALSE        FALSE              
150         6408 FALSE            TRUE              FALSE        FALSE              
151         6468 FALSE            TRUE              FALSE        FALSE              
152         6516 FALSE            TRUE              FALSE        FALSE              
153         6519 FALSE            TRUE              FALSE        FALSE              
154         6522 FALSE            TRUE              FALSE        TRUE               
155         6526 FALSE            FALSE             FALSE        FALSE              
156         6527 FALSE            TRUE              FALSE        FALSE              
157         6605 FALSE            TRUE              FALSE        FALSE              
158         6671 FALSE            TRUE              FALSE        FALSE              
159         6676 FALSE            TRUE              FALSE        FALSE              
160         6683 FALSE            TRUE              FALSE        FALSE              
161         6686 FALSE            TRUE              FALSE        FALSE              
162         6709 FALSE            TRUE              FALSE        FALSE              
163         6737 NA               NA                NA           NA                 
164         6766 NA               NA                NA           NA                 
165         6775 FALSE            TRUE              FALSE        FALSE              
166         6790 FALSE            TRUE              FALSE        FALSE              
167         6797 FALSE            TRUE              FALSE        FALSE              
168         6905 NA               NA                NA           NA                 
169         6917 TRUE             FALSE             FALSE        FALSE              
170         6970 NA               NA                NA           NA                 
171         6978 NA               NA                NA           NA                 
172         7033 FALSE            TRUE              FALSE        FALSE              
173         7177 NA               NA                NA           NA                 
174         7219 FALSE            TRUE              FALSE        FALSE              
175         7289 TRUE             FALSE             FALSE        FALSE              
176         7294 FALSE            TRUE              FALSE        FALSE              
177         7295 TRUE             FALSE             FALSE        FALSE              
178         7414 TRUE             FALSE             FALSE        FALSE              
179         7429 NA               NA                NA           NA                 
180         7447 FALSE            FALSE             TRUE         FALSE              
181         7456 FALSE            FALSE             TRUE         FALSE              
182         7463 NA               NA                NA           NA                 
183         7464 FALSE            FALSE             TRUE         FALSE              
184         7472 FALSE            FALSE             TRUE         FALSE              
185         7478 FALSE            TRUE              FALSE        FALSE              
186         7481 FALSE            FALSE             TRUE         FALSE              
187         7488 FALSE            TRUE              FALSE        FALSE              
188         7490 FALSE            FALSE             TRUE         FALSE              
189         7500 FALSE            FALSE             TRUE         FALSE              
190         7510 FALSE            FALSE             TRUE         FALSE       

Note that the cumulative score issue (similar to the tiebreaker one) relates to a slight shift in the location of the tables in the "#final_jeopardy_round" nodeset. For example, for game 20 we get the following error:

> library(whatr)
> x = whatr::whatr_html(20)
> whatr::whatr_data(x)
Error in UseMethod("html_table") : 
  no applicable method for 'html_table' applied to an object of class "xml_missing"
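
The error message indicates that a selector matched no node, so `html_table()` was handed an object of class "xml_missing". A minimal sketch of a guard in base R follows; `table_or_null()` is a hypothetical helper, not part of {whatr}, and its `parse` argument stands in for whatever function reads the table (for example `rvest::html_table`).

```r
# Hypothetical helper (not part of {whatr}): only parse a node when the
# selector actually matched something; otherwise return NULL so the caller
# can fall back to another table position.
table_or_null <- function(node, parse) {
  if (inherits(node, "xml_missing")) {
    return(NULL)
  }
  parse(node)
}

# Illustration with a fake missing node; a real one would come from a
# selector lookup that matched nothing.
fake_missing <- structure(list(), class = "xml_missing")
table_or_null(fake_missing, parse = identity)
```

A caller could then try the shifted table position whenever the first lookup returns `NULL`, rather than failing outright.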

As for the triple round error, this largely relates to the recent inclusion of Triple J! Rounds (see, for example, game 7447). For such games, we receive the following error:

> library(whatr)
> x = whatr::whatr_html(7447)
> whatr::whatr_data(x)
Error in `dplyr::mutate()`:
! Problem while computing `round = c(rep(1L, 6), rep(2L, 6), 3L)`.
✖ `round` must be size 19 or 1, not 13.
Run `rlang::last_error()` to see where the error occurred.
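
The message shows a hard-coded round index of length 13 being assigned to 19 rows, which would be consistent with three six-category rounds plus a final. One possible approach is to derive the index from per-round counts instead of hard-coding it. This is only a sketch: `round_index()` is a hypothetical helper, and the counts are assumed to have been extracted from the page beforehand.

```r
# Hypothetical helper (not part of {whatr}): build the round index from
# the number of entries found in each round.
round_index <- function(n_per_round) {
  rep(seq_along(n_per_round), times = n_per_round)
}

round_index(c(6L, 6L, 1L))     # two rounds plus a final
round_index(c(6L, 6L, 6L, 1L)) # three rounds plus a final
```

For the two-round case this reproduces `c(rep(1L, 6), rep(2L, 6), 3L)` from the error message, and it extends to a third round without further changes.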

Finally, noncharacter scores arise from rare instances where score values (in the showscores tab) do not go above 1,000 (and hence are not separated by a comma) and/or do not have a dollar sign in front of the score, so the column is read as an integer (i.e., noncharacter) column. Game 1022 provides such an example:

> library(whatr)
> x = whatr::whatr_html(1022)
> whatr::whatr_data(x)
Error in `pivot_longer_spec()`:
! Can't combine `Bob` <character> and `Dave` <integer>.
Run `rlang::last_error()` to see where the error occurred.
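
One way around this type mismatch is to normalise every player column to the same type before pivoting. A sketch in base R, assuming scores arrive either as strings such as "$1,000" or as bare integers; `parse_score()` is a hypothetical helper, and the data frame below is made-up illustration data, not the contents of game 1022.

```r
# Hypothetical helper (not part of {whatr}): strip dollar signs and
# thousands separators, then convert to integer. Works for character
# and integer input alike.
parse_score <- function(x) {
  as.integer(gsub("[$,]", "", as.character(x)))
}

# Made-up example with one character column and one integer column.
scores <- data.frame(Bob = c("$1,000", "-$400"), Dave = c(800L, 600L))
scores[] <- lapply(scores, parse_score)
str(scores)
```

Once all player columns share one type, combining them into a single long column should no longer raise the "Can't combine" error.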

I have been able to tweak the code to handle these cases, in large part inspired by and building upon @john-b-edwards's recent work in #13. Happy to share these changes (I am still fairly new to GitHub and would need to learn a bit more about PRs).

Note that a few more errors do not fall under any of the issues above. Some come from less common edge cases: games 3576, 6224, and 6227 all correspond to games split across different shows (the Watson game being a notable one). Others come from games with largely incomplete information (represented by the NAs in the list above).

john-b-edwards commented 2 years ago

Ah excellent catch -- I totally forgot that the tournaments had tiebreakers (whereas regular season games did not until 2018).

If you're new to GH, you might have luck picking it up quickly with the GitHub desktop app -- it's fairly intuitive and you can use this link to see how to view and submit PRs.

k5cents commented 2 years ago

Good catches, guys! These are some of the annoying edge cases that pop up when trying to make a single scraper work with thousands of potentially slightly different webpages.

I kind of hate the way I coded this in the first place, so I actually struggle to dig in and fix these bugs myself. I will try to give some attention to any PRs that come in. Appreciate the help.