Open john-b-edwards opened 2 years ago
Nice find! whatr_data() is really just a list of all the other functions put together.
list(
  info = whatr_airdate(showgame),
  summary = whatr_synopsis(showgame),
  players = whatr_players(showgame),
  scores = whatr_scores(showscores),
  board = whatr_board(showgame)
)
So it must be one or more of those that is breaking.
Given the rarity of this bug, I can't say I will solve it myself very soon.
But anybody should feel free to dig into the code and submit a PR with any fixes!
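One way to narrow this down is to call each component on its own and record which ones throw. The following is only a sketch in base R: find_failing() is a hypothetical helper (not part of {whatr}), and the commented usage assumes the component functions accept the same object that whatr_data() hands them.

```r
# Run each parser separately and report which ones fail.
# `parsers` is a named list of functions; `input` is passed to each one.
# Returns a named character vector: NA for success, the error message otherwise.
find_failing <- function(parsers, input) {
  vapply(
    parsers,
    function(f) {
      tryCatch(
        {
          f(input)
          NA_character_
        },
        error = function(e) conditionMessage(e)
      )
    },
    character(1)
  )
}

# Hypothetical usage (needs {whatr} and network access, so left commented out):
# showgame <- whatr::whatr_html(20)
# find_failing(
#   list(players = whatr::whatr_players, board = whatr::whatr_board),
#   showgame
# )
```

Anything that comes back non-NA is a candidate for the fix.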
Hi both,
I recently came across parsing issues when using whatr_data() as well, and took a deeper look into what was causing them and how many games are affected overall. I found 14 games which had tiebreakers (as of 25 November). Beyond this particular issue, I also came across several others that cause parsing failures. The following list flags, alongside tiebreakers, the three other main issues:
> missing_games %>%
+ tibble::as_tibble()
# A tibble: 190 × 5
missing_game tiebreaker_round cumulative_scores triple_round noncharacter_scores
<int> <lgl> <lgl> <lgl> <lgl>
1 20 FALSE TRUE FALSE FALSE
2 57 FALSE TRUE FALSE FALSE
3 115 FALSE TRUE FALSE FALSE
4 124 FALSE TRUE FALSE FALSE
5 141 FALSE TRUE FALSE FALSE
6 156 FALSE TRUE FALSE FALSE
7 182 FALSE TRUE FALSE FALSE
8 195 FALSE TRUE FALSE FALSE
9 223 FALSE TRUE FALSE FALSE
10 315 FALSE TRUE FALSE FALSE
11 317 FALSE TRUE FALSE FALSE
12 322 FALSE TRUE FALSE FALSE
13 324 FALSE TRUE FALSE FALSE
14 326 FALSE TRUE FALSE FALSE
15 329 FALSE TRUE FALSE FALSE
16 330 FALSE TRUE FALSE FALSE
17 420 FALSE TRUE FALSE FALSE
18 485 FALSE TRUE FALSE FALSE
19 500 FALSE TRUE FALSE FALSE
20 566 FALSE TRUE FALSE FALSE
21 616 FALSE TRUE FALSE FALSE
22 629 FALSE TRUE FALSE FALSE
23 673 FALSE TRUE FALSE FALSE
24 685 FALSE TRUE FALSE FALSE
25 737 FALSE TRUE FALSE FALSE
26 787 FALSE TRUE FALSE FALSE
27 788 FALSE TRUE FALSE FALSE
28 902 FALSE TRUE FALSE FALSE
29 971 FALSE TRUE FALSE FALSE
30 1022 FALSE FALSE FALSE TRUE
31 1037 FALSE TRUE FALSE FALSE
32 1129 FALSE TRUE FALSE FALSE
33 1132 NA NA NA NA
34 1133 NA NA NA NA
35 1134 NA NA NA NA
36 1135 NA NA NA NA
37 1152 TRUE TRUE FALSE FALSE
38 1153 NA NA NA NA
39 1271 FALSE TRUE FALSE FALSE
40 1305 FALSE TRUE FALSE FALSE
41 1348 FALSE FALSE FALSE TRUE
42 1361 FALSE TRUE FALSE FALSE
43 1421 FALSE TRUE FALSE FALSE
44 1430 FALSE TRUE FALSE FALSE
45 1440 FALSE TRUE FALSE FALSE
46 1473 TRUE FALSE FALSE FALSE
47 1477 FALSE TRUE FALSE FALSE
48 1538 FALSE TRUE FALSE FALSE
49 1592 FALSE TRUE FALSE FALSE
50 1689 FALSE TRUE FALSE FALSE
51 1755 FALSE TRUE FALSE FALSE
52 1851 FALSE TRUE FALSE FALSE
53 1989 FALSE TRUE FALSE FALSE
54 1999 FALSE TRUE FALSE FALSE
55 2140 FALSE TRUE FALSE FALSE
56 2172 TRUE FALSE FALSE FALSE
57 2175 FALSE TRUE FALSE FALSE
58 2298 FALSE TRUE FALSE FALSE
59 2347 FALSE TRUE FALSE FALSE
60 2389 FALSE TRUE FALSE FALSE
61 2464 FALSE TRUE FALSE FALSE
62 2471 FALSE TRUE FALSE FALSE
63 2481 FALSE TRUE FALSE FALSE
64 2536 FALSE TRUE FALSE FALSE
65 2585 FALSE TRUE FALSE FALSE
66 2792 FALSE TRUE FALSE FALSE
67 2927 FALSE TRUE FALSE FALSE
68 2950 FALSE TRUE FALSE FALSE
69 2966 FALSE TRUE FALSE FALSE
70 3008 FALSE TRUE FALSE FALSE
71 3081 TRUE FALSE FALSE FALSE
72 3213 FALSE TRUE FALSE FALSE
73 3314 FALSE TRUE FALSE FALSE
74 3386 FALSE TRUE FALSE FALSE
75 3396 FALSE TRUE FALSE FALSE
76 3508 FALSE TRUE FALSE FALSE
77 3575 NA NA NA NA
78 3576 FALSE FALSE FALSE FALSE
79 3577 FALSE TRUE FALSE FALSE
80 3588 FALSE TRUE FALSE FALSE
81 3644 FALSE TRUE FALSE FALSE
82 3760 FALSE TRUE FALSE FALSE
83 3828 FALSE TRUE FALSE FALSE
84 3838 FALSE TRUE FALSE FALSE
85 3889 TRUE FALSE FALSE FALSE
86 3893 FALSE TRUE FALSE FALSE
87 4017 FALSE TRUE FALSE FALSE
88 4077 FALSE TRUE FALSE FALSE
89 4092 FALSE TRUE FALSE FALSE
90 4099 FALSE TRUE FALSE FALSE
91 4115 FALSE TRUE FALSE FALSE
92 4183 FALSE TRUE FALSE FALSE
93 4186 FALSE TRUE FALSE FALSE
94 4189 FALSE TRUE FALSE FALSE
95 4256 NA NA NA NA
96 4264 NA NA NA NA
97 4271 NA NA NA NA
98 4273 NA NA NA NA
99 4284 NA NA NA NA
100 4357 FALSE TRUE FALSE FALSE
101 4432 FALSE TRUE FALSE FALSE
102 4506 FALSE TRUE FALSE FALSE
103 4579 FALSE TRUE FALSE FALSE
104 4591 TRUE TRUE FALSE FALSE
105 4608 FALSE TRUE FALSE FALSE
106 4731 FALSE TRUE FALSE FALSE
107 4760 FALSE TRUE FALSE FALSE
108 4813 FALSE TRUE FALSE FALSE
109 4960 NA NA NA NA
110 4970 FALSE TRUE FALSE FALSE
111 4983 NA NA NA NA
112 5104 FALSE TRUE FALSE FALSE
113 5163 FALSE TRUE FALSE FALSE
114 5193 FALSE TRUE FALSE FALSE
115 5283 FALSE TRUE FALSE FALSE
116 5323 FALSE TRUE FALSE FALSE
117 5342 FALSE TRUE FALSE FALSE
118 5361 NA NA NA NA
119 5416 TRUE FALSE FALSE FALSE
120 5436 FALSE TRUE FALSE FALSE
121 5463 FALSE TRUE FALSE FALSE
122 5533 FALSE TRUE FALSE FALSE
123 5646 FALSE TRUE FALSE FALSE
124 5773 NA NA NA NA
125 5835 FALSE TRUE FALSE FALSE
126 5922 TRUE FALSE FALSE FALSE
127 5962 FALSE TRUE FALSE FALSE
128 5983 FALSE TRUE FALSE FALSE
129 6054 NA NA NA NA
130 6056 NA NA NA NA
131 6061 NA NA NA NA
132 6064 NA NA NA NA
133 6089 NA NA NA NA
134 6151 FALSE TRUE FALSE FALSE
135 6223 NA NA NA NA
136 6224 FALSE FALSE FALSE FALSE
137 6225 FALSE TRUE FALSE FALSE
138 6226 NA NA NA NA
139 6227 FALSE FALSE FALSE FALSE
140 6228 FALSE TRUE FALSE FALSE
141 6230 FALSE TRUE FALSE FALSE
142 6232 FALSE TRUE FALSE FALSE
143 6288 FALSE TRUE FALSE FALSE
144 6312 FALSE TRUE FALSE FALSE
145 6317 FALSE TRUE FALSE FALSE
146 6339 TRUE FALSE FALSE FALSE
147 6344 FALSE TRUE FALSE FALSE
148 6378 TRUE FALSE FALSE FALSE
149 6402 FALSE TRUE FALSE FALSE
150 6408 FALSE TRUE FALSE FALSE
151 6468 FALSE TRUE FALSE FALSE
152 6516 FALSE TRUE FALSE FALSE
153 6519 FALSE TRUE FALSE FALSE
154 6522 FALSE TRUE FALSE TRUE
155 6526 FALSE FALSE FALSE FALSE
156 6527 FALSE TRUE FALSE FALSE
157 6605 FALSE TRUE FALSE FALSE
158 6671 FALSE TRUE FALSE FALSE
159 6676 FALSE TRUE FALSE FALSE
160 6683 FALSE TRUE FALSE FALSE
161 6686 FALSE TRUE FALSE FALSE
162 6709 FALSE TRUE FALSE FALSE
163 6737 NA NA NA NA
164 6766 NA NA NA NA
165 6775 FALSE TRUE FALSE FALSE
166 6790 FALSE TRUE FALSE FALSE
167 6797 FALSE TRUE FALSE FALSE
168 6905 NA NA NA NA
169 6917 TRUE FALSE FALSE FALSE
170 6970 NA NA NA NA
171 6978 NA NA NA NA
172 7033 FALSE TRUE FALSE FALSE
173 7177 NA NA NA NA
174 7219 FALSE TRUE FALSE FALSE
175 7289 TRUE FALSE FALSE FALSE
176 7294 FALSE TRUE FALSE FALSE
177 7295 TRUE FALSE FALSE FALSE
178 7414 TRUE FALSE FALSE FALSE
179 7429 NA NA NA NA
180 7447 FALSE FALSE TRUE FALSE
181 7456 FALSE FALSE TRUE FALSE
182 7463 NA NA NA NA
183 7464 FALSE FALSE TRUE FALSE
184 7472 FALSE FALSE TRUE FALSE
185 7478 FALSE TRUE FALSE FALSE
186 7481 FALSE FALSE TRUE FALSE
187 7488 FALSE TRUE FALSE FALSE
188 7490 FALSE FALSE TRUE FALSE
189 7500 FALSE FALSE TRUE FALSE
190 7510 FALSE FALSE TRUE FALSE
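A table like this can be tallied per issue with base R. The sketch below uses only six rows copied from the output above (not the full 190), so the counts it produces describe that subset only; the object names are chosen for illustration.

```r
# Six rows copied from the table above, for illustration only.
missing_games <- data.frame(
  missing_game        = c(20L, 1022L, 1132L, 1152L, 1473L, 7447L),
  tiebreaker_round    = c(FALSE, FALSE, NA, TRUE, TRUE, FALSE),
  cumulative_scores   = c(TRUE, FALSE, NA, TRUE, FALSE, FALSE),
  triple_round        = c(FALSE, FALSE, NA, FALSE, FALSE, TRUE),
  noncharacter_scores = c(FALSE, TRUE, NA, FALSE, FALSE, FALSE)
)

flags <- missing_games[, -1]

# Number of games flagged for each issue, ignoring the all-NA rows.
issue_counts <- colSums(flags, na.rm = TRUE)

# Games where every flag is NA, i.e. not classified at all.
unclassified <- missing_games$missing_game[rowSums(is.na(flags)) == ncol(flags)]
```

A game can carry more than one flag (1152 has both a tiebreaker and the cumulative score shift), so the per-issue counts need not sum to the number of games.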
Note that the cumulative score issue (similar to the tiebreaker one) relates to a slight shift in the location of the tables in the "#final_jeopardy_round" nodeset. For example, for game 20 we get the following error:
> library(whatr)
> x = whatr::whatr_html(20)
> whatr::whatr_data(x)
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
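The "xml_missing" class in that message is what rvest::html_element() returns when a selector matches nothing, so one possible guard is to check for it before calling html_table(). This is a sketch, not the code in {whatr}: safe_table() is a hypothetical helper, the inline HTML is made up, and "#missing_round" is an invented selector used only to trigger the missing case.

```r
library(rvest)

# Return the table at `css` as a data frame, or NULL when the node is absent.
safe_table <- function(page, css) {
  node <- html_element(page, css)
  if (inherits(node, "xml_missing")) {
    return(NULL)
  }
  html_table(node)
}

# Inline HTML standing in for a game page (made up for illustration).
page <- minimal_html('
  <div id="final_jeopardy_round">
    <table>
      <tr><th>Player</th><th>Score</th></tr>
      <tr><td>A</td><td>100</td></tr>
    </table>
  </div>')

found   <- safe_table(page, "#final_jeopardy_round table")
missing <- safe_table(page, "#missing_round table")
```

Returning NULL lets the caller decide whether to try a fallback location for the table or to skip the game.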
As for the triple round error, this largely relates to the recent inclusion of Triple J! Rounds (see, for example, game 7447). For such games, we receive the following error:
> library(whatr)
> x = whatr::whatr_html(7447)
> whatr::whatr_data(x)
Error in `dplyr::mutate()`:
! Problem while computing `round = c(rep(1L, 6), rep(2L, 6), 3L)`.
✖ `round` must be size 19 or 1, not 13.
Run `rlang::last_error()` to see where the error occurred.
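The hardcoded vector in that message has 13 entries (6 + 6 + 1), while a game with a third round has 19 categories. One possible way around this is to derive the round index from the number of categories. The sketch below assumes six categories per regular round and one category for each remaining round (final or tiebreaker); round_index() is a hypothetical helper, not part of {whatr}.

```r
# Build a round index from the number of categories instead of hardcoding it.
# Assumes `per_round` categories in each regular round and a single category
# for every leftover round (final, tiebreaker).
round_index <- function(n_categories, per_round = 6L) {
  n_categories <- as.integer(n_categories)
  per_round <- as.integer(per_round)
  n_full <- n_categories %/% per_round
  n_extra <- n_categories %% per_round
  c(rep(seq_len(n_full), each = per_round), n_full + seq_len(n_extra))
}

two_round_game   <- round_index(13)  # same values as the hardcoded vector
three_round_game <- round_index(19)  # length 19, final round numbered 4
```

Under the same assumption, a 14-category game (two rounds, final, tiebreaker) would get rounds 3 and 4 for its last two categories.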
Finally, noncharacter scores arise in rare instances where score values (in the showscores tab) do not go above 1,000 (and hence are not separated by a comma) and/or do not have a dollar sign in front of the score, so the column is read as integer (i.e., noncharacter). Game 1022 provides such an example:
> library(whatr)
> x = whatr::whatr_html(1022)
> whatr::whatr_data(x)
Error in `pivot_longer_spec()`:
! Can't combine `Bob` <character> and `Dave` <integer>.
Run `rlang::last_error()` to see where the error occurred.
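One possible fix is to bring every score column to a common type before pivoting. The sketch below normalises to integer in base R; parse_score() is a hypothetical helper, and the values in the data frame are made up (only the column names Bob and Dave come from the error message).

```r
# Strip "$" and "," and convert to integer; works for character or integer input.
parse_score <- function(x) {
  as.integer(gsub("[$,]", "", as.character(x)))
}

# Made-up scores: one column parsed as character, the other as integer.
scores <- data.frame(
  Bob  = c("$1,000", "$2,400"),
  Dave = c(200L, 800L),
  stringsAsFactors = FALSE
)

# Apply the same conversion to every column so they can be combined.
scores[] <- lapply(scores, parse_score)
```

Coercing everything to character instead would also let the pivot succeed, at the cost of parsing the numbers later.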
I have been able to tweak the code to handle these cases, in large part inspired by and building upon @john-b-edwards's recent work in #13. Happy to share these changes (I am still fairly new to GitHub and would need to learn a bit more about PRs).
Note that a few more errors do not belong to any of the issues mentioned above. Some come from less common edge cases, i.e., games 3576, 6224, and 6227, all of which are games split across different shows (the Watson game being a notable one). Others come from games with largely incomplete information (represented by the NAs in the above list).
Ah excellent catch -- I totally forgot that the tournaments had tiebreakers (whereas regular season games did not until 2018).
If you're new to GH, you might have luck picking it up quickly with the GitHub desktop app -- it's fairly intuitive and you can use this link to see how to view and submit PRs.
Good catches guys! These are some of the annoying edge cases that pop up when trying to make a single scraper work with thousands of potentially slightly different webpages.
I kind of hate the way I coded this in the first place, so I actually struggle to look into fixing these bugs myself. Will try and give some attention to any PRs that come in. Appreciate the help.
Whenever a game goes to a tiebreaker (five such instances since 2018, so fairly rare), whatr_data() cannot parse information about the board/game state. This is a fairly niche edge case, but still one that breaks the existing code.
Tiebreaker games seem to affect the following functions in {whatr} in addition to whatr_data():
whatr_answers()
whatr_board()
whatr_categories()
whatr_clues()
whatr_doubles()
whatr_plot()
whatr_scores()
whatr_synopsis()
Other whatr_* functions seem unaffected by this edge case.
Based on the J! Archive, I believe this is the list of all tiebreaker games: