rfordatascience / tidytuesday

Official repo for the #tidytuesday project
Creative Commons Zero v1.0 Universal
6.78k stars 2.39k forks source link

Missing Columns in Week 39 Olympiad Data: country_results_df missing `total` and `rank` #757

Closed nicolasfoss closed 4 days ago

nicolasfoss commented 4 days ago

Hello, Jon:

First, thanks for all the work you put into these. I found a missing column in the country_results_df.

In your README, you show the following columns for country_results_df (see below), but columns 13-14 do not show up in the file we can import for #tidytuesday

country_results_df.csv

column_number variable class description
1 year integer Year of IMO
2 country character Participating country
3 team_size_all integer Participating contestants
4 team_size_male integer Male contestants
5 team_size_female integer Female contestants
6 p1 integer Score on problem 1
7 p2 integer Score on problem 2
8 p3 integer Score on problem 3
9 p4 integer Score on problem 4
10 p5 integer Score on problem 5
11 p6 integer Score on problem 6
12 p7 integer Score on problem 7
13 total integer Total score on all problems
14 rank integer Country rank
15 awards_gold integer Number of gold medals
16 awards_silver integer Number of silver medals
17 awards_bronze integer Number of bronze medals
18 awards_honorable_mentions integer Number of honorable mentions
19 leader character Leader of country team
20 deputy_leader character Deputy leader of country team

You will notice that the total and rank columns are missing below.

# Load necessary libraries
library(tidyverse)
library(tidytuesdayR)

# Download the data for 2024-09-24 #TidyTuesday
tuesdata <- tidytuesdayR::tt_load('2024-09-24')

# Load the country_results_df dataset
country_results_df <- tuesdata$country_results_df

# Check the structure of the data
country_results_df %>% 
  glimpse()
#> Rows: 3,780
#> Columns: 18
#> $ year                      <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
#> $ country                   <chr> "United States of America", "People's Republ…
#> $ team_size_all             <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ team_size_male            <dbl> 5, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 5, 5, 5, 6,…
#> $ team_size_female          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
#> $ p1                        <dbl> 42, 42, 42, 42, 42, 42, 42, 42, 42, 38, 42, …
#> $ p2                        <dbl> 41, 42, 37, 34, 30, 37, 33, 37, 25, 37, 33, …
#> $ p3                        <dbl> 19, 31, 18, 11, 10, 7, 8, 16, 5, 5, 6, 2, 3,…
#> $ p4                        <dbl> 40, 40, 42, 42, 42, 42, 42, 36, 42, 42, 42, …
#> $ p5                        <dbl> 35, 22, 7, 28, 36, 29, 31, 23, 35, 12, 23, 1…
#> $ p6                        <dbl> 15, 13, 22, 10, 5, 5, 6, 1, 2, 17, 3, 9, 3, …
#> $ p7                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ awards_gold               <dbl> 5, 5, 2, 4, 4, 1, 2, 2, 1, 2, 2, 1, 3, 1, 2,…
#> $ awards_silver             <dbl> 1, 1, 4, 1, 0, 5, 3, 3, 4, 2, 2, 4, 1, 3, 2,…
#> $ awards_bronze             <dbl> 0, 0, 0, 0, 2, 0, 1, 1, 1, 2, 2, 1, 2, 2, 1,…
#> $ awards_honorable_mentions <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ leader                    <chr> "John Berman", "Liang Xiao", "Suyoung Choi",…
#> $ deputy_leader             <chr> "Carl Schildkraut", "Yijun Yao", "Hwajong Yo…

Below is a snippet from the cleaning script in the README, where the total and rank columns are not selected within the select() statement.

### NOT RUN {
# country_results_df <- map_df(timeline_df$year, scrape_country) %>% 
#  select( 
#    year,
#    country,
#    team_size_all = team_size,
#    team_size_male = team_size_2,
#    team_size_female = team_size_3,
#    starts_with("p"), # <--- `total` and then `rank` are not included here
#    awards_gold = awards,
#    awards_silver = awards_2,
#    awards_bronze = awards_3,
#    awards_honorable_mentions = awards_4,
#    leader,
#    deputy_leader
#  ) %>% 
#  mutate(
#    across(
#      c(team_size_all:awards_honorable_mentions),
#      as.integer
#    )
#  )}

Here, I will propose a fix, it is simply an update to the select() statement:

# re-run the cleaning script

library(tidyverse)
library(rvest)
library(janitor)
library(httr2)

  timeline_df <- read_html("https://www.imo-official.org/organizers.aspx") %>%
    html_table() %>%
    .[[1]] %>%
    clean_names() %>%
    rename(
      "all_contestant" = contestants,
      "male_contestant" = contestants_2,
      "female_contestant" = contestants_3,
      "edition" = number
    ) %>%
    filter(edition != "#") %>%
    mutate(
      start_date = paste0(gsub("(.*)(-)(.*)", "\\1", date),year),
      end_date = paste0(gsub("(.*)(-)(.*)", "\\3", date),year),
      across(
        c(start_date, end_date),
        ~as.Date(.x, format = "%d.%m.%Y")
      ),
      across(
        c(edition, year, countries, all_contestant, male_contestant, female_contestant),
        as.integer
      )
    ) %>%
    select(-date) %>%
    # only keeping records till current year
    filter(year < 2025)

  # circulate through country results link and rbind tables
  scrape_country <- function(year) {
    paste0("https://www.imo-official.org/year_country_r.aspx?year=", year) %>%
      read_html() %>%
      html_table() %>%
      .[[1]] %>%
      clean_names() %>%
      filter(country != "Country") %>%
      mutate(year = year, .before = "country") 
  }

After the initial setup, we can manipulate the data and select total and rank along with the rest of the good stuff.

# try an updated select() statement
country_results_df <- map_df(timeline_df$year, scrape_country) %>%
  select(
    year,
    country,
    team_size_all = team_size,
    team_size_male = team_size_2,
    team_size_female = team_size_3,
    starts_with("p"),
    total, # <--- total included
    rank, # <--- rank included
    awards_gold = awards,
    awards_silver = awards_2,
    awards_bronze = awards_3,
    awards_honorable_mentions = awards_4,
    leader,
    deputy_leader
  ) %>% 
  mutate(
    across(
      c(team_size_all:awards_honorable_mentions),
      as.integer
    )
  )
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(c(team_size_all:awards_honorable_mentions),
#>   as.integer)`.
#> Caused by warning:
#> ! NAs introduced by coercion

We can see below that total and rank are included, and we are off to the...math competition!

# check

country_results_df %>% 
  glimpse() # total and rank are included
#> Rows: 3,780
#> Columns: 20
#> $ year                      <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
#> $ country                   <chr> "United States of America", "People's Republ…
#> $ team_size_all             <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ team_size_male            <int> 5, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 5, 5, 5, 6,…
#> $ team_size_female          <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
#> $ p1                        <int> 42, 42, 42, 42, 42, 42, 42, 42, 42, 38, 42, …
#> $ p2                        <int> 41, 42, 37, 34, 30, 37, 33, 37, 25, 37, 33, …
#> $ p3                        <int> 19, 31, 18, 11, 10, 7, 8, 16, 5, 5, 6, 2, 3,…
#> $ p4                        <int> 40, 40, 42, 42, 42, 42, 42, 36, 42, 42, 42, …
#> $ p5                        <int> 35, 22, 7, 28, 36, 29, 31, 23, 35, 12, 23, 1…
#> $ p6                        <int> 15, 13, 22, 10, 5, 5, 6, 1, 2, 17, 3, 9, 3, …
#> $ p7                        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ total                     <int> 192, 190, 168, 167, 165, 162, 162, 155, 151,…
#> $ rank                      <int> 1, 2, 3, 4, 5, 6, 6, 8, 9, 9, 11, 12, 13, 14…
#> $ awards_gold               <int> 5, 5, 2, 4, 4, 1, 2, 2, 1, 2, 2, 1, 3, 1, 2,…
#> $ awards_silver             <int> 1, 1, 4, 1, 0, 5, 3, 3, 4, 2, 2, 4, 1, 3, 2,…
#> $ awards_bronze             <int> 0, 0, 0, 0, 2, 0, 1, 1, 1, 2, 2, 1, 2, 2, 1,…
#> $ awards_honorable_mentions <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ leader                    <chr> "John Berman", "Liang Xiao", "Suyoung Choi",…
#> $ deputy_leader             <chr> "Carl Schildkraut", "Yijun Yao", "Hwajong Yo…

Thanks again for all your work on the #tidytuesday project, it is such a great opportunity to learn and grow with others.

Created on 2024-09-25 with reprex v2.1.1

jonthegeek commented 4 days ago

Since people have already been using the data for a couple days, I remove the column from the dictionary. Thanks for the catch, but changing things can break a lot of people's code, so we try to leave things alone as much as possible once a dataset is released!

nicolasfoss commented 4 days ago

@jonthegeek Sounds good, I will just get that added on my end and use in my analysis.

nicolasfoss commented 4 days ago

@jonthegeek Sorry to be a bother, but I did see you pushed a fix to the readme.md to remove rank, you might also remove total as I updated my issue a few times as I caught that one as well. I just want to be helpful and am thankful for your responsiveness. Have a good one!