microsoft / datamations

https://microsoft.github.io/datamations/

Fix small salary data #50

Closed by sharlagelfand 3 years ago

sharlagelfand commented 3 years ago

Looks like there are actually two small salary data sets, which give different results:

library(dplyr)
library(datamations)

small_salary
#> # A tibble: 100 x 6
#>       ID Degree  Work     Salary i     order
#>    <int> <fct>   <fct>     <dbl> <chr> <int>
#>  1    22 Masters Academia   81.9 id        1
#>  2    96 PhD     Academia   84.5 id        2
#>  3    10 Masters Academia   82.9 id        3
#>  4    42 PhD     Academia   83.8 id        4
#>  5    55 PhD     Academia   83.8 id        5
#>  6    14 PhD     Academia   85.3 id        6
#>  7    33 PhD     Industry   91.4 id        7
#>  8   100 PhD     Academia   85.3 id        8
#>  9    57 Masters Academia   83.3 id        9
#> 10     2 PhD     Industry   92.3 id       10
#> # … with 90 more rows

small_salary %>% 
  group_by(Degree) %>%
  summarise(mean = mean(Salary))
#> # A tibble: 2 x 2
#>   Degree   mean
#>   <fct>   <dbl>
#> 1 Masters  90.2
#> 2 PhD      88.2

small_salary_data
#> # A tibble: 30 x 3
#>    Degree  Work     Salary
#>    <chr>   <chr>     <dbl>
#>  1 Masters Industry     86
#>  2 Masters Academia     71
#>  3 PhD     Industry    104
#>  4 Masters Industry     94
#>  5 Masters Academia     93
#>  6 Masters Academia     96
#>  7 PhD     Academia    100
#>  8 Masters Industry     86
#>  9 PhD     Academia     80
#> 10 Masters Industry     85
#> # … with 20 more rows

small_salary_data %>%
  group_by(Degree) %>% 
  summarise(mean = mean(Salary))
#> # A tibble: 2 x 2
#>   Degree   mean
#>   <chr>   <dbl>
#> 1 Masters  90.6
#> 2 PhD      92.1

@jhofman can you confirm that the one we want is the first, with means 90.2 and 88.2?

sharlagelfand commented 3 years ago

Just to update, we want to use salary_data, not small_salary

n = 3000 in this case, and I'm running into performance issues. The plotting itself seems fine, but gemini is struggling to animate this much data:

https://user-images.githubusercontent.com/15895337/118666335-17c09400-b7c1-11eb-80f7-0d4030013901.mov

sharlagelfand commented 3 years ago

We want to use small_salary, but we should think more about downsampling (either telling users to do it or doing it ourselves with a warning). I'll open a new issue for that.
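
For reference, a rough sketch of what the "do it ourselves with a warning" option could look like. Nothing here is part of datamations: `downsample_with_warning()` is a hypothetical helper, the `max_rows = 500` cutoff is an arbitrary assumption, and the data frame is a synthetic stand-in rather than the real salary data.

```r
# Hypothetical helper (not a datamations function): cap the number of rows
# passed on to the animation and warn when rows are dropped.
# max_rows = 500 is an arbitrary, assumed threshold.
downsample_with_warning <- function(data, max_rows = 500) {
  n <- nrow(data)
  if (n <= max_rows) {
    return(data)  # small enough already, return unchanged
  }
  warning(
    "Data has ", n, " rows; randomly sampling ", max_rows,
    " rows to keep the animation responsive.",
    call. = FALSE
  )
  # Sample row indices without replacement, keeping the original row order
  idx <- sort(sample(n, max_rows))
  data[idx, , drop = FALSE]
}

# Synthetic stand-in for a 3000-row data set (values are made up)
big <- data.frame(
  Degree = rep(c("Masters", "PhD"), each = 1500),
  Salary = seq_len(3000)
)

set.seed(1)  # only so the sample is reproducible
small <- downsample_with_warning(big)
nrow(small)
#> [1] 500
```

This takes a simple random sample, so group sizes in the result are only approximately proportional to the original. Sampling within each Degree and Work group would be another option if preserving the group structure matters for the animation.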

sharlagelfand commented 3 years ago

small_salary is being used in my branch, so I'll close this now!