tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Inconsistent results using rowwise() + mutate() + c_across() to calculate sd(), median() and mean(): the calculation breaks if sd() is calculated first? #7092

Closed ChristianRohde closed 1 month ago

ChristianRohde commented 1 month ago

Hi, I want to calculate sd(), mean() and median(), and I modified your c_across() example from dplyr version 1.1.4 to do so. In particular, I want to make the calculation fail-safe in case I only have a single numeric column. In that case I would expect a result for mean and median, while sd() should probably fail and return NA. However, when I put sd() in front of the other calculations within a mutate(), all calculations return NA. This is not what I expected. I would expect to get NA only for the sd column.

library(dplyr)

df <- tibble(id = paste0("col", 1:4), w = runif(4))

df %>%
  rowwise() %>%
  mutate(
    sd = stats::sd(c_across(where(is.numeric))),
    mean = base::mean(c_across(where(is.numeric))),
    median = stats::median(c_across(where(is.numeric)))
  ) %>%
  ungroup()

# A tibble: 4 × 5
  id        w    sd  mean median
  <chr> <dbl> <dbl> <dbl>  <dbl>
1 col1  0.791    NA    NA     NA
2 col2  0.516    NA    NA     NA
3 col3  0.995    NA    NA     NA
4 col4  0.893    NA    NA     NA

Interestingly, the calculation works if sd() is moved to the end. In that case I get 0 instead of NA for sd, which might be OK, although I would actually expect NA. I am not sure what is happening here:


df %>%
  rowwise() %>%
  mutate(
    mean = base::mean(c_across(where(is.numeric))),
    median = stats::median(c_across(where(is.numeric))),
    sd = stats::sd(c_across(where(is.numeric)))
  ) %>%
  ungroup()

# A tibble: 4 × 5
  id        w  mean median    sd
  <chr> <dbl> <dbl>  <dbl> <dbl>
1 col1  0.791 0.791  0.791     0
2 col2  0.516 0.516  0.516     0
3 col3  0.995 0.995  0.995     0
4 col4  0.893 0.893  0.893     0

I know that my question is a bit artificial and that I could avoid this scenario. On the other hand, it would be nice to get consistent results that I can understand even in such a case. Does someone have an explanation? Answers from AI are running in circles :D

Best, Christian

ChristianRohde commented 1 month ago

I was a bit surprised by the results, because they tell me that, within a single mutate() call, the calculations are done one after the other and each one sees the columns created before it. In my second example, after mean is calculated on the only numeric column "w", the median is calculated on the columns c("w", "mean"), and finally sd() is calculated on the columns c("w", "mean", "median"). It makes sense that this results in 0 and not NA, since there is zero variation among three identical values.

My first example now also makes sense, since I did not set na.rm = TRUE: sd() of the single value in "w" is NA, and every following calculation includes that NA sd column, so it is NA as well. I have to admit that my understanding of how dplyr::rowwise() works in combination with dplyr::mutate() and dplyr::c_across() was different before. I had thought that mean, median and sd would each be calculated on the numeric columns of the initial table. I apologize for the confusion and for reporting this as an issue.
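For the behaviour originally expected (each statistic computed only on the numeric columns of the initial table), one option is to fix the column selection before mutate() runs, so that columns created inside the same mutate() are not picked up. The following is a minimal sketch under that assumption, not an official recommendation from the dplyr maintainers; `num_cols` and `res` are hypothetical helper names that do not appear in the original post.

```r
library(dplyr)

df <- tibble(id = paste0("col", 1:4), w = runif(4))

# Hypothetical helper: names of the numeric columns that exist *before* mutate().
# Here this is just "w".
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

res <- df %>%
  rowwise() %>%
  mutate(
    # all_of(num_cols) selects a fixed set of columns, so the newly created
    # sd, mean and median columns are never included in later calculations.
    sd     = stats::sd(c_across(all_of(num_cols))),
    mean   = base::mean(c_across(all_of(num_cols))),
    median = stats::median(c_across(all_of(num_cols)))
  ) %>%
  ungroup()

res
```

With a single numeric column, sd is NA in every row (the standard deviation of one value is undefined), while mean and median both equal w, regardless of the order of the three expressions.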