tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

summarise_each(funs(median)) behaves erratically; plyr::colwise(median) is stable #1332

Closed shntnu closed 9 years ago

shntnu commented 9 years ago

Get featdata1 from here to reproduce this result: featdata1.rds

This gives a different result each time it is run:
> featdata1 %>% dplyr::summarise_each(funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                    feat value
1380 Cytoplasm_Texture_InfoMeas2_DNA_5_0    [[
1381 Cytoplasm_Texture_InfoMeas2_ER_10_0    [[
1382   Cells_Texture_SumEntropy_Mito_5_0    [[
1383  Nuclei_Location_MaxIntensity_X_AGP    [[
1384    Cells_Intensity_MADIntensity_DNA    [[
1385     Cytoplasm_AreaShape_Zernike_9_9    [[

This gives a consistent result:
> plyr::colwise(median)(featdata1) %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14

> session_info()
Session info --------------------------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.1 (2015-06-18)
 system   x86_64, darwin13.4.0        
 ui       RStudio (0.99.656)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            

Packages ------------------------------------------------------------------------------------------------------------------------------------------------------------------
 package        * version date       source        
 assertthat       0.1     2013-12-06 CRAN (R 3.2.0)
 codetools        0.2-14  2015-07-15 CRAN (R 3.2.0)
 colorspace       1.2-6   2015-03-11 CRAN (R 3.2.0)
 crayon           1.3.1   2015-07-13 CRAN (R 3.2.0)
 curl             0.9.1   2015-07-04 CRAN (R 3.2.0)
 DBI              0.3.1   2014-09-24 CRAN (R 3.2.0)
 devtools       * 1.8.0   2015-05-09 CRAN (R 3.2.0)
 digest           0.6.8   2014-12-31 CRAN (R 3.2.0)
 dplyr          * 0.4.2   2015-06-16 CRAN (R 3.2.0)
 futile.logger    1.4.1   2015-04-20 CRAN (R 3.2.0)
 futile.options   1.0.0   2010-04-06 CRAN (R 3.2.0)
 ggplot2        * 1.0.1   2015-03-17 CRAN (R 3.2.0)
 git2r            0.10.1  2015-05-07 CRAN (R 3.2.0)
 gtable           0.1.2   2012-12-05 CRAN (R 3.2.0)
 jsonlite         0.9.16  2015-04-11 CRAN (R 3.2.0)
 knitr            1.10.5  2015-05-06 CRAN (R 3.2.0)
 lambda.r         1.1.7   2015-03-20 CRAN (R 3.2.0)
 lazyeval         0.1.10  2015-01-02 CRAN (R 3.2.0)
 magrittr       * 1.5     2014-11-22 CRAN (R 3.2.0)
 MASS             7.3-43  2015-07-16 CRAN (R 3.2.0)
 memoise          0.2.1   2014-04-22 CRAN (R 3.2.0)
 munsell          0.4.2   2013-07-11 CRAN (R 3.2.0)
 plyr             1.8.3   2015-06-12 CRAN (R 3.2.0)
 proto            0.3-10  2012-12-22 CRAN (R 3.2.0)
 pryr             0.1.2   2015-06-20 CRAN (R 3.2.0)
 R6               2.1.0   2015-07-04 CRAN (R 3.2.0)
 Rcpp             0.12.0  2015-07-25 CRAN (R 3.2.0)
 reshape2         1.4.1   2014-12-06 CRAN (R 3.2.0)
 roxygen2         4.1.1   2015-04-15 CRAN (R 3.2.0)
 rstudioapi       0.3.1   2015-04-07 CRAN (R 3.2.0)
 rversions        1.0.2   2015-07-13 CRAN (R 3.2.0)
 scales           0.2.5   2015-06-12 CRAN (R 3.2.0)
 stringi          0.5-5   2015-06-29 CRAN (R 3.2.0)
 stringr        * 1.0.0   2015-04-30 CRAN (R 3.2.0)
 testthat       * 0.10.0  2015-05-22 CRAN (R 3.2.0)
 tidyr            0.2.0   2014-12-05 CRAN (R 3.2.0)
 xml2             0.1.1   2015-06-02 CRAN (R 3.2.0)
 yaml             2.1.13  2014-06-12 CRAN (R 3.2.0)

romainfrancois commented 9 years ago

Can you try against the dev version, please?

romainfrancois commented 9 years ago

I get consistent results from the dev version.

> featdata1 <- readRDS( "/tmp/featdata1.rds")
>
>
> featdata1 %>% dplyr::summarise_each(funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14
> featdata1 %>% dplyr::summarise_each(funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14
> featdata1 %>% dplyr::summarise_each(funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14
> featdata1 %>% dplyr::summarise_each(funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14
> featdata1 %>% dplyr::summarise_each(funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14

Please try the dev version and reopen if you still have the problem.

shntnu commented 9 years ago

Works now, thanks!

> featdata1 %>% dplyr::summarise_each(dplyr::funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14
> featdata1 %>% dplyr::summarise_each(dplyr::funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14
> featdata1 %>% dplyr::summarise_each(dplyr::funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
                                               feat        value
1380  Cytoplasm_RadialDistribution_FracAtD_RNA_4of4 4.832181e-15
1381                Nuclei_Texture_Entropy_DNA_10_0 4.877080e-15
1382              Nuclei_Texture_SumEntropy_DNA_3_0 6.090831e-15
1383                         Cells_AreaShape_Extent 7.359503e-15
1384 Cytoplasm_RadialDistribution_MeanFrac_RNA_4of4 8.155040e-15
1385    Nuclei_RadialDistribution_MeanFrac_RNA_3of4 1.300921e-14

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5

loaded via a namespace (and not attached):
 [1] plyr_1.8.3           lazyeval_0.1.10.9000 R6_2.1.1             assertthat_0.1       parallel_3.2.1       DBI_0.3.1            tools_3.2.1         
 [8] reshape2_1.4.1       dplyr_0.4.2.9002     Rcpp_0.12.0          stringi_0.5-5        stringr_1.0.0        tidyr_0.2.0

shntnu commented 9 years ago

@romainfrancois Another odd behavior, this time with mutate_each, possibly related? Wrapping featdata1 in tbl_df seems to fix the problem:

> featdata1 %>% dplyr::tbl_df() %>% dplyr::mutate_each(dplyr::funs(scale))  %>% dplyr::summarise_each(dplyr::funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
Source: local data frame [6 x 2]

                                            feat     value
1 Cells_Neighbors_AngleBetweenNeighbors_Adjacent 0.2758451
2               Cells_Texture_SumEntropy_RNA_5_0 0.2801385
3               Cells_Texture_InfoMeas1_DNA_10_0 0.2829963
4           Cytoplasm_Texture_SumEntropy_RNA_3_0 0.2975776
5            Cytoplasm_Texture_InfoMeas1_ER_10_0 0.3183102
6           Cytoplasm_Texture_InfoMeas1_DNA_10_0 0.3569359
> featdata1  %>% dplyr::mutate_each(dplyr::funs(scale))  %>% dplyr::summarise_each(dplyr::funs(median))  %>% tidyr::gather(feat, value) %>% dplyr::arrange(value) %>% tail()
Error: data_frames can only contain 1d atomic vectors and lists
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5

loaded via a namespace (and not attached):
 [1] plyr_1.8.3           lazyeval_0.1.10.9000 R6_2.1.1             assertthat_0.1       parallel_3.2.1       DBI_0.3.1            tools_3.2.1         
 [8] reshape2_1.4.1       dplyr_0.4.2.9002     Rcpp_0.12.0          stringi_0.5-5        stringr_1.0.0        tidyr_0.2.0

shntnu commented 9 years ago

@romainfrancois Another instance of the same problem:

> featdata1  %>% dplyr::select(1:2) %>% dplyr::mutate_each(dplyr::funs(scale))  %>% dplyr::summarise_each(dplyr::funs(median))
Error: data_frames can only contain 1d atomic vectors and lists
> featdata1  %>% dplyr::select(1:2) %>% dplyr::mutate_each(dplyr::funs(scale))  %>% {plyr::colwise(median)(.)}
  Nuclei_Texture_InverseDifferenceMoment_Mito_5_0 Cytoplasm_Intensity_MADIntensity_Mito
1                                      0.07450019                            -0.1938028
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5

loaded via a namespace (and not attached):
[1] plyr_1.8.3           lazyeval_0.1.10.9000 R6_2.1.1             assertthat_0.1       parallel_3.2.1       DBI_0.3.1            tools_3.2.1          dplyr_0.4.2.9002    
[9] Rcpp_0.12.0

hadley commented 9 years ago

scale() makes a matrix...
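
A minimal base R sketch of what that means, using a made-up vector `x` and a made-up data frame `df` rather than featdata1: `scale()` returns a one-column matrix even when given a plain numeric vector, so a column built from its result is no longer a 1d atomic vector. Converting the result with `as.numeric()` is one possible way to get back to an ordinary vector; this is an illustration, not necessarily the fix anyone in this thread used.

```r
# Hypothetical example data (not from featdata1)
x <- c(1, 2, 3, 4, 5)

# scale() centers and scales, but the result is a 5 x 1 matrix
scaled <- scale(x)
is.matrix(scaled)
dim(scaled)

# as.numeric() drops the dim and scaling attributes, leaving a plain vector
scaled_vec <- as.numeric(scaled)
is.matrix(scaled_vec)
is.null(dim(scaled_vec))

# Applying the same idea column by column to a data frame
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))
df[] <- lapply(df, function(col) as.numeric(scale(col)))

# Every column is now a 1d numeric vector, so column-wise medians are well defined
medians <- sapply(df, median)
```

Here `df[] <- lapply(...)` keeps the data frame structure while replacing each column, which avoids storing matrix columns in the first place.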

shntnu commented 9 years ago

@hadley I missed that, thanks! Does this make a case for using dplyr::tbl_df for more than just pretty printing? (Note that using tbl_df(featdata1) instead of plain featdata1 makes it OK to use scale the way I do above.)