tidyverse / readxl

Read excel files (.xls and .xlsx) into R 🖇
https://readxl.tidyverse.org
Other
729 stars 194 forks source link

Potential memory leak with read_excel() on Windows 7? #545

Open matthiasgomolka opened 5 years ago

matthiasgomolka commented 5 years ago

I read a few larger Excel files (several hundreds of MB's) and noticed that the RAM usage almost instantly jumped to 100%. This is unexpected since that machine has 32GB of RAM and only ~ 5GB were used before calling read_excel().

Thus, I reproduced the problem on a smaller scale. The example xlsx file is attached in the zip file. When I read that file with r readxl::read_excel("readxl_test.xlsx") I see in the Windows task manager that RAM usage jumps about 2GB higher during the process. However, profiling the call shows that only ~ 240 MB were used. For small files, this is no problem, for large ones, it is.

Here is the reprex including the sessionInfo() ...

set.seed(20190130)

# generate example data
df <- data.frame(A = rnorm(1000000),
                 B = rnorm(1000000),
                 C = rnorm(1000000),
                 D = rnorm(1000000))

# write excel file
openxlsx::write.xlsx(df, "readxl_test.xlsx")

# read excel file
readxl::read_excel("readxl_test.xlsx")
#> # A tibble: 1,000,000 x 4
#>         A       B      C       D
#>     <dbl>   <dbl>  <dbl>   <dbl>
#>  1 -0.141 -1.85    0.410 -1.38  
#>  2  1.45   0.397  -1.40  -0.147 
#>  3 -0.456  0.164  -0.744 -0.0907
#>  4  2.02   0.492   0.157  1.05  
#>  5  2.14  -1.87   -0.946  1.80  
#>  6  3.58   0.502   1.05   1.50  
#>  7  1.34   0.509  -0.627  1.00  
#>  8 -1.82   1.65    1.03  -0.685 
#>  9  0.810  0.0659  0.156 -1.91  
#> 10  0.711  0.393   0.581 -0.543 
#> # ... with 999,990 more rows

# clean up
file.remove("readxl_test.xlsx")
#> [1] TRUE

Created on 2019-01-30 by the reprex package (v0.2.0).

Session info ``` r devtools::session_info() #> - Session info ---------------------------------------------------------- #> setting value #> version R version 3.5.0 (2018-04-23) #> os Windows 7 x64 SP 1 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate German_Germany.1252 #> ctype German_Germany.1252 #> tz Europe/Berlin #> date 2019-01-30 #> #> - Packages -------------------------------------------------------------- #> package * version date lib source #> assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1) #> backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.0) #> callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.2) #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.1) #> cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1) #> devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.1) #> digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1) #> evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1) #> fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.1) #> fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.1) #> glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1) #> highr 0.7 2018-06-09 [1] CRAN (R 3.5.1) #> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.1) #> knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2) #> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.1) #> openxlsx 4.1.0 2018-05-26 [1] CRAN (R 3.5.1) #> pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2) #> pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1) #> pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1) #> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.1) #> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1) #> processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.2) #> ps 1.2.1 2018-11-06 [1] CRAN (R 3.5.1) #> R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1) #> Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2) #> readxl 1.1.0 2018-04-20 [1] CRAN (R 3.5.1) #> remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.1) #> rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) #> rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.2) #> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1) #> stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1) #> stringr 1.3.1 2018-05-10 [1] CRAN (R 3.5.1) #> testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.1) #> tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) #> usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.1) #> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.1) #> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1) #> xfun 0.4 2018-10-23 [1] CRAN (R 3.5.1) #> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1) #> zip 1.0.0 2017-04-25 [1] CRAN (R 3.5.1) #> #> [1] C:/Program Files/R/library ```

... here is the result from profvis ...

readxl_test.Rprofvis.zip

... and here a screenshot showing the RAM usage jump from 4.5 GB RAM usage to 6.6 GB:

image

jennybc commented 5 years ago

A little investigation on my machine (I'll have to come back to this)

#set.seed(20190130)

# generate example data
# df <- data.frame(A = rnorm(1000000),
#                  B = rnorm(1000000),
#                  C = rnorm(1000000),
#                  D = rnorm(1000000))

# write excel file
#openxlsx::write.xlsx(df, "investigations/readxl_test.xlsx")

library(memuse)
library(readxl)

memuse::Sys.procmem()
#> Size:  75.227 MiB
x <- read_excel(readxl_example("datasets.xlsx"))
memuse::Sys.procmem()
#> Size:  94.359 MiB

x <- read_excel("~/rrr/readxl/investigations/readxl_test.xlsx")
memuse::Sys.procmem()
#> Size:  371.789 MiB
gc(FALSE)
#>           used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells  531061 28.4    1068411  57.1         NA  1068411  57.1
#> Vcells 5022744 38.4   18449507 140.8      16384 32284119 246.4
memuse::Sys.procmem()
#> Size:  371.816 MiB
rm(x)
memuse::Sys.procmem()
#> Size:  371.816 MiB
gc(FALSE)
#>           used (Mb) gc trigger (Mb) limit (Mb) max used  (Mb)
#> Ncells  531335 28.4    1068411 57.1         NA  1068411  57.1
#> Vcells 1023310  7.9   11807684 90.1      16384 32284119 246.4
memuse::Sys.procmem()
#> Size:  363.844 MiB

Created on 2019-01-30 by the reprex package (v0.2.1)

camroberts commented 5 years ago

I am experiencing this issue too. When reading a sheet from a large xlsx file my RAM is exhausted.

The size of the file on disk is ~175MB and the resultant tibble takes 353MB but the R session expands to 13GB during the read.

> memuse::Sys.procmem()
Size:  225.398 MiB 
Peak:  226.301 MiB 
>   wsws <- read_excel(file.path(data_path, paste0(proj, "-wsws.xlsx")), 
+                      sheet="master_data", col_types="text")
> memuse::Sys.procmem()
Size:  648.625 MiB 
Peak:   13.180 GiB 
> format(object.size(wsws), units="MiB", standard="IEC")
[1] "353.5 MiB"

Session info:

- Session info ----------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.3 (2019-03-11)
 os       Windows 7 x64 SP 1          
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_Australia.1252      
 ctype    English_Australia.1252      
 tz       Australia/Brisbane          
 date     2019-05-13                  

- Packages --------------------------------------------------------------------------------------------------------------------
 package        * version date       lib source        
 assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.5.3)
 backports        1.1.3   2018-12-14 [1] CRAN (R 3.5.2)
 broom            0.5.1   2018-12-05 [1] CRAN (R 3.5.2)
 callr            3.2.0   2019-03-15 [1] CRAN (R 3.5.3)
 cellranger       1.1.0   2016-07-27 [1] CRAN (R 3.5.1)
 cli              1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
 colorspace       1.4-1   2019-03-18 [1] CRAN (R 3.5.3)
 crayon           1.3.4   2017-09-16 [1] CRAN (R 3.5.1)
 desc             1.2.0   2018-05-01 [1] CRAN (R 3.5.1)
 devtools         2.0.1   2018-10-26 [1] CRAN (R 3.5.2)
 digest           0.6.18  2018-10-10 [1] CRAN (R 3.5.2)
 dplyr          * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.3)
 forcats        * 0.4.0   2019-02-17 [1] CRAN (R 3.5.3)
 formatR          1.6     2019-03-05 [1] CRAN (R 3.5.3)
 fs               1.2.7   2019-03-19 [1] CRAN (R 3.5.3)
 futile.logger  * 1.4.3   2016-07-10 [1] CRAN (R 3.5.1)
 futile.options   1.0.1   2018-04-20 [1] CRAN (R 3.5.0)
 generics         0.0.2   2018-11-29 [1] CRAN (R 3.5.2)
 ggplot2        * 3.1.0   2018-10-25 [1] CRAN (R 3.5.2)
 glue             1.3.1   2019-03-12 [1] CRAN (R 3.5.3)
 gtable           0.3.0   2019-03-25 [1] CRAN (R 3.5.3)
 haven            2.1.0   2019-02-19 [1] CRAN (R 3.5.3)
 here             0.1     2017-05-28 [1] CRAN (R 3.5.1)
 hms              0.4.2   2018-03-10 [1] CRAN (R 3.5.1)
 httr             1.4.0   2018-12-11 [1] CRAN (R 3.5.2)
 jsonlite         1.6     2018-12-07 [1] CRAN (R 3.5.2)
 lambda.r         1.2.3   2018-05-17 [1] CRAN (R 3.5.1)
 lattice          0.20-38 2018-11-04 [1] CRAN (R 3.5.2)
 lazyeval         0.2.2   2019-03-15 [1] CRAN (R 3.5.3)
 lubridate      * 1.7.4   2018-04-11 [1] CRAN (R 3.5.1)
 magrittr         1.5     2014-11-22 [1] CRAN (R 3.5.1)
 memoise          1.1.0   2017-04-21 [1] CRAN (R 3.5.1)
 memuse           4.0-0   2017-11-10 [1] CRAN (R 3.5.2)
 modelr           0.1.4   2019-02-18 [1] CRAN (R 3.5.3)
 munsell          0.5.0   2018-06-12 [1] CRAN (R 3.5.1)
 nlme             3.1-137 2018-04-07 [1] CRAN (R 3.5.2)
 ojay           * 0.1.0   2019-05-07 [1] local         
 pillar           1.3.1   2018-12-15 [1] CRAN (R 3.5.2)
 pkgbuild         1.0.3   2019-03-20 [1] CRAN (R 3.5.3)
 pkgconfig        2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
 pkgload          1.0.2   2018-10-29 [1] CRAN (R 3.5.2)
 plyr             1.8.4   2016-06-08 [1] CRAN (R 3.5.1)
 prettyunits      1.0.2   2015-07-13 [1] CRAN (R 3.5.1)
 processx         3.3.0   2019-03-10 [1] CRAN (R 3.5.2)
 ps               1.3.0   2018-12-21 [1] CRAN (R 3.5.2)
 purrr          * 0.3.2   2019-03-15 [1] CRAN (R 3.5.3)
 R6               2.4.0   2019-02-14 [1] CRAN (R 3.5.3)
 Rcpp             1.0.1   2019-03-17 [1] CRAN (R 3.5.3)
 readr          * 1.3.1   2018-12-21 [1] CRAN (R 3.5.2)
 readxl         * 1.3.1   2019-03-13 [1] CRAN (R 3.5.3)
 remotes          2.0.2   2018-10-30 [1] CRAN (R 3.5.3)
 rlang            0.3.3   2019-03-29 [1] CRAN (R 3.5.3)
 rprojroot        1.3-2   2018-01-03 [1] CRAN (R 3.5.1)
 rstudioapi       0.10    2019-03-19 [1] CRAN (R 3.5.3)
 rvest            0.3.2   2016-06-17 [1] CRAN (R 3.5.1)
 scales           1.0.0   2018-08-09 [1] CRAN (R 3.5.1)
 sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.5.2)
 stringi          1.4.3   2019-03-12 [1] CRAN (R 3.5.3)
 stringr        * 1.4.0   2019-02-10 [1] CRAN (R 3.5.3)
 testthat         2.0.1   2018-10-13 [1] CRAN (R 3.5.3)
 tibble         * 2.1.1   2019-03-16 [1] CRAN (R 3.5.3)
 tidyr          * 0.8.3   2019-03-01 [1] CRAN (R 3.5.3)
 tidyselect       0.2.5   2018-10-11 [1] CRAN (R 3.5.2)
 tidyverse      * 1.2.1   2017-11-14 [1] CRAN (R 3.5.2)
 usethis          1.4.0   2018-08-14 [1] CRAN (R 3.5.2)
 withr            2.1.2   2018-03-15 [1] CRAN (R 3.5.1)
 wsws           * 0.1.0   2019-05-02 [1] local         
 xml2             1.2.0   2018-01-24 [1] CRAN (R 3.5.1)

[1] C:/Users/robecd/AppData/Local/R/R-3.5.2/library
[2] C:/Users/robecd/AppData/Local/spark/spark-2.3.2-bin-hadoop2.7/R/lib
[3] C:/Users/robecd/AppData/Local/R/R-3.5.3/library
krlmlr commented 3 years ago

I suspect that memuse doesn't report peak memory accurately.

On Linux, I see with an Excel file of 1 MB (!!!):

$ R -q -e 'invisible(unix::rlimit_as(210e6)); readxl::read_xlsx'
# out of memory

$ R -q -e 'invisible(unix::rlimit_as(220e6)); readxl::read_xlsx'
# works

$ R -q -e 'invisible(unix::rlimit_as(430e6)); readxl::read_xlsx("file.xlsx", skip = 8); memuse::Sys.procmem()'
Error: std::bad_alloc
Execution halted

$ R -q -e 'invisible(unix::rlimit_as(440e6)); invisible(readxl::read_xlsx("file.xlsx", skip = 8)); memuse::Sys.procmem()'
Size:  248.137 MiB 
Peak:  275.250 MiB 

(This requires Linux and won't work on the Mac because rlimit_as() is not supported there.)

To translate:

Unfortunately, this scales with the size of the Excel file -- it seems that the required working set is a whopping 200x the size of the Excel file. This makes it impossible to work with anything but tiny Excel files in memory-constrained environments.

Does the bundled libxls library support a somewhat more conservative approach to memory usage?

jennybc commented 3 years ago

@lionel- This is a juicy one for you and memtools 😁

lionel- commented 3 years ago

Unfortunately, since the gc() calls do not indicate excess size retained after the readxl evaluations, it looks like the memory is created / retained outside of R, probably in the bundled libraries. So this probably requires C-level debugging tools with which I'm not familiar. memtools can only help with leaks in the R heap.

bknakker commented 1 year ago

I think this is still happening as of current dev version ‘1.4.3.9000’ with R version 4.1.0 (2021-05-18) on x86_64-apple-darwin17.0, running in Rstudio 2023.6.0.421. I have a not-that large (3.8 MB) excel file which I load every x seconds in an infinite cycle. Regardless of e.g. rm-ing the variable into which I load the excel file or not, the readxl call consumes 3-10 MB of memory so that it is irrecoverable to the R session. As a result of this, if this daemon I have runs for days, it practically can start to use up all the memory on the machine. Interestingly, I think, this issue surfaced during the previous month, I think with the CRAN version of the package (and still persists with the above mentioned dev version).

Madpentiste commented 8 months ago

I had the same issue, but with write_excel() on Linux. That happened with very similar configuration : several hundreds of MB's to process, 32GB RAM Further more, this was in spite of Linux OS which is supposedly using swap memory on hard drive when RAM fills up. Freeing memory with gc() immediately before write_excel() did not help, same was true for removal of all R objects not necessary for executing write_excel() (to the expense of unpracticality afterwards).

Importantly, when I used write.excel() from the openxlsx package, instead of write_excel() from openxlsx2 package, there was no issue : memory usage hardly jumped to few GB, and went back to slightly above initial level afterwards. In other words, there is something in openxlsx2::write_excel() that does not exist in openxlsx::write.excel() that makes it eating up all RAM and crash.

My environment :

I hope this comment will help tracking down the bug.