Open matthiasgomolka opened 5 years ago
A little investigation on my machine (I'll have to come back to this)
#set.seed(20190130)
# generate example data
# df <- data.frame(A = rnorm(1000000),
# B = rnorm(1000000),
# C = rnorm(1000000),
# D = rnorm(1000000))
# write excel file
#openxlsx::write.xlsx(df, "investigations/readxl_test.xlsx")
library(memuse)
library(readxl)
memuse::Sys.procmem()
#> Size: 75.227 MiB
x <- read_excel(readxl_example("datasets.xlsx"))
memuse::Sys.procmem()
#> Size: 94.359 MiB
x <- read_excel("~/rrr/readxl/investigations/readxl_test.xlsx")
memuse::Sys.procmem()
#> Size: 371.789 MiB
gc(FALSE)
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 531061 28.4 1068411 57.1 NA 1068411 57.1
#> Vcells 5022744 38.4 18449507 140.8 16384 32284119 246.4
memuse::Sys.procmem()
#> Size: 371.816 MiB
rm(x)
memuse::Sys.procmem()
#> Size: 371.816 MiB
gc(FALSE)
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 531335 28.4 1068411 57.1 NA 1068411 57.1
#> Vcells 1023310 7.9 11807684 90.1 16384 32284119 246.4
memuse::Sys.procmem()
#> Size: 363.844 MiB
Created on 2019-01-30 by the reprex package (v0.2.1)
I am experiencing this issue too. When reading a sheet from a large xlsx file my RAM is exhausted.
The size of the file on disk is ~175MB and the resultant tibble takes 353MB but the R session expands to 13GB during the read.
> memuse::Sys.procmem()
Size: 225.398 MiB
Peak: 226.301 MiB
> wsws <- read_excel(file.path(data_path, paste0(proj, "-wsws.xlsx")),
+ sheet="master_data", col_types="text")
> memuse::Sys.procmem()
Size: 648.625 MiB
Peak: 13.180 GiB
> format(object.size(wsws), units="MiB", standard="IEC")
[1] "353.5 MiB"
Session info:
- Session info ----------------------------------------------------------------------------------------------------------------
setting value
version R version 3.5.3 (2019-03-11)
os Windows 7 x64 SP 1
system x86_64, mingw32
ui RStudio
language (EN)
collate English_Australia.1252
ctype English_Australia.1252
tz Australia/Brisbane
date 2019-05-13
- Packages --------------------------------------------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.3)
backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.2)
broom 0.5.1 2018-12-05 [1] CRAN (R 3.5.2)
callr 3.2.0 2019-03-15 [1] CRAN (R 3.5.3)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.1)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.3)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.5.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1)
devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2)
dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.3)
forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.5.3)
formatR 1.6 2019-03-05 [1] CRAN (R 3.5.3)
fs 1.2.7 2019-03-19 [1] CRAN (R 3.5.3)
futile.logger * 1.4.3 2016-07-10 [1] CRAN (R 3.5.1)
futile.options 1.0.1 2018-04-20 [1] CRAN (R 3.5.0)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.5.2)
ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.2)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.3)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.5.3)
haven 2.1.0 2019-02-19 [1] CRAN (R 3.5.3)
here 0.1 2017-05-28 [1] CRAN (R 3.5.1)
hms 0.4.2 2018-03-10 [1] CRAN (R 3.5.1)
httr 1.4.0 2018-12-11 [1] CRAN (R 3.5.2)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.2)
lambda.r 1.2.3 2018-05-17 [1] CRAN (R 3.5.1)
lattice 0.20-38 2018-11-04 [1] CRAN (R 3.5.2)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.5.3)
lubridate * 1.7.4 2018-04-11 [1] CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.1)
memuse 4.0-0 2017-11-10 [1] CRAN (R 3.5.2)
modelr 0.1.4 2019-02-18 [1] CRAN (R 3.5.3)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.1)
nlme 3.1-137 2018-04-07 [1] CRAN (R 3.5.2)
ojay * 0.1.0 2019-05-07 [1] local
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2)
pkgbuild 1.0.3 2019-03-20 [1] CRAN (R 3.5.3)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.2)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.2)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.2)
purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.5.3)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.3)
Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.5.3)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.5.2)
readxl * 1.3.1 2019-03-13 [1] CRAN (R 3.5.3)
remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.3)
rlang 0.3.3 2019-03-29 [1] CRAN (R 3.5.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.5.3)
rvest 0.3.2 2016-06-17 [1] CRAN (R 3.5.1)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.5.3)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.3)
testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.3)
tibble * 2.1.1 2019-03-16 [1] CRAN (R 3.5.3)
tidyr * 0.8.3 2019-03-01 [1] CRAN (R 3.5.3)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.2)
tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.5.2)
usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.2)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
wsws * 0.1.0 2019-05-02 [1] local
xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.1)
[1] C:/Users/robecd/AppData/Local/R/R-3.5.2/library
[2] C:/Users/robecd/AppData/Local/spark/spark-2.3.2-bin-hadoop2.7/R/lib
[3] C:/Users/robecd/AppData/Local/R/R-3.5.3/library
I suspect that memuse doesn't report peak memory accurately.
On Linux, I see with an Excel file of 1 MB (!!!):
$ R -q -e 'invisible(unix::rlimit_as(210e6)); readxl::read_xlsx'
# out of memory
$ R -q -e 'invisible(unix::rlimit_as(220e6)); readxl::read_xlsx'
# works
$ R -q -e 'invisible(unix::rlimit_as(430e6)); readxl::read_xlsx("file.xlsx", skip = 8); memuse::Sys.procmem()'
Error: std::bad_alloc
Execution halted
$ R -q -e 'invisible(unix::rlimit_as(440e6)); invisible(readxl::read_xlsx("file.xlsx", skip = 8)); memuse::Sys.procmem()'
Size: 248.137 MiB
Peak: 275.250 MiB
(This requires Linux and won't work on the Mac because rlimit_as()
is not supported there.)
To translate:
Unfortunately, this scales with the size of the Excel file -- it seems that the required working set is a whopping 200x the size of the Excel file. This makes it impossible to work with anything but tiny Excel files in memory-constrained environments.
Does the bundled libxls library support a somewhat more conservative approach to memory usage?
@lionel- This is a juicy one for you and memtools 😁
Unfortunately, since the gc()
calls do not indicate excess size retained after the readxl evaluations, it looks like the memory is created / retained outside of R, probably in the bundled libraries. So this probably requires C-level debugging tools with which I'm not familiar. memtools can only help with leaks in the R heap.
I think this is still happening as of current dev version ‘1.4.3.9000’ with R version 4.1.0 (2021-05-18) on x86_64-apple-darwin17.0, running in Rstudio 2023.6.0.421. I have a not-that large (3.8 MB) excel file which I load every x seconds in an infinite cycle. Regardless of e.g. rm-ing the variable into which I load the excel file or not, the readxl call consumes 3-10 MB of memory so that it is irrecoverable to the R session. As a result of this, if this daemon I have runs for days, it practically can start to use up all the memory on the machine. Interestingly, I think, this issue surfaced during the previous month, I think with the CRAN version of the package (and still persists with the above mentioned dev version).
I had the same issue, but with write_excel() on Linux. That happened with very similar configuration : several hundreds of MB's to process, 32GB RAM Further more, this was in spite of Linux OS which is supposedly using swap memory on hard drive when RAM fills up. Freeing memory with gc() immediately before write_excel() did not help, same was true for removal of all R objects not necessary for executing write_excel() (to the expense of unpracticality afterwards).
Importantly, when I used write.excel() from the openxlsx package, instead of write_excel() from openxlsx2 package, there was no issue : memory usage hardly jumped to few GB, and went back to slightly above initial level afterwards. In other words, there is something in openxlsx2::write_excel() that does not exist in openxlsx::write.excel() that makes it eating up all RAM and crash.
My environment :
I hope this comment will help tracking down the bug.
I read a few larger Excel files (several hundreds of MB's) and noticed that the RAM usage almost instantly jumped to 100%. This is unexpected since that machine has 32GB of RAM and only ~ 5GB were used before calling
read_excel()
.Thus, I reproduced the problem on a smaller scale. The example xlsx file is attached in the zip file. When I read that file with
r readxl::read_excel("readxl_test.xlsx")
I see in the Windows task manager that RAM usage jumps about 2GB higher during the process. However, profiling the call shows that only ~ 240 MB were used. For small files, this is no problem, for large ones, it is.Here is the reprex including the
sessionInfo()
...Created on 2019-01-30 by the reprex package (v0.2.0).
Session info
``` r devtools::session_info() #> - Session info ---------------------------------------------------------- #> setting value #> version R version 3.5.0 (2018-04-23) #> os Windows 7 x64 SP 1 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate German_Germany.1252 #> ctype German_Germany.1252 #> tz Europe/Berlin #> date 2019-01-30 #> #> - Packages -------------------------------------------------------------- #> package * version date lib source #> assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1) #> backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.0) #> callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.2) #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.1) #> cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1) #> devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.1) #> digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1) #> evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1) #> fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.1) #> fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.1) #> glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1) #> highr 0.7 2018-06-09 [1] CRAN (R 3.5.1) #> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.1) #> knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2) #> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.1) #> openxlsx 4.1.0 2018-05-26 [1] CRAN (R 3.5.1) #> pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2) #> pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1) #> pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1) #> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.1) #> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1) #> processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.2) #> ps 1.2.1 2018-11-06 [1] CRAN (R 3.5.1) #> R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1) #> Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2) #> readxl 1.1.0 2018-04-20 [1] CRAN (R 3.5.1) #> remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.1) #> rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2) #> rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.2) #> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1) #> stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1) #> stringr 1.3.1 2018-05-10 [1] CRAN (R 3.5.1) #> testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.1) #> tibble 2.0.1 2019-01-12 [1] CRAN (R 3.5.2) #> usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.1) #> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.1) #> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1) #> xfun 0.4 2018-10-23 [1] CRAN (R 3.5.1) #> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1) #> zip 1.0.0 2017-04-25 [1] CRAN (R 3.5.1) #> #> [1] C:/Program Files/R/library ```... here is the result from profvis ...
readxl_test.Rprofvis.zip
... and here a screenshot showing the RAM usage jump from 4.5 GB RAM usage to 6.6 GB: