tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Memory usage of `read_csv_chunked()` in conjunction with a gzip compressed file #1200

Closed nbenn closed 3 years ago

nbenn commented 3 years ago

Recently I have been running into `Error: vector memory exhausted (limit reached?)` errors when reading large gzip-compressed .csv files using the chunked API. IIRC, earlier versions of readr would explicitly create a temporary file containing the full uncompressed data, which was then fed into `read_csv_chunked()`.

Judging by the reported memory usage, this no longer seems to be the case. If this change was intentional, I apologize for having missed it, but I could not find any announcement of it in either NEWS or the docs. I also feel this takes away some of the convenience of the chunked API. Of course, it can easily be worked around outside of readr by decompressing files manually beforehand (for example with `R.utils::gunzip()`).
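
A minimal sketch of that manual workaround, assuming the R.utils package is installed. The file contents, the `value > 10` filter, and the `keep_large` helper are hypothetical and only serve to make the example self-contained; a real file would be far larger.

```r
library(readr)

# Hypothetical input: a tiny gzip-compressed CSV written to a temp file.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), con, row.names = FALSE)
close(con)

# Decompress to a plain temporary file first; remove = FALSE keeps the
# original .gz file in place.
csv_path <- tempfile(fileext = ".csv")
R.utils::gunzip(gz_path, destname = csv_path, remove = FALSE)

# Hypothetical per-chunk callback: keep only rows with value > 10.
keep_large <- function(chunk, pos) chunk[chunk$value > 10, ]

# Read the uncompressed file four rows at a time; only the filtered rows
# of each chunk are kept and row-bound into the result.
result <- read_csv_chunked(
  csv_path,
  callback = DataFrameCallback$new(keep_large),
  chunk_size = 4,
  col_types = cols(id = col_integer(), value = col_double())
)

# Clean up the temporary uncompressed copy.
unlink(csv_path)
```

The trade-off is disk space for memory: the uncompressed copy lives on disk only for the duration of the read, and the compressed original is left untouched.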

As it's not straightforward to create an example for this, I'll just add my session info (but I'm happy to provide further information if requested):

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.4 (2021-02-15)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2021-04-27

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat  * 0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)
 cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.4)
 clisymbols    1.2.0   2017-05-21 [1] CRAN (R 4.0.0)
 colorout    * 1.2-2   2020-05-04 [1] Github (jalvesaq/colorout@726d681)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
 desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
 devtools    * 2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
 fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
 fs            1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.2)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.3)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
 memuse        4.1-0   2020-02-17 [1] CRAN (R 4.0.0)
 pillar        1.6.0   2021-04-13 [1] CRAN (R 4.0.2)
 pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
 prettycode  * 1.1.0   2019-12-16 [1] CRAN (R 4.0.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
 processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.2)
 prompt        1.0.0   2020-05-04 [1] Github (gaborcsardi/prompt@b332c42)
 ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.2)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
 readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
 remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
 rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 testthat      3.0.2   2021-02-14 [1] CRAN (R 4.0.2)
 tibble        3.1.1   2021-04-18 [1] CRAN (R 4.0.4)
 usethis     * 2.0.1   2021-02-10 [1] CRAN (R 4.0.2)
 utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
 vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.2)
 withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

jimhester commented 3 years ago

This is likely a duplicate of https://github.com/tidyverse/readr/issues/1161; it is fixed in the devel version of readr but not yet on CRAN.

hsbadr commented 3 years ago

> This is likely a duplicate of #1161, it is fixed in the devel version of readr but not yet on CRAN.

@jimhester On the development version, the vroom backend (2nd edition) reads gzip-compressed files incorrectly, which also increases the object size and memory footprint. This does not happen with the 1st edition, where the data is read correctly, and the problem is easy to catch by specifying the correct column types. Here's an example:

library(readr)

# Read the gzip-compressed CSV straight from the URL with explicit column types
COVID19.BR_Municipality <- read_delim(
  "https://github.com/wcota/covid19br/raw/master/cases-brazil-cities-time.csv.gz",
  delim = ",",
  col_types = cols(
    epi_week = col_integer(),
    date = col_date(format = "%Y-%m-%d"),
    country = col_character(),
    state = col_character(),
    city = col_character(),
    ibgeID = col_character(),
    cod_RegiaoDeSaude = col_character(),
    name_RegiaoDeSaude = col_character(),
    newDeaths = col_integer(),
    deaths = col_integer(),
    newCases = col_integer(),
    totalCases = col_integer(),
    deaths_per_100k_inhabitants = col_double(),
    totalCases_per_100k_inhabitants = col_double(),
    deaths_by_totalCases = col_double(),
    `_source` = col_character(),
    last_info_date = col_date(format = "%Y-%m-%d")
  ),
  lazy = FALSE
)
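
To compare the two editions on the same file, something like the following sketch could be used. It assumes a readr version that provides `with_edition()` (2.0.0 or later), and it reads a tiny, hypothetical local file instead of the remote dataset so that it runs offline; it illustrates the comparison only and does not reproduce the bug.

```r
library(readr)

# Hypothetical stand-in for the remote file: a small gzip-compressed CSV.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(
  c("date,city,newCases", "2021-04-26,A,3", "2021-04-27,B,5"),
  con
)
close(con)

types <- cols(
  date = col_date(format = "%Y-%m-%d"),
  city = col_character(),
  newCases = col_integer()
)

# 1st edition parser, selected only for this one call.
first <- with_edition(1, read_delim(gz_path, delim = ",", col_types = types))

# 2nd edition (vroom backend), read eagerly as in the example above.
second <- read_delim(gz_path, delim = ",", col_types = types, lazy = FALSE)

# Compare the shape and in-memory size of the two results.
print(dim(first))
print(dim(second))
print(object.size(first))
print(object.size(second))
```

If the two results differ in dimensions or column contents for the same file and column specification, that points at the parser rather than the data.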

jimhester commented 3 years ago

@hsbadr, that issue is tracked by https://github.com/r-lib/vroom/issues/331 and should be fixed.

hsbadr commented 3 years ago

> that issue is tracked by r-lib/vroom#331 and should be fixed.

Thanks @jimhester! I confirm that https://github.com/r-lib/vroom/commit/5fc54e61538e32007b9795bc56e6747e1a77d893 fixed the problem. I'll let you know if I run into a related issue.