ropensci / cyphr

:shipit: Humane encryption
https://docs.ropensci.org/cyphr

Memory limit error for `encrypt()` #51

Closed marianschmidt closed 2 years ago

marianschmidt commented 2 years ago

I am currently trying to assess the performance of the cyphr package on large datasets. For data that does not compress well (random strings), cyphr::encrypt() quickly hits a memory limit (10 M rows, with one column holding long strings of 500 characters). The limit appears to be independent of available system RAM and of the OS: I tested with 8 GB, 16 GB, and 32 GB on Windows 10 and with 170 GB on a Linux cluster. In every case saveRDS() ran without problems, but cyphr::encrypt(saveRDS()) threw an error.

In the reprex, about 3.5 GB of RAM are used according to the RStudio memory usage report and writing the unencrypted compressed RDS file takes about 3.3 GB of storage.

This reprex takes about 3 minutes to run on a normal PC.

# packages
library(cyphr)
library(stringi)

# creating a data.frame with long random strings
rows <- 1E7
str_len <- 500 #length of strings
str_n <- 1000  #number of different strings
rand_strings <- stringi::stri_rand_strings(str_n, str_len)

large_data <- data.frame(
  id = 1:rows,
  year = sample(1980:2020, size = rows, replace = TRUE),
  long_str = sample(rand_strings, size = rows, replace = TRUE)
)

# To do anything we first need a key:
key <- cyphr::key_sodium(sodium::keygen())

# Save large file unencrypted to figure out compressed size
# saveRDS(large_data, "myfile.rds")
# fs::file_size("myfile.rds")
# this file is about 3.3 GB when written unencrypted to disk (standard compression of rds)

# Caution: this command runs for about 3-10 minutes before the error is thrown
# Save large data with encryption
cyphr::encrypt(saveRDS(large_data, "myfile_encr.rds"), key)
#> Error in encrypt(msg, key()): lange Vektoren noch nicht unterstützt: memory.c:3887

# English: Error in encrypt(msg, key()): long vectors not supported yet: memory.c:3887

Created on 2022-06-10 by the reprex package (v2.0.1)
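
For context, R reports "long vectors not supported yet" when code that only handles ordinary vectors receives one with more than 2^31 - 1 elements. The base R sketch below compares a byte count against that limit. It assumes (this is not confirmed in the thread) that the data to encrypt is held as a single raw vector of roughly the file's size. `exceeds_long_vector_limit` is a hypothetical helper, not part of cyphr or sodium.

``` r
# Largest length of an ordinary (non-long) R vector: 2^31 - 1 elements.
long_vector_limit <- .Machine$integer.max

# Hypothetical helper: TRUE if a payload of `n_bytes` bytes would need
# a long raw vector (a raw vector uses one element per byte).
exceeds_long_vector_limit <- function(n_bytes) {
  n_bytes > long_vector_limit
}

# Approximate size of the unencrypted file from the reprex (about 3.3 GB).
reprex_bytes <- 3.3e9
exceeds_long_vector_limit(reprex_bytes)
```

For a file already on disk, `file.size("myfile.rds")` could be passed in place of the approximate figure.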

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-06-10
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  cyphr       * 1.1.2   2021-05-17 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  sodium        1.2.0   2021-10-21 [1] CRAN (R 4.2.0)
#>  stringi     * 1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#>
#>  [1] C:/Users/ga27jar/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

richfitz commented 2 years ago

Thanks - I've made a PR into sodium (https://github.com/jeroen/sodium/pull/22) that should fix this issue, I hope. Worth noting that running this may just run your machine out of memory though!

marianschmidt commented 2 years ago

Thanks a lot for fixing this so quickly. I really appreciate your help. Bumping the dependency to sodium >= 1.2.1 would fix this for cyphr.
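
A quick way to see whether a given sodium version meets the minimum named above is sketched below in base R. `sodium_version_ok` is a hypothetical helper, and the 1.2.1 threshold is taken from this comment rather than from the sodium changelog.

``` r
# Hypothetical helper: compares a version string with the minimum sodium
# version suggested in this thread (1.2.1).
sodium_version_ok <- function(installed, required = "1.2.1") {
  package_version(installed) >= package_version(required)
}

# The session info in the original report lists sodium 1.2.0.
sodium_version_ok("1.2.0")
sodium_version_ok("1.2.1")

# If sodium is installed, check the local version directly.
if (requireNamespace("sodium", quietly = TRUE)) {
  sodium_version_ok(as.character(utils::packageVersion("sodium")))
}
```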

richfitz commented 2 years ago

This is on CRAN now (https://github.com/ropensci/cyphr/pull/52)