ropensci / cyphr

:shipit: Humane encryption
https://docs.ropensci.org/cyphr

Is it possible to encrypt `arrow` files using the `cyphr` package? #50

Open marianschmidt opened 2 years ago

marianschmidt commented 2 years ago

Hi, I have been experimenting with the cyphr package and have hit the memory limit with large .RData files. As an alternative, the arrow package supports partitioning large data when writing files. I tried to register a new method for arrow::write_dataset(), but calling it inside cyphr::encrypt() fails with a "Permission denied" error (all of cyphr's other built-in write functions work). A reprex with iris is below.

# packages
library(cyphr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# To do anything we first need a key:
key <- cyphr::key_sodium(sodium::keygen())

# Register new method for arrow::write_dataset()
cyphr::rewrite_register("arrow", "write_dataset", "path")
ls(cyphr:::db)
#>  [1] "arrow::write_dataset" "base::load"           "base::readLines"     
#>  [4] "base::readRDS"        "base::save"           "base::saveRDS"       
#>  [7] "base::writeLines"     "readxl::read_excel"   "readxl::read_xls"    
#> [10] "readxl::read_xlsx"    "utils::read.csv"      "utils::read.csv2"    
#> [13] "utils::read.delim"    "utils::read.delim2"   "utils::read.table"   
#> [16] "utils::write.csv"     "utils::write.csv2"    "utils::write.table"  
#> [19] "writexl::write_xlsx"

# Trying to encrypt with cyphr fails with a "Permission denied" error
cyphr::encrypt(write_dataset(iris, tempfile(), partitioning = c("Species")), 
               key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpKw7PXv\filed4c33d93cd0d4c2d2f10cf'
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]):  cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpKw7PXv\filed4c33d93cd0d4c2d2f10cf'
#> 'Permission denied'

Created on 2022-06-09 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-06-09
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  arrow       * 8.0.0   2022-05-09 [1] CRAN (R 4.2.0)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  cyphr       * 1.1.2   2021-05-17 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr         1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  sodium        1.2.0   2021-10-21 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/ga27jar/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
```

richfitz commented 2 years ago

Hi @marianschmidt; the assumption made by cyphr::encrypt() (and decrypt()) is that each read/write operation will write exactly one file. It looks like with partitioning the arrow::write_dataset call is creating three files, one per partition, and that breaks the model. I don't think that this is easily worked around with the simple call-rewriting approach that cyphr uses, because the logic around partitioned reads and writes happens in compiled code in that package.
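A minimal sketch of one way to stay within that one-file-per-call model (an assumption for illustration, not a method confirmed in this thread): write the dataset unencrypted to a scratch directory, then encrypt each resulting file separately with `cyphr::encrypt_file()`. The directory names `plain_dir`, `enc_dir` and `dec_dir` are hypothetical, and the read step assumes dplyr is installed for `collect()`.

```r
library(cyphr)
library(arrow)

key <- cyphr::key_sodium(sodium::keygen())

# Write the partitioned dataset unencrypted to a scratch directory first.
plain_dir <- tempfile("plain_")
arrow::write_dataset(iris, plain_dir, partitioning = "Species")

# Encrypt each file on its own, mirroring the partition directory layout.
enc_dir <- tempfile("enc_")
files <- list.files(plain_dir, recursive = TRUE)
for (f in files) {
  dest <- file.path(enc_dir, f)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  cyphr::encrypt_file(file.path(plain_dir, f), key, dest)
}

# Remove the plaintext copy once the encrypted files exist.
unlink(plain_dir, recursive = TRUE)

# To read: decrypt into a fresh scratch directory, then open it as a dataset.
dec_dir <- tempfile("dec_")
for (f in files) {
  dest <- file.path(dec_dir, f)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  cyphr::decrypt_file(file.path(enc_dir, f), key, dest)
}
restored <- dplyr::collect(arrow::open_dataset(dec_dir))
```

Note that this approach leaves plaintext on disk temporarily, both when writing and when reading, so the scratch directories should live somewhere acceptable for that and be removed afterwards.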

Options here are:

marianschmidt commented 2 years ago

Hi @richfitz; Thanks a lot for your prompt reply and sharing possible solutions.

  1. Unfortunately, I think the problem is not limited to partitioned arrow files, since the unpartitioned case also fails (see reprex below).

  2. Possible solutions:

# packages
library(cyphr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# To do anything we first need a key:
key <- cyphr::key_sodium(sodium::keygen())

# Register new method for arrow::write_dataset()
cyphr::rewrite_register("arrow", "write_dataset", "path")
ls(cyphr:::db)
#>  [1] "arrow::write_dataset" "base::load"           "base::readLines"     
#>  [4] "base::readRDS"        "base::save"           "base::saveRDS"       
#>  [7] "base::writeLines"     "readxl::read_excel"   "readxl::read_xls"    
#> [10] "readxl::read_xlsx"    "utils::read.csv"      "utils::read.csv2"    
#> [13] "utils::read.delim"    "utils::read.delim2"   "utils::read.table"   
#> [16] "utils::write.csv"     "utils::write.csv2"    "utils::write.table"  
#> [19] "writexl::write_xlsx"

# arrow::write_dataset() without encryption is working 
# both for partitioned and unpartitioned parquet files
arrow::write_dataset(iris, "myfile_arrow_part", partitioning = c("Species"))
list.files("myfile_arrow_part", recursive = TRUE)
#> [1] "Species=setosa/part-0.parquet"     "Species=versicolor/part-0.parquet"
#> [3] "Species=virginica/part-0.parquet"
arrow::write_dataset(iris, "myfile_arrow")
list.files("myfile_arrow")
#> [1] "part-0.parquet"

# Trying to encrypt with cyphr fails with a "Permission denied" error
cyphr::encrypt(write_dataset(iris, "myfile_encrypt_part", partitioning = c("Species")), 
               key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt_part20f83dde1b0f':
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]): cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt_part20f83dde1b0f',
#> reason 'Permission denied'

# The problem persists when writing small data without partitioning
cyphr::encrypt(write_dataset(iris, "myfile_encrypt"),
               key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt20f844c072c':
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]): cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt20f844c072c',
#> reason 'Permission denied'

Created on 2022-06-10 by the reprex package (v2.0.1)
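
The unencrypted listing above shows that `write_dataset()` creates a directory (`myfile_arrow/part-0.parquet`) even without partitioning, so the path cyphr hands to it is never a single file. A minimal sketch for the unpartitioned case, assuming that arrow's single-file functions `write_parquet()` (path argument `sink`) and `read_parquet()` (path argument `file`) can be registered the same way; this has not been confirmed in this thread and is untested on Windows:

```r
library(cyphr)
library(arrow)

key <- cyphr::key_sodium(sodium::keygen())

# Assumption: these single-file functions fit cyphr's one-file-per-call model.
cyphr::rewrite_register("arrow", "write_parquet", "sink")
cyphr::rewrite_register("arrow", "read_parquet", "file")

# Write one encrypted parquet file, then read it back through cyphr.
path <- tempfile(fileext = ".parquet")
cyphr::encrypt(write_parquet(iris, path), key)
restored <- cyphr::decrypt(read_parquet(path), key)
```

This gives up partitioning, so it does not address the original memory concern by itself; large data would have to be split into several such files by hand.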

Session info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.1252
#>  ctype    German_Germany.1252
#>  tz       Europe/Berlin
#>  date     2022-06-10
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  arrow       * 8.0.0   2022-05-09 [1] CRAN (R 4.1.3)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.2)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.2)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.2)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.1.3)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  cyphr       * 1.1.2   2021-05-17 [1] CRAN (R 4.1.2)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr         1.0.9   2022-04-28 [1] CRAN (R 4.1.3)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  sodium        1.2.0   2021-10-21 [1] CRAN (R 4.1.2)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.1.3)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.1.3)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/ga27jar/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------
```