Open marianschmidt opened 2 years ago
Hi @marianschmidt; the assumption made by `cyphr::encrypt()` (and `decrypt()`) is that each read/write operation will write exactly one file. It looks like with partitioning the `arrow::write_dataset` call is creating three files, one per partition, and that breaks the model. I don't think that this is easily worked around with the simple call-rewriting approach that cyphr uses, because the logic around partitioned reads and writes happens in compiled code in that package.
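To make the one-file assumption concrete, here is a rough base-R sketch of the idea; it is not cyphr's actual code. `with_temp_path()` is a hypothetical helper, and the directory-creating writer only stands in for `arrow::write_dataset()` with partitioning (it writes `.rds` files so the example has no dependencies).

```r
# Hypothetical helper (not part of cyphr): run a writer against a temporary
# path, then report what appeared at that path.
with_temp_path <- function(writer) {
  tmp <- tempfile()
  on.exit(unlink(tmp, recursive = TRUE))
  writer(tmp)
  list(
    is_single_file = file.exists(tmp) && !dir.exists(tmp),
    files = if (dir.exists(tmp)) list.files(tmp, recursive = TRUE) else basename(tmp)
  )
}

# A writer that produces exactly one file fits the model.
single <- with_temp_path(function(path) saveRDS(iris, path))

# Stand-in for a partitioned writer: the path becomes a directory holding
# one file per group, so there is no single file to read back and encrypt.
partitioned <- with_temp_path(function(path) {
  dir.create(path)
  for (sp in levels(iris$Species)) {
    part_dir <- file.path(path, paste0("Species=", sp))
    dir.create(part_dir)
    saveRDS(iris[iris$Species == sp, ], file.path(part_dir, "part-0.rds"))
  }
})

single$is_single_file
partitioned$is_single_file
partitioned$files
```

Any wrapper that expects to open the rewritten path as one file has nothing sensible to open in the second case, which is the mismatch described above.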
Options here are:
Hi @richfitz; thanks a lot for your prompt reply and for sharing possible solutions.
Unfortunately, I think the problem is not limited to partitioned arrow files, since writing an unpartitioned dataset also fails (see reprex below).
Possible solutions:
- `arrow` in `cyphr`: Of course, I would be a huge fan of it, as I currently see `cyphr` as the best way to use encrypted files collaboratively within R, and I also see that the arrow data format has a growing fanbase in the R community. But I totally understand that implementing that might take a while.
- `cyphr` currently operates (trying to write out the data as one object, thus hitting the internal R memory limit). I created a separate issue and reprex for that: #51.

# packages
library(cyphr)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
# To do anything we first need a key:
key <- cyphr::key_sodium(sodium::keygen())
# Register new method for arrow::write_dataset()
cyphr::rewrite_register("arrow", "write_dataset", "path")
ls(cyphr:::db)
#> [1] "arrow::write_dataset" "base::load" "base::readLines"
#> [4] "base::readRDS" "base::save" "base::saveRDS"
#> [7] "base::writeLines" "readxl::read_excel" "readxl::read_xls"
#> [10] "readxl::read_xlsx" "utils::read.csv" "utils::read.csv2"
#> [13] "utils::read.delim" "utils::read.delim2" "utils::read.table"
#> [16] "utils::write.csv" "utils::write.csv2" "utils::write.table"
#> [19] "writexl::write_xlsx"
# arrow::write_dataset() without encryption is working
# both for partitioned and unpartitioned parquet files
arrow::write_dataset(iris, "myfile_arrow_part", partitioning = c("Species"))
list.files("myfile_arrow_part", recursive = TRUE)
#> [1] "Species=setosa/part-0.parquet" "Species=versicolor/part-0.parquet"
#> [3] "Species=virginica/part-0.parquet"
arrow::write_dataset(iris, "myfile_arrow")
list.files("myfile_arrow")
#> [1] "part-0.parquet"
# Trying to encrypt with cyphr results in error message of denied permissions
cyphr::encrypt(write_dataset(iris, "myfile_encrypt_part", partitioning = c("Species")),
key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt_part20f83dde1b0f':
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]): cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt_part20f83dde1b0f',
#> reason 'Permission denied'
# This problem persists when writing small data without partitioning
cyphr::encrypt(write_dataset(iris, "myfile_encrypt"),
key)
#> Warning in file(con, "rb"): cannot open file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt20f844c072c':
#> Permission denied
#> Error in file(con, "rb"): cannot open the connection
#> Warning in file.remove(paths[ok]): cannot remove file 'C:
#> \Users\ga27jar\AppData\Local\Temp\RtmpyED7sT\myfile_encrypt20f844c072c',
#> reason 'Permission denied'
Created on 2022-06-10 by the reprex package (v2.0.1)
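One workaround that avoids `cyphr::encrypt()` altogether might be to let arrow write an unencrypted dataset to a scratch directory and then encrypt each resulting file with `cyphr::encrypt_file()`. The sketch below assumes that approach; `write_dataset_encrypted()` and `read_dataset_encrypted()` are hypothetical helpers, not part of either package, and this has not been tested on the Windows setup from the reprex.

```r
library(arrow)
library(cyphr)

key <- cyphr::key_sodium(sodium::keygen())

# Hypothetical helper: write the dataset unencrypted to a scratch directory,
# then encrypt every file into the same relative location under `dest`.
write_dataset_encrypted <- function(data, dest, key, ...) {
  scratch <- tempfile()
  on.exit(unlink(scratch, recursive = TRUE))
  arrow::write_dataset(data, scratch, ...)
  files <- list.files(scratch, recursive = TRUE)
  for (f in files) {
    out <- file.path(dest, f)
    dir.create(dirname(out), recursive = TRUE, showWarnings = FALSE)
    cyphr::encrypt_file(file.path(scratch, f), key, dest = out)
  }
  invisible(files)
}

# Hypothetical counterpart: decrypt every file into a scratch directory and
# read the dataset from there. collect() pulls everything into memory, so
# this reader is only suitable for data that fits in RAM.
read_dataset_encrypted <- function(src, key) {
  scratch <- tempfile()
  on.exit(unlink(scratch, recursive = TRUE))
  for (f in list.files(src, recursive = TRUE)) {
    out <- file.path(scratch, f)
    dir.create(dirname(out), recursive = TRUE, showWarnings = FALSE)
    cyphr::decrypt_file(file.path(src, f), key, dest = out)
  }
  dplyr::collect(arrow::open_dataset(scratch))
}

dest <- tempfile("iris_encrypted_")
written <- write_dataset_encrypted(iris, dest, key, partitioning = "Species")
restored <- read_dataset_encrypted(dest, key)
```

Note that the unencrypted data briefly exists in the scratch directory, and the partition directory names (for example `Species=setosa`) stay readable in `dest`, so this only suits cases where both are acceptable.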
Hi, I have been experimenting with the `cyphr` package and have hit the memory limit with large .RData files. As an alternative, the `arrow` package offers partitioning of large data when writing files. I tried to create a new method for `arrow::write_dataset()`, but using it with `cyphr::encrypt()` results in a permission-denied error (any of the other built-in write functions of cyphr works, however). A reprex with iris is below.

Created on 2022-06-09 by the reprex package (v2.0.1)
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-06-09
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  arrow       * 8.0.0   2022-05-09 [1] CRAN (R 4.2.0)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  cyphr       * 1.1.2   2021-05-17 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr         1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  sodium        1.2.0   2021-10-21 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#>
#>  [1] C:/Users/ga27jar/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```