sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0

spark_apply returns Error in file(con, "r").... Permission denied #2472

Open vincenzzimmer opened 4 years ago

vincenzzimmer commented 4 years ago

When I try connecting to Spark master (standalone) via

sc <- spark_connect(master = "spark://myip:7077")

I get the following error:

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file 'C:\Users\ADM_DE~2\AppData\Local\Temp\2\RtmpEH4UkI\file1c6c243e351dspark.log': Permission denied

I have already read plenty of posts/issues on GitHub and Stack Overflow related to this kind of error message. I tried running the code from RStudio, Rterm, and with spark-submit. I tried running it as administrator and also installed the latest version of sparklyr from GitHub. I even uninstalled my complete R installation to install everything from scratch (now R 4.0.0; before R 3.6.1). Now I don't know what else I could try and am therefore rather convinced that this must be a bug.

Here some information on my environment:

I have a standalone Spark installation running on a Windows Server 2016 machine, with master and worker on the same machine. The Spark version is spark-2.4.4-bin-without-hadoop along with hadoop-2.8.2 (to connect Spark with MinIO, according to link).

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sparklyr_1.2.0.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      rstudioapi_0.11   magrittr_1.5      tidyselect_1.0.0  R6_2.4.1          rlang_0.4.6      
 [7] httr_1.4.1        dplyr_0.8.5       tools_4.0.0       parallel_4.0.0    config_0.3        DBI_1.1.0        
[13] withr_2.2.0       dbplyr_1.4.3      askpass_1.1       htmltools_0.4.0   ellipsis_0.3.0    openssl_1.4.1    
[19] yaml_2.2.1        assertthat_0.2.1  rprojroot_1.3-2   digest_0.6.25     tibble_3.0.1      lifecycle_0.2.0  
[25] forge_0.2.0       crayon_1.3.4      purrr_0.3.4       base64enc_0.1-3   htmlwidgets_1.5.1 vctrs_0.2.4      
[31] glue_1.4.0        compiler_4.0.0    pillar_1.4.4      generics_0.0.2    r2d3_0.2.3        backports_1.1.6  
[37] jsonlite_1.6.1    pkgconfig_2.0.3  

yitao-li commented 4 years ago

@vincenzzimmer It might be worthwhile to try changing your temp directory location and see whether that helps (e.g., something like https://www.howtogeek.com/285710/how-to-move-windows-temporary-folders-to-another-drive). I don't have much experience on Windows though, so for now that's all I can think of. Also, you can check the permission attributes of your current temp directory to see if any read/write permission differs from normal.
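
For example, a quick check from within R of which temp directory the session is using and whether it is writable (a minimal sketch using only base R):

# Inspect the temp directory the current R session (and thus sparklyr) writes to
tempdir()

# 0 means writable for the current user, -1 means not writable
file.access(tempdir(), mode = 2)

# Try actually creating a file there; should return TRUE
file.create(file.path(tempdir(), "write_test.log"))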

yitao-li commented 4 years ago

Alternatively, you can also try putting the following

TMPDIR=<writable location>
TMP=<writable location>
TEMP=<writable location>

in your Renviron.site config file, then restart RStudio for the change to take effect.
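
Note that tempdir() is fixed at session startup, so these variables must be in place before R launches. A minimal sketch for locating the file and verifying the change afterwards (the exact path depends on your R installation):

# The site-wide Renviron file normally lives under R_HOME/etc
file.path(R.home("etc"), "Renviron.site")

# After restarting RStudio, confirm the variables were picked up
Sys.getenv(c("TMPDIR", "TMP", "TEMP"))
tempdir()  # should now point at the writable location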

Meanwhile, because I have never seen issues like this on macOS or Linux, I suspect it's an OS-specific problem.

vincenzzimmer commented 4 years ago

Thank you for the quick response. I have set the TMP and TEMP environment variables to another user-independent folder (D:\temp) and gave my user and the user running the Spark services full permissions. The error message is still the same.

I looked for the log file at the location shown in the message, and it was there. I could not open it while RStudio was running ([screenshot of the file-access error]). After closing RStudio, I could open it without any issues. Maybe the problem is not a permission issue but rather the log being blocked by a process accessing it in parallel.

I am sure that this is an OS-specific issue. Unfortunately, I have no Linux machine available yet.

yitao-li commented 4 years ago

@vincenzzimmer Wow, thanks for posting that error message :) I never realized two processes cannot access the log file at the same time on Windows, even though one of them only opens it read-only (i.e., file(con, "r")).

That at least gives me more clues on how to possibly work around this type of problem.
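
For instance, one conceivable workaround (a sketch only, not sparklyr's actual implementation; read_log_with_retry is a hypothetical helper) would be to retry the read instead of failing on the first locked attempt:

# Hypothetical helper: retry reading a log file that another process may be
# holding with an exclusive lock on Windows, instead of failing immediately
read_log_with_retry <- function(path, attempts = 5, wait_secs = 0.5) {
  for (i in seq_len(attempts)) {
    lines <- suppressWarnings(
      tryCatch(readLines(path), error = function(e) NULL)
    )
    if (!is.null(lines)) return(lines)
    Sys.sleep(wait_secs)
  }
  stop("could not read log file after ", attempts, " attempts: ", path)
}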

vincenzzimmer commented 4 years ago

@yl790 Thank you for your support. Please tell me if I can help with more information or by running some tests.

not4everybody commented 3 years ago

The problem still persists. I'm facing it on Windows 10 with R 3.6.3, RStudio 1.3.1093, and sparklyr 1.4.0. I noticed that the *_spark.log file gets created without problems in a folder where I have read/write permissions, but as soon as sparklyr/spark_connect tries to open this file, I get the permission-denied error. I can also confirm what vincenzzimmer wrote: I can delete the files as soon as I close RStudio. On my Linux system (Ubuntu), it works completely fine.

Edit: I just tried it with sparklyr 1.2.0 and RStudio 1.2.5033, and now it works again.

yitao-li commented 3 years ago

@not4everybody I think there is definitely some weird OS-specific race condition happening.

As far as I can tell, neither sparklyr nor Apache Spark implements anything OS-specific when it comes to creating _spark.log and redirecting Spark log entries to that file, so the fact that this only happens on Windows really puzzles me. Also, nothing has changed in how the _spark.log file is created; the implementation has been the same from sparklyr 1.2 to 1.4.
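
For context, the mechanism under discussion is roughly the following (a simplified sketch, not sparklyr's exact code; it assumes spark-submit is on the PATH):

# spark-submit's output is redirected to a temp log file, which the R
# session then opens read-only to surface startup errors -- the
# file(con, "r") call that produces "Permission denied" on Windows
log_file <- tempfile(fileext = "_spark.log")
system2("spark-submit", args = "--version",
        stdout = log_file, stderr = log_file)
con <- file(log_file, "r")  # the step that fails on Windows
log_lines <- readLines(con)
close(con)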

I'll let you know if I find some possible fix or workaround for this problem.

stlibest commented 1 year ago

I have the same error. Any update? Thanks. More information:

22/11/08 16:08:06 ERROR sparklyr: Gateway (5142) failed calling getOrCreate on 8: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x67c33749) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x67c33749
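
Note that this IllegalAccessError is a different failure from the original permission issue: it usually means Spark is running on Java 17+, which no longer exports sun.nio.ch to unnamed modules, and Spark versions before 3.3 do not handle that. A minimal sketch of two common remedies (the JDK path below is a hypothetical example; sparklyr.shell.* config entries are forwarded to spark-submit):

library(sparklyr)

# Option 1: point R at a Java version the Spark build supports
# (Java 8 or 11 for Spark < 3.3); this path is a hypothetical example
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk-11")

# Option 2: stay on Java 17 but re-open the module to Spark, passed
# through to spark-submit's --driver-java-options flag
config <- spark_config()
config[["sparklyr.shell.driver-java-options"]] <-
  "--add-exports=java.base/sun.nio.ch=ALL-UNNAMED"
sc <- spark_connect(master = "local", config = config)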

schuemie commented 1 year ago

I have the same error:

connection <- sparklyr::spark_connect(master = "spark://<ip address>:<port number>",
                                      spark_home = "C:/Users/admin_mschuemi/AppData/Local/spark/spark-2.4.3-bin-hadoop2.7")
# Error in file(con, "r") : cannot open the connection
# In addition: Warning message:
# In file(con, "r") :
#   cannot open file 'D:/temp/Rtemp\Rtmp0MzxYf\file1994509d3791_spark.log': Permission denied

I've changed the temp folder settings as proposed, which had no effect. I definitely have write access to that location.

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sparklyr_1.7.8

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13   magrittr_2.0.3    tidyselect_1.1.2  R6_2.5.1          rlang_1.0.2       fastmap_1.1.0     fansi_1.0.3      
 [8] httr_1.4.3        dplyr_1.0.9       tools_4.1.1       parallel_4.1.1    config_0.3.1      utf8_1.2.2        cli_3.3.0        
[15] DBI_1.1.2         withr_2.5.0       dbplyr_2.2.1      askpass_1.1       htmltools_0.5.2   ellipsis_0.3.2    openssl_2.0.2    
[22] yaml_2.3.5        assertthat_0.2.1  digest_0.6.29     rprojroot_2.0.3   tibble_3.1.7      lifecycle_1.0.1   forge_0.2.0      
[29] crayon_1.5.1      tidyr_1.2.0       purrr_0.3.4       base64enc_0.1-3   htmlwidgets_1.5.4 vctrs_0.4.1       glue_1.6.2       
[36] compiler_4.1.1    pillar_1.7.0      r2d3_0.2.6        generics_0.1.2    jsonlite_1.8.0    pkgconfig_2.0.3  
stlibest commented 1 year ago

This seems to be a known issue on Windows (the TEMP/TMP folder created by Java, as discussed on Stack Overflow). I installed everything on Ubuntu (WSL) instead, and it worked well (R/Python/SQL).