kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

SparkR dependencies in SparkApplication #838

Closed. alessioale closed this issue 1 week ago.

alessioale commented 4 years ago

Hello,

I'm trying to launch a SparkR application using the Spark Operator:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sparkr-test-app
spec:
  type: R
  mode: cluster
  image: <my-spark-2.4.4-image>
  imagePullPolicy: IfNotPresent
  mainApplicationFile: 'https://artifactory-url/test-r/main.R'
  sparkVersion: '2.4.4'
  deps:
    files:
      - 'https://artifactory-url/test-r/dep.R'
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-sa
  executor:
    cores: 1
    instances: 3
    memory: 512m

Here are main.R and dep.R:

main.R

library(SparkR)
sparkR.session(appName = "SparkR-DataFrame-example")
source("dep.R")  # relative path: assumes dep.R is in the driver's working directory
Sys.sleep(100)
sparkR.session.stop()

dep.R

localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(localDF)
printSchema(df)
head(df)
createOrReplaceTempView(df, "people")
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagersLocalDF <- collect(teenagers)
print(teenagersLocalDF)

When I launch the application, I get this error:

Java ref type org.apache.spark.sql.SparkSession id 1 
Error in file(filename, "r", encoding = encoding) : 
  cannot open the connection
Calls: source -> file
In addition: Warning message:
In file(filename, "r", encoding = encoding) :
  cannot open file 'dep.R': No such file or directory
Execution halted

It seems the dependency is never downloaded/loaded. I continue to have the same problem with the following options:

sparkConf:
  spark.files: "https://artifactory-url/test-r/dep.R"
deps:
  files: 
    - "https://artifactory-url/test-r/dep.R"
  filesDownloadDir: './'

How can I import dependency files in a SparkApplication for R? Thank you, Alessio

liyinan926 commented 4 years ago

Spark 2.4.x changed the way remote dependencies are downloaded. In Spark 2.3.x you could specify the directory the files were downloaded to (which is what the filesDownloadDir option controlled). In Spark 2.4.x, dependencies are fetched by Spark's built-in mechanism into a randomly named directory under the working directory, so a fixed relative path like source("dep.R") won't find them. I'm not sure how SparkR behaves in this case.
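
One thing that might work, though, is resolving the path at runtime instead of hard-coding it. SparkR exposes spark.getSparkFiles(), which returns the absolute path of a file distributed via spark.files / deps.files, wherever Spark happened to download it. A minimal, untested sketch of main.R:

library(SparkR)
sparkR.session(appName = "SparkR-DataFrame-example")

# Look up the absolute path of dep.R wherever Spark downloaded it,
# rather than assuming it landed in the current working directory.
source(spark.getSparkFiles("dep.R"))

sparkR.session.stop()

If that still fails in cluster mode, baking dep.R into the image and sourcing it by absolute path is another workaround.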

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 week ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.