kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Can the mainApplicationFile Python file be fetched from a volume mount or directly from S3? #843

Open ssvinoth22 opened 4 years ago

ssvinoth22 commented 4 years ago

Can the mainApplicationFile Python file be fetched from a volume mount or from S3 instead of a local path inside the Docker image? I'm using the Spark operator on Kubernetes instead of spark-submit. If the path is local://, we need to build the image with the Python files baked in, and any change to a file means rebuilding the image. Could you please suggest a way around this?

S3 example:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "2"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v2.4.5"
  imagePullPolicy: Always
  mainApplicationFile: s3:///<bucket>/src/main/python/pi.py
  sparkVersion: "2.4.5"
```

Or from a mounted volume.

liyinan926 commented 4 years ago

You can point mainApplicationFile to a file in S3. When you say "from volume mount", what do you mean by that?
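
As an aside (an editorial observation, not part of the original reply): the examples that end up working later in this thread use the s3a:// scheme rather than s3://. A plain s3:// URL usually has no filesystem implementation mapped to it, which is what the `No FileSystem for scheme: s3` error reported in the next comment is complaining about. A minimal spec fragment, with a placeholder bucket and path:

```yaml
spec:
  # s3a:// is the scheme served by Hadoop's S3A connector (hadoop-aws jar)
  mainApplicationFile: s3a://<bucket>/src/main/python/pi.py
```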

ssvinoth22 commented 4 years ago

So mainApplicationFile can be accessed from S3? Can you show an example? When I tried, I got the error below:

```
Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3
```

Maybe I'm missing something. Please provide an example of using S3 for mainApplicationFile. If I can use that, then I don't need the volume mount.

liyinan926 commented 4 years ago

Yes, mainApplicationFile can point to a remote file in S3 or GCS. If the error message appeared in the operator logs, it means the operator image is missing the jars needed to handle S3 dependencies. You would need to build a custom operator image with the necessary dependent jars and configuration.
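
For illustration, a minimal sketch of such a custom operator image. The base image tag, jar versions, and Maven URLs below are assumptions for the example, not taken from this thread; the hadoop-aws version has to match the Hadoop version bundled with the Spark distribution inside the image, and aws-java-sdk-bundle is the transitive dependency hadoop-aws needs at runtime.

```dockerfile
# Sketch only: extend the operator image with the jars needed for s3a:// URLs.
# Base image tag and jar versions are placeholders; align them with your setup.
FROM gcr.io/spark-operator/spark-operator:v1beta2-1.2.0-3.0.0

USER root

# hadoop-aws provides org.apache.hadoop.fs.s3a.S3AFileSystem;
# aws-java-sdk-bundle is its runtime dependency.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar /opt/spark/jars/

# Files added via ADD from a URL are created with mode 600; make them readable.
RUN chmod 644 /opt/spark/jars/hadoop-aws-3.2.0.jar /opt/spark/jars/aws-java-sdk-bundle-1.11.375.jar
```

If the `No FileSystem for scheme` error shows up in the driver log rather than the operator log, the same jars need to be present in the application image as well.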

batCoder95 commented 4 years ago

@ssvinoth22 - I believe this issue with the S3 jars can be resolved by adding a sparkConf section to the spec, as per this documentation.

Below is an example of my YAML file where I no longer see the S3 error:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v3.0.0"
  imagePullPolicy: Always
  mainApplicationFile: s3a://myBucket/input/appFile.py
  sparkVersion: "3.0.0"
  sparkConf:
    "spark.jars.packages": "com.amazonaws:aws-java-sdk-pom:1.11.271,org.apache.hadoop:hadoop-aws:3.1.0"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.access.key": "<access-key>"
    "spark.hadoop.fs.s3a.secret.key": "<secret-key>"
```

@liyinan926 - However, after deploying this, I am seeing another error:

```
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-d3d506ae-d79f-45f6-b459-cfa5dc649610-1.0.xml (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
	at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
	at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
	at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
	at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
	at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
	at org.apache.ivy.Ivy.resolve(Ivy.java:523)
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
	at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

Could you please suggest if I'm missing something?

Thanks in advance! :)

mrkwtz commented 4 years ago

We got the same error with the following configuration:

```yaml
sparkConf:
  "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.0"
```

```
Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-643244df-1233-45fc-b34c-b7c5259e62db;1.0
	confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-643244df-1233-45fc-b34c-b7c5259e62db-1.0.xml (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
	at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
	at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
	at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
	at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
	at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
	at org.apache.ivy.Ivy.resolve(Ivy.java:523)
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
	at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
stream closed
```

plathrop commented 4 years ago

Same problem here. I'm really struggling with this. Re:

> You would need to build a custom operator image with the necessary dependent jars and config.

I tried going down this route, but I'm utterly stumped as to how to add the necessary jars. I can build Spark from source, but how do I add the hadoop-aws dependency while making sure its transitive dependencies are also installed?

dukkune1 commented 4 years ago

Did anyone get a handle on this problem? I am also having the same issue.

dukkune1 commented 4 years ago

@plathrop, you can try this link: https://github.com/aws-samples/eks-spark-benchmark/tree/master/docker
It shows how to add the hadoop-aws dependency to an already-built image that has Spark.

batCoder95 commented 4 years ago

Hi all,

Apologies for not being able to reply earlier. You can try the Dockerfile shared by @bbenzikry, which has the S3 packages installed in it. By using that Dockerfile, you won't need to add the S3 packages through the YAML file, which avoids the `Ivy Default Cache set to: /opt/spark/.ivy2/cache` errors. Attaching the sample Dockerfile and YAML file that I used for reference: app_yaml.txt Dockerfile.txt

plathrop commented 4 years ago

Thanks for the reply. I did end up building my own images, with slightly different takes on the above solutions.

jherrmannNetfonds commented 3 years ago

I had a similar problem (`Exception in thread "main" java.io.IOException: No FileSystem for scheme: gs`) when fetching mainApplicationFile and dependencies from Google Cloud Storage. I ended up building a new spark-operator image with the necessary dependencies (gcs-connector-hadoop3-2.2.0-shaded.jar) and modified the Helm chart by adding a secret holding the service account key for GCS access to charts/spark-operator-chart/templates/deployment.yaml.
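
For context, a rough sketch of what that setup could look like. The base image tag, secret name, mount path, and the way credentials are exposed are assumptions for illustration; only the gcs-connector jar name comes from the comment above, and the commenter edited the chart's deployment.yaml directly rather than using these exact fragments.

```dockerfile
# Sketch only: extend the operator image with the GCS connector jar.
# Base image tag is a placeholder; use the operator version you actually run.
FROM gcr.io/spark-operator/spark-operator:v1beta2-1.2.0-3.0.0
USER root
ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.0/gcs-connector-hadoop3-2.2.0-shaded.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/gcs-connector-hadoop3-2.2.0-shaded.jar
```

And in the operator Deployment, something along these lines to make the service account key available (names and paths are made up):

```yaml
# Fragment of the operator Deployment spec, illustrative only.
spec:
  template:
    spec:
      containers:
        - name: spark-operator
          env:
            # One common pattern: point GOOGLE_APPLICATION_CREDENTIALS at the
            # mounted key so Google's application default credentials can find it.
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/gcs/key.json
          volumeMounts:
            - name: gcs-key
              mountPath: /etc/gcs
              readOnly: true
      volumes:
        - name: gcs-key
          secret:
            secretName: gcs-service-account-key
```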

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.