ssvinoth22 opened this issue 4 years ago
You can point mainApplicationFile to a file in S3. When you say "from volume mount", what do you mean by that?
mainApplicationFile can be accessed from S3? Can you show an example?
When I tried, I got the error below:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3
Maybe I'm missing something. Please provide an example of using S3 for mainApplicationFile; if I can use that, then I don't need the volume mount.
Yes, you can use a mainApplicationFile that is remote in S3 or GCS. If the error message was in the operator logs, it means the operator image is missing the necessary jars to handle S3 dependencies. You would need to build a custom operator image with the necessary dependent jars and config.
@ssvinoth22 - I believe this S3 jars issue can be resolved by adding sparkConf in the spec section, as per this documentation.
Below is an example of my YAML file, with which I no longer see the S3 error:
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: pyspark-pi namespace: default spec: type: Python pythonVersion: "3" mode: cluster image: "gcr.io/spark-operator/spark-py:v3.0.0" imagePullPolicy: Always mainApplicationFile: s3a://myBucket/input/appFile.py sparkVersion: "3.0.0" sparkConf: "spark.jars.packages": "com.amazonaws:aws-java-sdk-pom:1.11.271,org.apache.hadoop:hadoop-aws:3.1.0" "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem" "spark.hadoop.fs.s3a.access.key": "<access-key>" "spark.hadoop.fs.s3a.secret.key": "<secret-key>"
@liyinan926 - However, after deploying this, I am seeing another error, shown below:
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-d3d506ae-d79f-45f6-b459-cfa5dc649610-1.0.xml (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.<init>(FileOutputStream.java:213) at java.io.FileOutputStream.<init>(FileOutputStream.java:162) at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70) at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62) at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563) at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176) at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245) at org.apache.ivy.Ivy.resolve(Ivy.java:523) at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387) at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Could you please suggest if I'm missing something?
Thanks in advance! :)
We got the same error with:
sparkConf:
  "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.0"
Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-643244df-1233-45fc-b34c-b7c5259e62db;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-643244df-1233-45fc-b34c-b7c5259e62db-1.0.xml (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
at org.apache.ivy.Ivy.resolve(Ivy.java:523)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
stream closed
Same problem here. I'm really struggling with this. Re:
"You would need to build a custom operator image with the necessary dependent jars and config."
I tried going down this route, but I'm utterly stumped as to how to add the necessary jars. I can build Spark from source, but how do I add in the hadoop-aws dependency while ensuring its transitive dependencies also get installed?
Did anyone get a handle on this problem? I am also having the same issue
@plathrop, you can try this link: https://github.com/aws-samples/eks-spark-benchmark/tree/master/docker. It shows how to add the hadoop-aws dependency to an already-built image that has Spark.
Hi all,
Apologies for not being able to reply earlier. You can try the Dockerfile shared by @bbenzikry, which has the S3 packages installed in it. By using this Dockerfile, you won't need to add the S3 packages via the YAML file, which avoids the "Ivy Default Cache set to: /opt/spark/.ivy2/cache" errors. Attaching the sample Dockerfile and YAML file that I used for reference.
app_yaml.txt
Dockerfile.txt
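For reference, a minimal sketch (not the attached app_yaml.txt) of what the SparkApplication can look like once the S3 jars are baked into such a custom image; the image name below is a placeholder, and with the jars preinstalled the spark.jars.packages line, and with it the Ivy cache error, goes away:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "myregistry/spark-py-s3:3.0.0"   # placeholder: custom image with hadoop-aws and the AWS SDK already under /opt/spark/jars
  imagePullPolicy: Always
  mainApplicationFile: s3a://myBucket/input/appFile.py
  sparkVersion: "3.0.0"
  sparkConf:
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.access.key": "<access-key>"
    "spark.hadoop.fs.s3a.secret.key": "<secret-key>"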
Thanks for the reply. I did end up building my own images, with slightly different takes on the above solutions.
I had a similar problem (Exception in thread "main" java.io.IOException: No FileSystem for scheme: gs) getting mainApplicationFile and dependencies from Google Cloud Storage. I ended up building a new spark-operator image with the necessary dependencies (gcs-connector-hadoop3-2.2.0-shaded.jar) and modified the Helm chart by adding a secret holding the service account key used to access GCS to charts/spark-operator-chart/templates/deployment.yaml.
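Roughly what that deployment.yaml addition can look like, as a sketch only (the secret name gcs-sa-key, the mount path, and the use of GOOGLE_APPLICATION_CREDENTIALS are assumptions; the exact auth setting depends on the gcs-connector version):

# excerpt of the operator Deployment's pod template
spec:
  containers:
    - name: spark-operator
      env:
        - name: GOOGLE_APPLICATION_CREDENTIALS   # assumed: connector authenticates via application default credentials
          value: /etc/gcs/key.json
      volumeMounts:
        - name: gcs-sa-key
          mountPath: /etc/gcs
          readOnly: true
  volumes:
    - name: gcs-sa-key
      secret:
        secretName: gcs-sa-key                   # hypothetical secret holding the service account key file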
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Can the mainApplicationFile Python file be fetched from a volume mount or S3 instead of being local in the Docker image? I'm using the Spark operator in Kubernetes instead of spark-submit. If it is local://, then we need to build images with the Python files in them, and for any change to a file we have to rebuild the image. Could you please suggest a way? S3:
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: pyspark-pi namespace: default spec: type: Python pythonVersion: "2" mode: cluster image: "gcr.io/spark-operator/spark-py:v2.4.5" imagePullPolicy: Always **mainApplicationFile: s3:///<bucket>/src/main/python/pi.py** sparkVersion: "2.4.5"
Or from a mounted volume?
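For completeness, a rough sketch of the volume-mount alternative the question refers to, assuming the Python file is provided on a PersistentVolumeClaim (the claim name app-code and the mount path are hypothetical); the SparkApplication spec supports volumes plus driver/executor volumeMounts, and mainApplicationFile then points at the in-container path:

spec:
  mainApplicationFile: local:///mnt/app-code/pi.py   # where the mounted file appears inside the driver container
  volumes:
    - name: app-code
      persistentVolumeClaim:
        claimName: app-code                          # hypothetical PVC containing the .py files
  driver:
    volumeMounts:
      - name: app-code
        mountPath: /mnt/app-code
  executor:
    volumeMounts:
      - name: app-code
        mountPath: /mnt/app-code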