nickrvieira opened this issue 3 years ago
You will need matching versions of the aws-java-sdk and hadoop-aws jars.
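For example, the jars can be pulled in at submit time via `--packages`. The versions below are illustrative, not prescriptive: `hadoop-aws` must match the Hadoop version bundled with your Spark build, and `aws-java-sdk-bundle` must match the version that `hadoop-aws` POM declares.

```shell
# Illustrative versions only; check which Hadoop version your Spark build ships with,
# and use the aws-java-sdk-bundle version declared by that hadoop-aws release.
# "your-app.py" is a placeholder for your application script.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  your-app.py
```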
It is not about solving the issue (we already did that before posting here); rather, it is about understanding why the submitting application needs the AWS SDKs (even in cluster mode) if those files are already uploaded/stored in S3 and can be fetched through the Spark Docker image.
Besides, why is this only triggered when you list py-files? You may have your application script in S3 as well, and that doesn't require the "submitting client" to have the AWS jars/SDKs.
Thanks for clarifying. It looks like the spark-submit script tries to download the files before launching the driver in Kubernetes cluster mode: https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
Search for "isKubernetesClusterModeDriver".
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm not sure whether this is an issue or the expected behavior in Spark.
But we've been trying to work with a lightweight Spark client for submitting jobs in cluster mode, so all we have on the submitting machine is the `pyspark` lib. However, if we do a submit with the `--py-files` argument, declaring files already hosted in S3, spark-submit raises an error requiring the `spark-submit` machine to have Hadoop FS in order to process the S3 files.

spark-submit error (Hadoop FS for accessing those py-files in S3):
I know this issue is related to the py-files in S3, because if I remove them, no Hadoop-related errors appear and the job launches.
Is this behavior expected?
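For context, the kind of submission that triggers this looks roughly like the following (API server, image name, bucket, and paths are all placeholders, not our actual values):

```shell
# Placeholder values throughout; only the shape of the command matters.
# The --py-files entry pointing at s3a:// is what requires the submitting
# client to resolve an S3 filesystem implementation.
spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  --py-files s3a://<bucket>/deps.zip \
  s3a://<bucket>/app.py
```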