kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

--py-files already in S3 demands hadoop.fs #1328

Open nickrvieira opened 3 years ago

nickrvieira commented 3 years ago

I'm not sure whether this should be an issue here or whether it is normal, expected behavior in Spark.

We've been trying to use a lightweight Spark client for submitting jobs in cluster mode, so all the submitting machine has is the pyspark lib.

However, if we submit with the --py-files argument, declaring files already hosted in S3, spark-submit raises an error requiring the submitting machine to have the Hadoop filesystem (hadoop.fs) classes for S3 in order to process those files. spark-submit:

spark-submit \
  --master k8s://EKS_ENDPOINT:443 \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=ECR_PATH/spark:latest \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.node.selector.eks.amazonaws.com/nodegroup=node-group-spark-test \
  --conf spark.kubernetes.namespace=test \
  --py-files s3a://bucket/helper_lib.py \
  --num-executors 1 \
  --name test_submit \
  --queue root.default \
  --deploy-mode cluster \
  s3a://app_bucket/test.py --spark-master-url k8s://EKS_ENDPOINT:443

Error (Hadoop FS for accessing those py-files in S3):

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I know this issue is related to the py-files in S3, because if I remove them, no Hadoop-related errors are raised and the job launches.

Is this behavior expected?

vishal98 commented 3 years ago

You will need matching jars for aws-java-sdk.jar and hadoop-aws.jar.
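
For reference, a minimal sketch of what "matching jars" can look like on the submitting machine, assuming a Spark distribution built against Hadoop 3.3.x; the version numbers are illustrative and must be matched to the hadoop-* jars already shipped under $SPARK_HOME/jars:

# Hedged sketch: hadoop-aws must match the Hadoop version your Spark build ships
# with, and aws-java-sdk-bundle must match what that hadoop-aws release expects.
HADOOP_AWS_VERSION=3.3.4    # assumption: Spark built against Hadoop 3.3.4
AWS_SDK_VERSION=1.12.262    # assumption: the bundle version used by hadoop-aws 3.3.4
curl -fSL -o "$SPARK_HOME/jars/hadoop-aws-${HADOOP_AWS_VERSION}.jar" \
  "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar"
curl -fSL -o "$SPARK_HOME/jars/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar" \
  "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar"

With those two jars on the client's classpath, spark-submit itself can resolve the s3a:// URIs passed via --py-files.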

nickrvieira commented 3 years ago

You will need matching jars for aws-java-sdk.jar and hadoop-aws.jar.

It is not about solving the issue (we had already done that before posting here), but rather about understanding why we need to have the AWS SDKs in our submitting application (even in cluster mode) if those files are already uploaded/stored in S3 and can be fetched through the Spark Docker image.

Besides, why does this only trigger if you list py-files, even though you may have your application script in S3 as well, and that doesn't require the "submitting client" to have the AWS jars/SDKs?
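
As a hedged aside, one way to keep the submitting client free of AWS jars entirely is to bake the helper modules into the Spark image and reference them with local:// URIs, which are resolved inside the driver/executor containers rather than by the client; the image path below is hypothetical:

# Hedged sketch: assumes helper_lib.py was copied into the Spark image at the
# hypothetical path /opt/spark/work-dir/helper_lib.py. local:// URIs point at
# files already inside the containers, so the client never touches S3 for them.
spark-submit \
  --master k8s://EKS_ENDPOINT:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=ECR_PATH/spark:latest \
  --conf spark.kubernetes.namespace=test \
  --py-files local:///opt/spark/work-dir/helper_lib.py \
  s3a://app_bucket/test.py --spark-master-url k8s://EKS_ENDPOINT:443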

vishal98 commented 3 years ago

Thanks for clarifying. It looks like the spark-submit script tries to download the file before launching the driver in Kubernetes cluster mode: https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

search for "isKubernetesClusterModeDriver"
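
A quick way to locate that code path in a local checkout of apache/spark (the path assumes the repository root):

# Assumes a local clone of https://github.com/apache/spark
grep -n "isKubernetesClusterModeDriver" \
  core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala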

github-actions[bot] commented 23 hours ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.