oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

Support for spark.jars.packages is broken? #247

Closed pang-wu closed 1 year ago

pang-wu commented 2 years ago

The issue is the same as this one, I believe. tl;dr: if a developer specifies jar dependencies through spark.jars.packages, these packages are not distributed to the worker nodes (or they are distributed but not placed on the class path). One simple way to test this is to use RayDP to read a file from S3 with the following code:

import ray
import raydp

ray.init(address='auto')
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=2,
                         executor_cores=4,
                         executor_memory=8 * 1024 * 1024 * 1024,  # 8 GB, in bytes
                         configs={'spark.jars.packages': 'org.apache.hadoop:hadoop-aws:3.3.1'}
                         )

# Read from S3
df = spark.read.json(path='s3a://....')
df.show()

and you will get an error:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
        ... 40 more

If I use a script to download the jar in setup commands and then set the path with spark.jars, it works. But I don't feel this is the best approach. Any help would be appreciated.
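
For reference, that workaround looks roughly like this (a sketch; the jar paths are only examples and assume the cluster setup commands already downloaded hadoop-aws and the matching aws-java-sdk-bundle to that directory on every node):

import ray
import raydp

ray.init(address='auto')

# spark.jars does not resolve transitive dependencies, so the AWS SDK bundle
# has to be listed explicitly alongside hadoop-aws (paths are examples).
jars = ','.join([
    '/home/ubuntu/jars/hadoop-aws-3.3.1.jar',
    '/home/ubuntu/jars/aws-java-sdk-bundle-1.11.901.jar',
])
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=2,
                         executor_cores=4,
                         executor_memory=8 * 1024 * 1024 * 1024,
                         configs={'spark.jars': jars})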

kira-lin commented 2 years ago

Hi @pang-wu , thanks for opening this issue. We'll look into this. It seems like some options are not copied to executors.

pang-wu commented 2 years ago

Awesome! Please let me know if you need anything from me. We are actively using RayDP for a lot of things and looking forward to the tool getting better! @kira-lin

abhay-agarwal commented 1 year ago

Did you ever figure out how to get S3AFileSystem recognized by the executors?

guseggert commented 1 year ago

I needed something similar for GCS; what worked for me was adding the following to setup_commands:

  - mkdir -p ~/jars
  - wget -P ~/jars https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.14/gcs-connector-hadoop3-2.2.14-shaded.jar

and then in the spark conf:

"spark.driver.extraClassPath": "/home/ubuntu/jars/gcs-connector-hadoop3-2.2.14-shaded.jar",
"spark.executor.extraClassPath": "/home/ubuntu/jars/gcs-connector-hadoop3-2.2.14-shaded.jar",
"spark.hadoop.fs.AbstractFileSystem.gs.impl":"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
"spark.hadoop.fs.gs.impl":"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",

It looks like hadoop-aws does not release a shaded JAR (see https://issues.apache.org/jira/browse/HADOOP-15387 and https://issues.apache.org/jira/browse/HADOOP-15924), but one path forward would be to make your own shaded JAR and then repeat similar steps.

I also noticed that when using spark.jars.packages, the JARs were placed in ~/.ivy2/jars, so I added ~/.ivy2/jars/* to the extraClassPath, but that did not work for some reason; perhaps someone else will have better luck than me :(.

pang-wu commented 1 year ago

Did you ever figure out how to get S3AFileSystem recognized by the executors?

@abhay-agarwal We have a working solution in prod (AWS S3): we place the needed jars under the jars directory of our standalone Spark deployment (we are not using pyspark installed from pip). In that case no extra class path needs to be set (we do the same to enable Glue support). Some scripts for your reference:

wget -nc https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar -P "$SPARK_INSTALL_DIR/jars"
wget -nc https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar -P "$SPARK_INSTALL_DIR/jars"
wget -nc https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar -P "$SPARK_INSTALL_DIR/jars"
wget -nc https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/${SPARK_VERSION}/spark-hadoop-cloud_2.12-${SPARK_VERSION}.jar -P "$SPARK_INSTALL_DIR/jars"
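
For completeness, the only wiring needed on the Python side in this setup is making sure pyspark picks up that standalone deployment, e.g. via SPARK_HOME. A rough sketch, assuming the deployment lives at /opt/spark (an example path) and its version matches the installed pyspark:

import os

# Point pyspark's launcher at the standalone Spark deployment whose jars/ directory
# already contains the hadoop-aws / aws-java-sdk-bundle jars downloaded above.
os.environ.setdefault('SPARK_HOME', '/opt/spark')

import ray
import raydp

ray.init(address='auto')
spark = raydp.init_spark(app_name='RayDP S3 Example',
                         num_executors=2,
                         executor_cores=4,
                         executor_memory=8 * 1024 * 1024 * 1024)

# No spark.jars.packages needed here; the S3A classes come from the deployment's jars.
df = spark.read.json('s3a://...')
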
abhay-agarwal commented 1 year ago

Cool! This is exactly what we ended up doing as well, haha. We use Docker on KubeRay, so we just placed similar commands in the Dockerfiles.

Deegue commented 1 year ago

Gentle ping @pang-wu, sorry for the late reply to this issue. I tested with the code you provided above, using 2 nodes for the test:

Node A is the Ray head node; all 4 jars were downloaded locally and the client runs on this node.
Node B is the worker node; it has no jars and the raydp-java-worker runs on this node.

The result is that the JSON files in my bucket can be read correctly and everything works well. So I looked into the raydp-java-worker log to find out how the jars are copied to other worker nodes. Here are the 4 steps as I understand them:

Step 1: The client opens a port like spark://IP:PORT.
Step 2: The executor connects to that port and fetches the jars into a local temporary directory.
Step 3: The executor copies the jars from the temporary location into its working directory.
Step 4: The jars are loaded through the class loader.

This is the executor log for org.apache.hadoop_hadoop-aws-3.3.1.jar:

2023-07-13 06:11:19,076 INFO Executor [dispatcher-Executor]: Fetching spark://10.0.2.140:44539/jars/org.apache.hadoop_hadoop-aws-3.3.1.jar with timestamp 1689228852787
2023-07-13 06:11:19,077 INFO Utils [dispatcher-Executor]: Fetching spark://10.0.2.140:44539/jars/org.apache.hadoop_hadoop-aws-3.3.1.jar to /tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/spark-3b7a118e-a4b0-4e2a-a92b-44da3b8586ad/fetchFileTemp2388565500155871038.tmp
2023-07-13 06:11:19,089 INFO Utils [dispatcher-Executor]: Copying /tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/spark-3b7a118e-a4b0-4e2a-a92b-44da3b8586ad/-16085021461689228852787_cache to /tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/_tmp/org.apache.hadoop_hadoop-aws-3.3.1.jar
2023-07-13 06:11:19,094 INFO Executor [dispatcher-Executor]: Adding file:/tmp/ray/session_2023-07-13_06-12-37_355158_3120799/app-20230713061112806/1/_tmp/org.apache.hadoop_hadoop-aws-3.3.1.jar to class loader

IIUC, this issue no longer exists, right?

Deegue commented 1 year ago

I tried with Spark 3.3.2 and Spark 3.2.1; both passed the test. I got errors (the same as in this issue) on all Spark 3.1.x versions, since the Hadoop jars do not match.
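
For anyone hitting this on other Spark versions: the org.apache.hadoop:hadoop-aws version in spark.jars.packages should match the Hadoop version your Spark build ships with. A quick way to check it from Python (it goes through the private _jvm gateway, so treat it as an unofficial trick):

# 'spark' is the session returned by raydp.init_spark; the printed Hadoop version
# is the hadoop-aws version to pin in spark.jars.packages.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())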

pang-wu commented 1 year ago

@Deegue Thanks for circling back. Just tested on our side (Spark 3.3.2 + Ray 2.4.0 + a custom build of RayDP master) and this is fixed! I will close this issue. The package resolution log from our run, for reference:

Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b015f014-fa76-426b-a3b9-e17cee6f7b26;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.3.1 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.901 in central
    found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar ...
    [SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.3.1!hadoop-aws.jar (100ms)
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar ...
    [SUCCESSFUL ] com.amazonaws#aws-java-sdk-bundle;1.11.901!aws-java-sdk-bundle.jar (1500ms)
downloading https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
    [SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (12ms)
:: resolution report :: resolve 3492ms :: artifacts dl 1617ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.901 from central in [default]
    org.apache.hadoop#hadoop-aws;3.3.1 from central in [default]
    org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-b015f014-fa76-426b-a3b9-e17cee6f7b26
    confs: [default]
    3 artifacts copied, 0 already retrieved (189078kB/165ms)