uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

Which branch should be used for building the jar and image for Remote Shuffle Service in a K8s environment? #63

Open roligupt opened 2 years ago

roligupt commented 2 years ago

I see there are two branches, K8 and rss-k8. Which branch should be used for building the jar and image for Remote Shuffle Service in a K8s environment?

hiboyang commented 2 years ago

I have a fork that makes Remote Shuffle Service work on k8s and also removes the dependency on ZooKeeper. The fork is here: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1

roligupt commented 2 years ago

> I have a fork that makes Remote Shuffle Service work on k8s and also removes the dependency on ZooKeeper. The fork is here: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1

Thanks for your quick response! I will try it out.

roligupt commented 2 years ago

> I have a fork that makes Remote Shuffle Service work on k8s and also removes the dependency on ZooKeeper. The fork is here: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1

@hiboyang one quick question about Spark with the client jar: I want to build my own Spark image that includes the jar. I am not building the Spark distribution from scratch but using the prebuilt Spark binary (spark-3.1.1-bin-hadoop3.2.tgz) provided on the Apache Spark download site. How do I go about getting the client jar to include in the Spark image?

hiboyang commented 2 years ago

You need to put the Remote Shuffle Service client jar file inside the jars folder in the Spark image.

You can download the Remote Shuffle Service client jar file from Maven:

    <dependency>
        <groupId>org.datapunch</groupId>
        <artifactId>remote-shuffle-service-client-spark31</artifactId>
        <version>0.0.12</version>
    </dependency>

If you download that Spark bin (spark-3.1.1-bin-hadoop3.2.tgz), you can extract it, add the Remote Shuffle Service client jar file to the jars folder, then run a command like the following to build your image:

./dev/make-distribution.sh --name spark-with-remote-shuffle-service-client --pip --tgz -Phive -Phive-thriftserver -Pkubernetes -Phadoop-3.2 -Phadoop-cloud

Please note that if you use the remote-shuffle-service-client-spark31 jar file here, you need to use the Remote Shuffle Service server from this branch as well: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1
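Putting the steps above together, the manual preparation might look like the following sketch. The jar file name is inferred from the Maven coordinates quoted above, and the plugin version and output paths are assumptions; adjust them for your environment.

```shell
# Fetch the client jar from Maven Central (coordinates from the <dependency> snippet above).
mvn org.apache.maven.plugins:maven-dependency-plugin:3.1.2:copy \
    -Dartifact=org.datapunch:remote-shuffle-service-client-spark31:0.0.12 \
    -DoutputDirectory=.

# Unpack the prebuilt Spark distribution and drop the client jar into its jars folder.
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
cp remote-shuffle-service-client-spark31-0.0.12.jar spark-3.1.1-bin-hadoop3.2/jars/
```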

roligupt commented 2 years ago

> You need to put the Remote Shuffle Service client jar file inside the jars folder in the Spark image.
>
> You can download the Remote Shuffle Service client jar file from Maven:
>
>     <dependency>
>         <groupId>org.datapunch</groupId>
>         <artifactId>remote-shuffle-service-client-spark31</artifactId>
>         <version>0.0.12</version>
>     </dependency>
>
> If you download that Spark bin (spark-3.1.1-bin-hadoop3.2.tgz), you can extract it, add the Remote Shuffle Service client jar file to the jars folder, then run a command like the following to build your image:
>
>     ./dev/make-distribution.sh --name spark-with-remote-shuffle-service-client --pip --tgz -Phive -Phive-thriftserver -Pkubernetes -Phadoop-3.2 -Phadoop-cloud
>
> Please note that if you use the remote-shuffle-service-client-spark31 jar file here, you need to use the Remote Shuffle Service server from this branch as well: https://github.com/datapunchorg/RemoteShuffleService/tree/k8s-spark-3.1

@hiboyang I understand everything except that spark-3.1.1-bin-hadoop3.2.tgz is already a distribution package that comes with the jar files, and as far as I understand, ./dev/make-distribution.sh is what creates the distribution package. If I already have the Spark binaries in spark-3.1.1-bin-hadoop3.2.tgz, I don't need to run ./dev/make-distribution.sh; I can simply copy the jar and build the image.

hiboyang commented 2 years ago

Yes, simply copying the jar and building the image should work as well.
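The copy-the-jar approach can be sketched as a minimal Dockerfile. The base image tag is an assumption (substitute whatever Spark 3.1.1 image you normally build from), and the jar name follows the Maven coordinates quoted earlier in this thread:

```dockerfile
# Base image is a placeholder; use your own Spark 3.1.1 image.
FROM apache/spark:v3.1.1

# Add the Remote Shuffle Service client jar to Spark's classpath.
COPY remote-shuffle-service-client-spark31-0.0.12.jar /opt/spark/jars/
```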