rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

Problems when deploying Rheem on Spark #73

Closed zhipeng93 closed 7 years ago

zhipeng93 commented 7 years ago

Hi, I want to run some code using Rheem with Spark, following the docs from http://da.qcri.org/rheem/download.html:

Deploying and running a Rheem application on the cluster:
1. Create a RHEEM_HOME folder on the master node of your cluster.
2. Include the Rheem dependencies as shown above in your application's pom file.
3. Build your application with Maven and package its jars (including all Rheem dependencies) into one folder, call it lib. You may need to use the Maven assembly plugin for that.
4. Copy the lib directory to your RHEEM_HOME.
5. Copy the rheem.properties file to your RHEEM_HOME.
6. Define a Rheem classpath environment variable that has RHEEM_HOME and SPARK_HOME added.
7. Run your application, pointing to rheem.properties and the Rheem classpath: java -Drheem.configuration="path/to/rheem.properties" -cp "$RHEEM_CLASSPATH" yourapplication.Main

I searched everywhere but didn't find a better user guide for using Rheem. Specifically, I am confused about the following things: (1) how do I specify rheem.properties when deploying Rheem on a Spark cluster? (2) how can I verify that I have set up the environment correctly? Are there any other scripts or examples for deploying Rheem on Spark?

I tried something as a guess and set some variables in the shell like:

RHEEM_HOME=~/RHEEM_HOME
SPARK_HOME=~/spark-2.1.1-bin-hadoop2.7/bin

but I get errors like:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/qcri/rheem/core/plugin/Plugin

I am sure the needed jars are all packaged into one single jar, so how can this happen?

By the way, this link may not be available anymore: https://s3.amazonaws.com/rheem-qcri/wordcount-distro.zip

atroudi commented 7 years ago

Hi ghandzhipeng,

(1) Rheem uses rheem.properties for running applications on top of Rheem; for building and testing purposes there are internal rheem.properties files that are used accordingly. (2) Simply running mvn clean install should successfully build Rheem. If you want to generate standalone Rheem jars to use as packages for your application, you can run the following mvn command: mvn clean package -P distro. You will then find all the Rheem distribution jars inside the rheem-distro folder.
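
A minimal sketch of that build workflow, run from the Rheem source root (commands and folder name as described above):

    # build Rheem and run its tests
    mvn clean install
    # generate the standalone distribution jars
    mvn clean package -P distro
    # the distribution jars are now collected in the rheem-distro folder
    ls rheem-distro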

You may also want to take a look at the documentation in the Rheem GitHub repository: https://github.com/rheem-ecosystem/rheem
Most probably the above error stems from a Maven misconfiguration; somehow it is not generating all the Rheem dependencies.

Hope that helps you build your application using Rheem ;) Good day!

zhipeng93 commented 7 years ago

Hi atroudi,

I have already built Rheem and also built a simple WordCount jar file with the rheem-* jar files (like rheem-api, rheem-core, etc.) included as dependencies. My question is how to run it on a Spark cluster following this instruction:

Deploying and running a Rheem application on the cluster:
1. Create a RHEEM_HOME folder on the master node of your cluster.
2. Include the Rheem dependencies as shown above in your application's pom file.
3. Build your application with Maven and package its jars (including all Rheem dependencies) into one folder, call it lib. You may need to use the Maven assembly plugin for that.
4. Copy the lib directory to your RHEEM_HOME.
5. Copy the rheem.properties file to your RHEEM_HOME.
6. Define a Rheem classpath environment variable that has RHEEM_HOME and SPARK_HOME added.
7. Run your application, pointing to rheem.properties and the Rheem classpath: java -Drheem.configuration="path/to/rheem.properties" -cp "$RHEEM_CLASSPATH" yourapplication.Main

The error is also raised when I try to run the WordCount example after setting up the environment following the above steps.

I am not sure about the format and the content that I need to write into rheem.properties. What I have done so far: (1) set two variables like:

RHEEM_HOME=~/RHEEM_HOME
SPARK_HOME=~/spark-2.1.1-bin-hadoop2.7/bin

(2) wrote my rheem.properties:

spark.master= yarn

(3) launched the application using the following command:

java -Drheem.configuration=rheem.properties -cp rheem-githua-exmaples.jar com.github.sekruse.wordcount.java.WordCount

Is there something I have misunderstood? Thanks!

sekruse commented 7 years ago

Hi @ghandzhipeng,

looking at your exception, my guess would be that your classpath is not correct. In particular, in

java -Drheem.configuration=rheem.properties -cp rheem-githua-exmaples.jar com.github.sekruse.wordcount.java.WordCount

it seems that you are adding only rheem-githua-exmaples.jar to your classpath. As @atroudi said, you need to build Rheem with mvn clean package -Pdistro. This will collect all relevant jar files in rheem-distro/target/rheem-distro_2.11-0.3.0-distro/rheem-distro_2.11-0.3.0 (folder names may vary depending on your Scala and Rheem versions). You can copy all those jars into some folder, say lib. Furthermore, you will also need to add your application jars, e.g., rheem-githua-exmaples.jar, to lib. Then run

java -Drheem.configuration=rheem.properties -cp "lib/*" com.github.sekruse.wordcount.java.WordCount

In addition, you can also put your rheem.properties file into lib, then Rheem will automatically use it. In that case, you don't need the -Drheem.configuration=rheem.properties option.
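
Putting the steps above together, a sketch of the whole sequence (the distro path is the one mentioned above; adjust it to your Scala and Rheem versions):

    # collect the Rheem distribution jars built with: mvn clean package -Pdistro
    mkdir -p lib
    cp rheem-distro/target/rheem-distro_2.11-0.3.0-distro/rheem-distro_2.11-0.3.0/*.jar lib/
    # add your application jar
    cp rheem-githua-exmaples.jar lib/
    # optionally place rheem.properties in lib so Rheem picks it up automatically
    cp rheem.properties lib/
    # run with everything on the classpath
    java -cp "lib/*" com.github.sekruse.wordcount.java.WordCount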

Does this solve your problem?

In any case, you pointed out that the instructions for cluster deployment are not easily understandable, or even wrong, and we need to update them. Thanks for that! Ideally, we should have a ready-to-go distribution of the Rheem benchmarks.

atroudi commented 7 years ago

Ah okay, I guess it may be an issue with spark.master. In that case, I would suggest two things to test: (1) You can test the deployment with Spark standalone mode, where you assign spark.master the master node's IP address, e.g., spark.master = spark://10.2.5.23:7077. (I am not quite sure at this point whether YARN is currently supported; otherwise we will update shortly.) (2) The Spark libs (i.e., ~/spark-2.1.1-bin-hadoop2.7/lib) need to be loaded into your environment variables so that you don't run into issues with Spark dependencies. (Also note that Rheem is tested with Spark 1.6.*; Spark 2 should be fully supported in the coming release.)
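
For (1), a minimal rheem.properties for standalone mode might look like the following sketch (the master host/IP is a placeholder; 7077 is Spark's default standalone master port):

    # create a minimal rheem.properties for Spark standalone mode
    # (replace the placeholder address with your master node's)
    printf 'spark.master = spark://10.2.5.23:7077\n' > rheem.properties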

zhipeng93 commented 7 years ago

Hi sekruse, hi atroudi,

Thanks for your replies. I tried the things you suggested and have re-deployed a single-node Spark 1.6.1. But when I try to run the WordCount example as follows:

java -Drheem.configuration=rheem.properties -cp "lib/*" com.github.sekruse.wordcount.java.WordCount spark rheem.properties

It gives me another error:

Exception in thread "main" org.qcri.rheem.core.api.exception.RheemException: Job execution failed.
        at org.qcri.rheem.core.api.Job.doExecute(Job.java:274)
        at org.qcri.rheem.core.util.OneTimeExecutable.tryExecute(OneTimeExecutable.java:23)
        at org.qcri.rheem.core.util.OneTimeExecutable.execute(OneTimeExecutable.java:36)
        at org.qcri.rheem.core.api.Job.execute(Job.java:202)
        at org.qcri.rheem.core.api.RheemContext.execute(RheemContext.java:102)
        at org.qcri.rheem.core.api.RheemContext.execute(RheemContext.java:90)
        at org.qcri.rheem.api.PlanBuilder.buildAndExecute(PlanBuilder.scala:84)
        at org.qcri.rheem.api.DataQuanta.collect(DataQuanta.scala:670)
        at org.qcri.rheem.api.DataQuantaBuilder$class.collect(DataQuantaBuilder.scala:324)
        at org.qcri.rheem.api.BasicDataQuantaBuilder.collect(DataQuantaBuilder.scala:403)
        at com.github.sekruse.wordcount.java.WordCount.execute(WordCount.java:97)
        at com.github.sekruse.wordcount.java.WordCount.execute(WordCount.java:51)
        at com.github.sekruse.wordcount.java.WordCount.main(WordCount.java:115)
Caused by: java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;

Do you know what's going on? (I have added some jars like hadoop-common-2.6.5.jar, guava-22.0.jar, spark-assembly-1.6.1-hadoop2.6.0.jar, etc. due to some other ClassNotFound errors.) By the way, is there any advice on which software versions to use?

sekruse commented 7 years ago

This looks very much like a Scala compatibility error, which occurs when multiple versions of Scala's standard library are on the classpath (e.g., 2.10 and 2.11). My guess is that your spark-assembly-1.6.1-hadoop2.6.0.jar conflicts with the Scala version that Rheem was built with.

Here is what you can do about it:

  1. Find out which Scala version is baked into spark-assembly-1.6.1-hadoop2.6.0.jar. You can do this with a small tool called scala-detector that I recently created (see the sketch after this list for a way to check it by hand).
  2. Change Rheem's Scala version accordingly using the script bin/change-scala-version.sh (requires macOS, Linux, or Cygwin).
  3. Rebuild Rheem.
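
As a hand-rolled alternative to scala-detector, this sketch assumes the assembly jar bundles scala-library's library.properties descriptor at its root (usually the case for Spark assemblies); the version argument to change-scala-version.sh is an assumption by analogy with Spark's own script of the same name:

    # print the bundled Scala version from the assembly jar
    # (assumes library.properties is packaged at the archive root)
    unzip -p spark-assembly-1.6.1-hadoop2.6.0.jar library.properties
    # switch Rheem to the matching Scala version (argument format assumed),
    # then rebuild the distribution
    ./bin/change-scala-version.sh 2.10
    mvn clean package -Pdistro
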
zhipeng93 commented 7 years ago

Thanks! This works now. But I also want to use Rheem 0.2.2-SNAPSHOT. Is it possible to change the version from 0.3.1 to 0.2.2?

Or can algorithms developed with Rheem 0.2.2-SNAPSHOT (like ML4ALL: https://github.com/rheem-ecosystem/ml4all/blob/master/pom.xml) work with 0.3.1?

sekruse commented 7 years ago

There are not too many API changes between 0.2.2 and 0.3.x, so it should work. But I can't promise that it will. It's best to give it a try.

In particular, Rheem 0.2.x always uses Scala 2.11. If you want to change that, you will need to modify the main pom.xml accordingly and rebuild Rheem.

zhipeng93 commented 7 years ago

Ok, thanks guys. It works :)