zhipeng93 closed this issue 7 years ago
Hi ghandzhipeng,
(1) Rheem uses a rheem.properties file to configure applications running on top of it; for building purposes, there are internal rheem.properties files that are used for testing;
(2) Simply running mvn clean install
should build Rheem successfully. If you want to generate standalone Rheem jars to use as dependencies in your application, you can run the following mvn command: mvn clean package -P distro
Then you will find all Rheem distribution jars inside the rheem-distro folder.
You may also need to take a look at the documentation in Rheem github repository: https://github.com/rheem-ecosystem/rheem
Most probably the above error comes from a Maven misconfiguration that prevents all Rheem dependencies from being generated.
Hope that helps you build your application using RHEEM ;) Good day!
Hi atroudi,
I have already built Rheem and also built a simple WordCount jar file with the rheem-*** jar files (like rheem-api, rheem-core, etc.) included as dependencies. My question is: how do I run it on a Spark cluster following these instructions?
Deploying and running a Rheem application on the cluster.
1. Create a RHEEM_HOME folder on the master node of your cluster.
2. Include Rheem dependencies as shown above in your application's pom file.
3. Build your application with Maven, and package its jars (including all Rheem dependencies) into one folder, call it lib. You may need to use the Maven assembly plugin for that.
4. Copy the lib directory to your RHEEM_HOME.
5. Copy the rheem.properties file to your RHEEM_HOME.
6. Define a Rheem classpath environment variable that has RHEEM_HOME and SPARK_HOME added.
7. Run your application, pointing to rheem.properties and the Rheem classpath: java -Drheem.configuration="path/to/rheem.properties" -cp "$RHEEM_CLASSPATH" yourapplication.Main
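Steps 1-7 can be sketched as a short shell session. The paths below are hypothetical placeholders (the Spark version is the one mentioned in this thread), not values from any official docs:

```shell
# Hypothetical layout for steps 1-7; adjust the paths to your cluster.
RHEEM_HOME="$HOME/RHEEM_HOME"
SPARK_HOME="$HOME/spark-2.1.1-bin-hadoop2.7"

# Step 6: build a classpath covering the jars in lib/ plus the Spark jars.
RHEEM_CLASSPATH="$RHEEM_HOME/lib/*:$SPARK_HOME/lib/*"
export RHEEM_CLASSPATH
echo "$RHEEM_CLASSPATH"
```

Step 7 then becomes: java -Drheem.configuration="$RHEEM_HOME/rheem.properties" -cp "$RHEEM_CLASSPATH" yourapplication.Main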
Also, the error arises when I try to run the WordCount example after setting up the environment following the above steps.
I am not sure about the format and content that I need to write in rheem.properties. What I have done so far: (1) set two variables like:
RHEEM_HOME=~/RHEEM_HOME
SPARK_HOME=~/spark-2.1.1-bin-hadoop2.7/bin
(2) my rheem.properties:
spark.master= yarn
(3) I launch the application using the following command:
java -Drheem.configuration=rheem.properties -cp rheem-githua-exmaples.jar com.github.sekruse.wordcount.java.WordCount
Is there something that I misunderstand? Thanks!
Hi @ghandzhipeng,
looking at your exception, my guess would be that your classpath is not correct. In particular, in
java -Drheem.configuration=rheem.properties -cp rheem-githua-exmaples.jar com.github.sekruse.wordcount.java.WordCount
it seems that you are adding only rheem-githua-exmaples.jar to your classpath. As @atroudi said, you need to build Rheem with mvn clean package -Pdistro. This will collect all relevant jar files in rheem-distro/target/rheem-distro_2.11-0.3.0-distro/rheem-distro_2.11-0.3.0 (folder names can deviate depending on your Scala and Rheem versions). You can copy all those jars into some folder, say lib. Furthermore, you will also need to add your application jars, e.g., rheem-githua-exmaples.jar, to lib. Then run
java -Drheem.configuration=rheem.properties -cp "lib/*" com.github.sekruse.wordcount.java.WordCount
In addition, you can also put your rheem.properties file into lib; then Rheem will use it automatically. In that case, you don't need the -Drheem.configuration=rheem.properties option.
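A minimal sketch of that variant, reusing the spark.master entry from the rheem.properties shown earlier in this thread as a placeholder setting:

```shell
# Put rheem.properties into lib/ so it is found on the classpath;
# the -Drheem.configuration flag is then unnecessary.
mkdir -p lib
printf 'spark.master = yarn\n' > rheem.properties   # placeholder value from this thread
cp rheem.properties lib/
ls lib/
```

The java invocation itself stays the same, minus the -Drheem.configuration flag.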
Does this solve your problem?
In any case, you pointed out that the instructions for cluster deployment are not easily understandable, or even wrong, and we need to update them. Thanks for that! Ideally, we should have a ready-to-go distribution of the Rheem benchmarks.
Ah okay, I guess it may be an issue with spark.master. In that case, I would suggest two things to test with:
(1) You can test the deployment in Spark standalone mode, where you assign spark.master the master node's IP address, e.g., spark.master = spark://10.2.5.23:7077. (I am not quite sure at this point whether YARN is currently supported; otherwise we will update shortly.)
(2) The Spark libs (i.e., ~/spark-2.1.1-bin-hadoop2.7/lib) need to be loaded into your environment variables to avoid issues with Spark dependencies. (Also, Rheem is tested with Spark 1.6.*; Spark 2 should be fully supported in the coming release.)
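For reference, a minimal rheem.properties for the standalone setup in (1) could look like the fragment below; the host and port are the example values from above and must be replaced with your master node's address:

```properties
# Minimal sketch -- spark.master is the only key discussed in this thread.
spark.master = spark://10.2.5.23:7077
```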
Hi sekruse, hi atroudi,
Thanks for your reply. I tried the things you guys suggested and re-deployed a single-node Spark 1.6.1. But when I run the WordCount example as follows:
java -Drheem.configuration=rheem.properties -cp "lib/*" com.github.sekruse.wordcount.java.WordCount spark rheem.properties
it gives me another error:
Exception in thread "main" org.qcri.rheem.core.api.exception.RheemException: Job execution failed.
at org.qcri.rheem.core.api.Job.doExecute(Job.java:274)
at org.qcri.rheem.core.util.OneTimeExecutable.tryExecute(OneTimeExecutable.java:23)
at org.qcri.rheem.core.util.OneTimeExecutable.execute(OneTimeExecutable.java:36)
at org.qcri.rheem.core.api.Job.execute(Job.java:202)
at org.qcri.rheem.core.api.RheemContext.execute(RheemContext.java:102)
at org.qcri.rheem.core.api.RheemContext.execute(RheemContext.java:90)
at org.qcri.rheem.api.PlanBuilder.buildAndExecute(PlanBuilder.scala:84)
at org.qcri.rheem.api.DataQuanta.collect(DataQuanta.scala:670)
at org.qcri.rheem.api.DataQuantaBuilder$class.collect(DataQuantaBuilder.scala:324)
at org.qcri.rheem.api.BasicDataQuantaBuilder.collect(DataQuantaBuilder.scala:403)
at com.github.sekruse.wordcount.java.WordCount.execute(WordCount.java:97)
at com.github.sekruse.wordcount.java.WordCount.execute(WordCount.java:51)
at com.github.sekruse.wordcount.java.WordCount.main(WordCount.java:115)
Caused by: java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
Do you know what's going on? (I have added some jars like hadoop-common-2.6.5.jar, guava-22.0.jar, spark-assembly-1.6.1-hadoop2.6.0.jar, etc., due to some other ClassNotFoundException errors. By the way, is there any advice on the versions of the software used?)
This seems pretty much like a Scala compatibility error that occurs when multiple versions of Scala's standard library are on the classpath (e.g., 2.10 and 2.11). My guess is that your spark-assembly-1.6.1-hadoop2.6.0.jar is in conflict with the Scala version Rheem uses.
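One heuristic way to check which Scala standard library a fat jar bundles: scala-library ships a library.properties file with a version.number entry, which often survives in assembly jars (whether it does depends on how the assembly was merged, so treat a miss as inconclusive):

```shell
# Heuristic check for the Scala version bundled in a (fat) jar.
# The jar name is the one from this thread -- adjust the path to yours.
JAR="spark-assembly-1.6.1-hadoop2.6.0.jar"
if [ -f "$JAR" ]; then
    # library.properties comes from scala-library and carries version.number
    unzip -p "$JAR" library.properties | grep version.number
else
    echo "jar not found: $JAR"
fi
```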
Here is what you can do about it:
(1) Find out which Scala version is bundled with your spark-assembly-1.6.1-hadoop2.6.0.jar. You can do this using a small tool called scala-detector that I recently created.
(2) Rebuild Rheem against the matching Scala version using bin/change-scala-version.sh (requires macOS, Linux, or Cygwin).
Thanks! This works now. But I also want to use RHEEM 0.2.2-SNAPSHOT. Is it possible to change the version from 0.3.1 to 0.2.2?
Or can algorithms developed with RHEEM 0.2.2-SNAPSHOT (like ML4ALL: https://github.com/rheem-ecosystem/ml4all/blob/master/pom.xml) work with 0.3.1?
There are not too many API changes between 0.2.2 and 0.3.x, so it should work. But I can't promise that it will. It's best to give it a try.
In particular, Rheem 0.2.x always uses Scala 2.11. If you want to change that, you will need to modify the main pom.xml accordingly and rebuild Rheem.
Ok, thanks guys. It works :)
Hi, I want to run some code using Rheem on Spark, following the docs from http://da.qcri.org/rheem/download.html, like:
I searched everywhere but didn't find a better user guide for using Rheem. Specifically, I am confused about the following things: (1) how do I specify rheem.properties when deploying Rheem on a Spark cluster? (2) how do I verify that I have set up the environment correctly? Are there any other scripts or examples for deploying Rheem on Spark?
I tried something based on my guesses and set some variables in the shell like:
but I get errors like:
I am sure the needed jars are all packaged into one single jar. But how can this happen?
By the way, this link may not be available anymore: https://s3.amazonaws.com/rheem-qcri/wordcount-distro.zip