sponiro / gradle-hadoop-plugin

Simple startup script generation for hadoop
Apache License 2.0

More detailed instruction, thanks #2

misaka-10032 closed this issue 9 years ago

misaka-10032 commented 9 years ago

Sorry, but could you provide more specific instructions? For example, do I need to add extra Hadoop library dependencies, and if so, which ones? Do I need to set HADOOP_HOME? By the way, I installed Hadoop with brew install hadoop, and the HADOOP_HOME environment variable is not set. I also want to integrate with IntelliJ, but I find the Hadoop dependencies are missing.

P.S. I am a student learning Hadoop. I have now set it up and want to find a nice Gradle plugin for writing MapReduce programs. At the very least, I want to run WordCount with Gradle, given the Java source code.

Thank you very much if you can help!

sponiro commented 9 years ago

I'll try to help, but this was a very short-lived side project and I barely got Hadoop to work myself. That said, you need a working Hadoop installation to make use of this plugin. It is just a helper that generates a start script with all your dependencies wired in, which you can then call. Unfortunately my example project got lost, so I can't show you how to start. But to start coding in IntelliJ, you need to add the Hadoop dependencies to your build.gradle. Be sure to get the version matching your Hadoop installation from Maven; that is really important. You probably want to add them as provided dependencies, see here: http://forums.gradle.org/gradle/topics/how_do_i_best_define_dependencies_as_provided

About brew: sorry, I have never used it, so I can't say anything about it. Remember: this plugin just creates a shell script to ease the burden of launching your program on Hadoop. You can simply apply it in your build.gradle and execute its tasks. Have a look at the generated script to see if it is what you expect.
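
Off the top of my head, a minimal sketch would look like the following. This is a guess, not something I have tested: a provided configuration is not built into Gradle's java plugin, so you declare it yourself, and the version number is only a placeholder for whatever matches your installation.

configurations {
    // jars needed at compile time but supplied by the hadoop
    // installation at runtime, so they should not be packaged
    provided
}

// make the provided jars visible to the compiler
sourceSets.main.compileClasspath += configurations.provided

dependencies {
    // pick the version that matches your local hadoop installation
    provided 'org.apache.hadoop:hadoop-common:2.5.1'
}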

Hope that helps!

misaka-10032 commented 9 years ago

Thank you for your patience. Actually, I still cannot get it to run. I have successfully compiled the jar, but I cannot run it. And I found something strange: I ran hadoop jar lib/XX.jar XXX both in the terminal and in IntelliJ, and got different error messages. I am sure I set up Hadoop correctly, because I can run its sample. I am also sure that the version declared in Gradle matches my local setup (2.5.1). I don't know why.

Could you please provide a simple example (although I know you cannot find your old one)? If that would take you a long time, please let me know, thanks! I will consider other solutions, because deadlines are driving me crazy.

misaka-10032 commented 9 years ago

Sorry, an update on my progress: I managed to run hadoop jar lib/XX.jar XXX under the root folder of my WordCount project in the terminal, but I still cannot get it to run in IntelliJ. I found unixHadoopStartScript.txt in your plugin:

exec hadoop jar "${appJar}" <% if (mainClassName) print mainClassName %> -libjars "\${HADOOP_LIBJARS}" "\$@"

Although I don't understand every detail of this command, I think it is nearly the same as what I did in the terminal. The strange thing is that it runs in the terminal but not in IntelliJ. Here is the error message:

Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
    at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
    at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832)
    at com.rocky.WordCount.main(WordCount.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

Any idea why? Thanks!

P.S. It seems the hadoop command I run in the terminal is not the same as the one produced by your Gradle script? By the way, what I run in the terminal is hadoop jar gopher-1.0.jar com.rocky.WordCount /data/doc1 /output-2.

sponiro commented 9 years ago

I think you misunderstand the purpose of this plugin. It does not execute your program on its own; it creates a shell script to be executed. The unixHadoopStartScript.txt is a template that gets filled in with variables. When you execute one of the plugin's tasks, you get a shell script in your build folder, with all the dependencies in the right place. Hadoop needs every dependency you used, plus your program itself, to run properly, and the generated script takes care of that. If you want to run it from IntelliJ, you have to build that setup yourself, probably by hand (or you may find an IntelliJ plugin for it). Constructing the correct call is not hard, but it is boring and error-prone, which is why I built a plugin that creates a fully startable Hadoop program. Just look into the build directory after you have executed one of the plugin's tasks.
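
If you want to drive it from Gradle afterwards, something along these lines could work. This is a sketch, not tested: it assumes the install task is called hadoopInstall and that the script ends up under build/hadoop/install/<project>/bin with the project's name, so check your build directory for the actual layout.

task runWordCount(type: Exec, dependsOn: hadoopInstall) {
    // assumed path of the generated start script; the trailing
    // arguments are the input and output paths your job expects
    commandLine "$buildDir/hadoop/install/gopher/bin/gopher",
            '/data/doc1', '/output-2'
}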

misaka-10032 commented 9 years ago

I see. I misunderstood exec as a Gradle command, which I thought might execute it for me. Now it is clear. Very helpful, thank you very much.

misaka-10032 commented 9 years ago

I still have some problems using this plugin. So you mean that if I want to generate the script, I need to run

gradle hadoopInstall

But I encounter this problem:

:hadoopStartScripts

FAILURE: Build failed with an exception.

* What went wrong:
Could not resolve all dependencies for configuration ':runtime'.
> Could not find org.apache.hadoop:hadoop-common:2.5.1.

Here's my build.gradle:

version = '1.0'

buildscript {
    repositories {
        mavenCentral()
        maven {
            url 'http://dl.bintray.com/sponiro/gradle-plugins'
        }
    }
    dependencies {
        classpath group: 'de.fanero.gradle.plugin.hadoop', name: 'gradle-hadoop-plugin', version: '0.2'
    }
}

apply plugin: 'hadoop'

dependencies {
//    testCompile group: 'junit', name: 'junit', version: '4.11'
    compile 'org.apache.hadoop:hadoop-common:2.5.1'
}

hadoop {
    buildSubDir = "hadoop"
    mainClassName = 'com.rocky.WordCount'
}

By the way, when I run

gradle dependencies

it reports:

archives - Configuration for archive artifacts.
No dependencies

compile - Compile classpath for source set 'main'.
\--- org.apache.hadoop:hadoop-common:2.5.1 FAILED

default - Configuration for default artifacts.
\--- org.apache.hadoop:hadoop-common:2.5.1 FAILED

runtime - Runtime classpath for source set 'main'.
\--- org.apache.hadoop:hadoop-common:2.5.1 FAILED

testCompile - Compile classpath for source set 'test'.
\--- org.apache.hadoop:hadoop-common:2.5.1 FAILED

testRuntime - Runtime classpath for source set 'test'.
\--- org.apache.hadoop:hadoop-common:2.5.1 FAILED

I don't know if I wrote something wrong in build.gradle. I also tested with the java plugin and made sure that compile 'org.apache.hadoop:hadoop-common:2.5.1' is correct. Thank you very much if you can help.

sponiro commented 9 years ago

You need a second repositories block that is not inside the buildscript block. The repositories declared inside buildscript are not visible outside of it, so your project dependencies cannot be resolved. You are probably missing something like:

repositories {
    mavenCentral()
}
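
Applied to your file, the whole thing would look like this (only the repositories block outside buildscript is new):

version = '1.0'

buildscript {
    repositories {
        mavenCentral()
        maven {
            url 'http://dl.bintray.com/sponiro/gradle-plugins'
        }
    }
    dependencies {
        classpath group: 'de.fanero.gradle.plugin.hadoop', name: 'gradle-hadoop-plugin', version: '0.2'
    }
}

apply plugin: 'hadoop'

// this is the missing piece: repositories for your project
// dependencies, independent of the buildscript repositories
repositories {
    mavenCentral()
}

dependencies {
    compile 'org.apache.hadoop:hadoop-common:2.5.1'
}

hadoop {
    buildSubDir = "hadoop"
    mainClassName = 'com.rocky.WordCount'
}
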
misaka-10032 commented 9 years ago

Oh yeah, that got me a step further, but I still have some problems. The generated script looks like:

exec hadoop jar "../lib/gopher-1.0.jar" com.rocky.WordCount -libjars "${HADOOP_LIBJARS}" "$@"

I have solved the issue of Hadoop passing -libjars on to my application as an argument by using GenericOptionsParser, but another issue has come up:

File does not exist: hdfs://localhost:9000/Users/rocky/College/homework/distributed-systems/gopher/build/hadoop/install/gopher/lib/gopher-1.0.jar

It seems Hadoop treats the local jars as if they were on HDFS. I don't know how Hadoop interprets this, but I see that the HADOOP_LIBJARS you defined are relative paths:

HADOOP_LIBJARS="../lib/gopher-1.0.jar,..

Any suggestions? I don't think it's a good idea to copy gopher-1.0.jar to hdfs://Users/rocky/College/homework/distributed-systems/gopher/build/hadoop/install/gopher/lib/gopher-1.0.jar

Thank you very much!

sponiro commented 9 years ago

As far as I recall, the -libjars option puts the jars into the distributed cache; otherwise the workers would not have access to them. I suspect you should activate the exportHadoopClasspath option. The local JVM needs it to find the jars, but that is a shot in the dark.
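
If it works the way I think, you would set it in the extension, roughly like this (the exact property syntax is from memory, so double-check it):

hadoop {
    // export the hadoop classpath into the generated start script
    // so the local JVM can locate the jars
    exportHadoopClasspath = true
}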

misaka-10032 commented 9 years ago

I finally solved it following this site. And sorry again, but now I have another question: what if I write several Hadoop applications in the same project?

For example, my own tasks A1 and A2 should build and run my two Hadoop applications:

hadoop {
    mainClassName = 'myMain'
}

task A1(dependsOn: hadoopInstall) {
    // run the script you generated
}

task A2(dependsOn: hadoopInstall) {
    // run the script you generated
}

If I do it this way, both A1 and A2 use the same configuration specified in the hadoop extension. But what I want is to specify a different main class for each of A1 and A2, so how do I configure hadoop in that case?

P.S. This is not an issue with your plugin, but with my customization. It's very kind of you to answer my questions, so I'm reaching out for your help again. Thank you very much.

sponiro commented 9 years ago

I have never gone this far with the plugin. Maybe you can go with a multi-project build in Gradle; that way you could specify a different configuration per subproject.
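
Roughly like this, with made-up subproject and class names. In settings.gradle:

include 'wordcount-one', 'wordcount-two'

Then each subproject gets its own build.gradle that applies the plugin and sets its own main class, for example in wordcount-one/build.gradle:

apply plugin: 'hadoop'

hadoop {
    mainClassName = 'com.rocky.WordCountOne'
}

wordcount-two does the same with its other main class, and each subproject's hadoopInstall generates its own script. The buildscript block with the plugin classpath can be shared from the root build.gradle so you don't repeat it.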

misaka-10032 commented 9 years ago

Actually, I've asked about it on Stack Overflow. Since only a single extension is used (as in this plugin), it doesn't support multiple applications in a single project. The quick solution is to use multiple projects; an uglier one is to copy the hadoopInstall task out and use self-defined properties. Anyway, thank you very much! Now I can focus on my real Hadoop application.