glennklockwood / hpchadoop

Hadoop for Traditional HPC Users
http://users.sdsc.edu/~glockwood/comp/hadoopcluster.php

ppn on Gordon #1

Closed · zonca closed this issue 11 years ago

zonca commented 11 years ago

Hi, I'm testing Hadoop on Gordon. Is it possible to run it with ppn higher than 1? Thanks!

glennklockwood commented 11 years ago

In short, no: each Hadoop node runs a single TaskTracker and DataNode, and these services bind to specific TCP ports. Trying to run a second Hadoop node on the same physical node would fail because only the first instance would be able to bind to those ports.
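For context, these are the well-known ports a stock Hadoop 1.x node holds open (50010/50020/50075 for the DataNode, 50060 for the TaskTracker's status page). A quick way to see the conflict on a node that is already running the daemons:

# On a node already running a DataNode and TaskTracker, the default
# Hadoop 1.x ports are taken, so a second instance cannot bind them:
netstat -tln | grep -E ':(50010|50020|50060|50075)'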

A more abstract answer is that assigning multiple ppn per Hadoop node doesn't really make sense because Hadoop is supposed to tackle data-intensive, not compute-intensive, jobs. Thus, even though you may have 16 cores per node on Gordon, the bottleneck is the disk I/O. On each Gordon node, the SSD I/O is all shoved through a single QDR 4X InfiniBand link, so it doesn't make sense to run more than one Hadoop node per compute node unless each SSD has its own link.
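To put rough numbers on that (the InfiniBand figures are standard; the per-task share is just division):

QDR 4X InfiniBand: 4 lanes x 10 Gbit/s signaling x 8b/10b encoding = 32 Gbit/s, or about 4 GB/s peak
Shared by 16 concurrent tasks: 4 GB/s / 16 = roughly 250 MB/s per task at best,
and that is before HDFS replication or shuffle traffic takes its share.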

That said, you can assign multiple tasks to a single physical compute node when you launch your Hadoop job, and doing so often improves performance. For example, I have a Hadoop mapper that parses a VCF file containing genetic information, and it performs best when I assign two map tasks per physical Hadoop node (e.g., on an 8-node Xeon X5550 Hadoop cluster on FutureGrid Hotel):

[image: https://f.cloud.github.com/assets/4400938/801146/f50855ec-ed98-11e2-9b44-a40759eba21e.png]
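How many of those tasks run on one node at a time is capped by the TaskTracker's slot counts, which are read from mapred-site.xml when the daemon starts (Hadoop 1.x defaults to 2 map and 2 reduce slots per node). A sketch of the relevant settings:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>  <!-- concurrent map slots per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- concurrent reduce slots per TaskTracker -->
</property>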

zonca commented 11 years ago

Thanks, this is very interesting. How do you configure Hadoop to run multiple tasks per node? In this configuration, does Hadoop use the other CPUs on the same node?


glennklockwood commented 11 years ago

You can set a "suggested" number of map and reduce tasks per job using the mapred.map.tasks and mapred.reduce.tasks parameters, but these values are only hints: the framework may override them depending on how the input data gets split (see the Hadoop documentation on choosing the number of map and reduce tasks for more detail).

My VCF mapper (no reduce step) job submission looks something like this:

# Use the streaming jar that ships with Hadoop 1.0.3
HADOOP_STREAMING_JAR=/opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar

# Map-only streaming job: zero reducers, with a hint of 8 map tasks total;
# -cmdenv forwards environment variables into each task's environment.
hadoop jar $HADOOP_STREAMING_JAR \
    -D mapred.reduce.tasks=0 \
    -D mapred.map.tasks=8 \
    -mapper "$HOME/sds141/python2.7/bin/python $PWD/mapper.py" \
    -input data/test-big.vcf \
    -output data/output \
    -cmdenv PYTHON_HOME=$PYTHON_HOME \
    -cmdenv LD_LIBRARY_PATH=$LD_LIBRARY_PATH

You can confirm the number of mappers and reducers with hadoop job -status <jobid>.
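For example (the job ID below is just a placeholder):

# List jobs to find the ID, then query it; -status prints the job counters,
# and the "Launched map tasks" / "Launched reduce tasks" counters show what actually ran.
hadoop job -list
hadoop job -status job_201307150000_0001 | grep -i launched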