Chapter 5. Deploying Hadoop

copyright by edX PageLinuxFoundationX: LFS103x Introduction to Apache Hadoop

Deploying Hadoop

Learning Objectives

In this chapter you will learn about the following:

The software prerequisites for deploying Hadoop and its ecosystem projects
Three main ways to deploy Hadoop and its ecosystem projects:
- Non-distributed deployment mode
- Pseudo-distributed deployment mode
- Fully distributed deployment mode
Basic HDFS commands to work with files and folders
How to run popular parallel data processing frameworks in different deployment modes
Hadoop deployment automation tools:
- Apache Ambari
- Apache Bigtop.

Getting Hands-On with Hadoop

Since the Hadoop ecosystem is predominantly implemented in Java, it offers an extremely flexible range of deployment options. After all, a Java Virtual Machine (JVM) is all it takes to run the Hadoop ecosystem services. This flexibility can be rather daunting for Hadoop newcomers. In fact, this flexibility is exactly what makes the “Hadoop deployment” such a nebulous term and creates a lot of confusion, unless properly clarified.

Here is a rule of thumb that can help you cut through all the confusion with Hadoop deployments: Hadoop 배포와 관련된 모든 혼란을 줄이는데 유용한 경험적 방법

세 가지 Hadoop 배포 모드

Local mode
There are no APIs aside from what is embedded right in your application.
Pseudo-distributed mode The APIs are provided by genuine Hadoop services, but they are all running on the same host.
Distributed mode The APIs are provided by genuine Hadoop services, running either on a cluster or in the cloud.

You need to see if there is a functional Java version 8 available in your environment.

After all, if there are no Hadoop services providing the APIs, how can various parallel data processing frameworks even run or access data? 로컬모드의 경우 API를 제공하는 Hadoop서비스가 없다면 다양한 병렬 데이터 처리 프레임워크가 어떻게 데이터를 실행하거나 데이터에 액세스 할 수 있을까?

Well, since this is supposed to be a hands-on chapter, let's “deploy” Hadoop in local mode, see how it works, and then get to the explanation part.

Step 1. Download Hadoop version 2.7.3 from the following URL:

http://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Step 2. Uncompress the Hadoop archive, advertise its location via the HADOOP_HOME variable, and also add it to the PATH:

$ tar –xzf hadoop-2.7.3.tar.gz $ export HADOOP_HOME=pwd/hadoop-2.7.3 $ PATH=$HADOOP_HOME/bin:$PATH

Step 3. Create test data with ZIP codes for a few cities in Silicon Valley that played a major role in Hadoop's history:

$ echo 94304, Palo Alto > /tmp/zip_codes.csv $ echo 94089, Sunnyvale >> /tmp/zip_codes.csv $ echo 94306, Palo Alto >> /tmp/zip_codes.csv

Step 4. Run your first few Hadoop commands working with data in HDFS:

$ hdfs dfs –mkdir input $ hdfs dfs –put /tmp/zip_codes.csv input $ cat input/zip_codes.csv

local mode의 중요한 점은 hdfs즉 하둡 파일 시스템을 그냥 local filesystem을 활용한다는 것이다.

Data Processing Frameworks in Local Mode

By now, you should remember that the processing framework synonymous with Hadoop is MapReduce. It actually comes bundled with Hadoop, so you do not need any separate downloads or installation steps. Hadoop과 동의어로 사용되는 처리 프레임워크는 MR이다. 실제로 MR은 Hadoop과 번들로 제공되므로 별도의 다운로드 또는 설치단계가 필요하지 않는다.
Two other frameworks, Hive and Spark, come as stand-alone packages. Before we start running our first data processing jobs in the local mode, let's make sure we deploy Hive and Spark. Hive와 Spark은 독립 실행형 패키지로 제공된다. 로컬 모드에서 첫 번째 데이터 처리 작업을 시작하기 전에 Hive 및 Spark를 배포해야 한다.

We will start with Spark. The first step is to download the Spark binary package from:

http://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz

$ tar –xzf spark-2.1.0-bin-hadoop2.7.tgz $ export SPARK_HOME=pwd/spark-2.1.0-bin-hadoop2.7 $ PATH=$SPARK_HOME/bin:$PATH

http://archive.apache.org/dist/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz

$ tar –xzf apache-hive-2.1.1-bin.tar.gz $ export HIVE_HOME=pwd/apache-hive-2.1.1-bin $ PATH=$HIVE_HOME/bin:$PATH

Running MapReduce in Local Mode

로컬모드에서는 실행되는 YARN 서비스가 없으므로 이러한 모든 프레임 워크는 단일 자바가상머신에서 모든 처리 작업을 실행하고 로컬파일 시스템에서 오는 데이터를 활용해야 한다.

MR 처리작업을 Hadoop에 제출하려면 Mappers 및 Reducers 구현을 포함하는 jar파일을 만든 다음 Hadoop jar 명령을 사용하여 실제 작업 제출을 해야 한다.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output
$ hadoop fs -getmerge output word_count.txt
$ cat word_count.txt
94089, 1
94304, 1
94306, 1
Alto   2 
Palo   2
Sunnyvale    1

Running Spark in Local Mode

To give you a further taste of the local mode, let's take Spark for a ride:

$ spark-shell
>  sc.textFile("input").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect().foreach(println)
(94304,,1)
(94306,,1)
(94089,,1)
(Alto,2)
(Palo,2)
(Sunnyvale,1)

Working with Spark from the command line happens through the Spark shell. You enter the Scala code into it and you can observe the output of that code running.

Above, we just manually entered our previous implementation of WordCount in Spark, except the last statement was foreach to make sure that we just print the result out instead of putting it someplace into HDFS. 결과값을 HDFS에 넣는 대신 foreach를 사용해서 출력할 수 있다. spark이 이게 가능하기 때문에 MR보다 유연하다.

Running Hive in Local Mode

et's see how Hive can be used in local mode. Just like Spark, it comes with its own command line shell, which, unsurprisingly, is called hive. You start it up, and then, you execute SQL commands inside of it. All of the prompts that start with the hive> prefix are coming from the Hive shell. But, let's start the shell first, like this:

$ hive
hive> CREATE EXTERNAL TABLE zips (zip int, city String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION ’/tutorial/input';
hive> select city, count(*) from zips group by city;
Palo Alto 2
Sunnyvale 1

Hive can process a lot of file formats as though they were tables, even if the data comes from existing files. This makes the CREATE verb in our first command a little misleading, since we are not really creating anything, but mapping between files that reside in the /tutorial/input folder and a table called zips (this is the LOCATION ’/tutorial/input' part). All we are promising Hive is that any file in than folder will have one integer value (zip) followed by a string value (city), and that those values will be separated by a coma (this the ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ part).

NOTE: If you are actively experimenting with Hive and you are not careful, you may run into an unfortunate situation where you see the following error message from Hive:

Hive is a SQL interface to Hadoop.

$ hive ... metstore-db has an incompatible format with the current version of the software

The easiest way to recover from this situation is to recreate Hive's metastore by running the following command in your terminal:

$ rm -rf metastore_db derby.log $ $HIVE_HOME/bin/schematool -initSchema -dbType derby

After this, Hive should start without showing an error message, but also without any tables you may have previously created.

Single Node Hadoop: Pseudo-Distributed Mode

Sadly, it does not really expose you to the HDFS implementation since it uses your local filesystem pretending it is talking to HDFS. 로컬모드의 경우 HDFS와 통신하는 로컬파일 시스템을 사용하기 때문에 실제로 HDFS 구현에 노출되지 않습니다. YARN도 없습니다.

랩톰에서 HDFS 및 YARN과 동일한 Hadoop 서비스를 모두 실행할 수 있습니다. 세계에서 가장 작은 Hadoop 클러스터(노드가 하나뿐인)가 제공된다. 명령행 상호작용 뿐만 아니라 모든 실행은 천 노드 클러스터에서 실행하는 것처럼 동일한 로직에서 동작한다.

This is what is known as the pseudo-distributed mode. It is pretty easy to set up, and the only "gotcha" could be that it would require a reasonably beefy laptop or desktop, since you will be running a lot of Hadoop services that typically run on different nodes in the cluster in your local environment. If you have anything less that 8G of RAM, you may have a tough time, with the whole system feeling quite sluggish.

How do you put Hadoop in the pseudo-distributed mode? The answer is: one service at a time. Let's start with HDFS.

HDFS in Pseudo-Distributed Mode

HDFS는 모든 파일 시스템 메타 데이터를 관리하는 하나의 NameNode 서비스와 각 파일에 대한 실제 데이터 블록을 저장하는 많은 DataNode서비스로 구성됩니다. 동일한 기계에서 NameNode와 단일 DataNode를 실행할 수 있다. In fact, since Hadoop is elastic, it is possible to add and remove DataNodes to the running HDFS, without affecting it too much (obviously, if you remove a DataNode, the blocks that it stores also go away, unless they were replicated to other DataNodes). Hadoop은 탄력적이므로 DataNode를 실행중인 HDFS에 추가 및 제거할 수 있다. We are not going to do this in this tutorial, but it goes to show that we are not really doing anything special when we are running HDFS in the pseudo-distributed mode; we are just running on the world's smallest cluster: a cluster of a single machine!

Before we can run the NameNode and DataNode services, we have to provide a small amount of XML configuration, to tell HDFS to use just a single DataNode for each block (it uses 3 by default) and to tell NameNode what port to answer on. The first goal is accomplished by placing the following XML in $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
     <name>dfs.replication</name>
     <value>1</value>
  </property>
</configuration>

The second goal is accomplished by placing the following XML in $HADOOP_HOME/etc/hadoop/core-site.xml:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

다음 단계는 HDFS 시스템 관리자가 클러스터 크기에 관계없이 파일 시스템을 포맷해야 한다는 것이다. 이것은 개념적으로 새로운 하드 드라이브를 구입할 때하는 것과 비슷하다.

$ hdfs namenode –format

now that we are done formatting our filesystem, let's start a NameNode and a DataNode services, so we can store some data:

$ hdfs namenode > name_node.log 2>&1 &
$ hdfs datanode > data_node.log 2>&1 &

This is it! HDFS is up and running. Now would be a really good time to repeat some of your previous hdfs dfs commands.

YARN and Spark in Pseudo-Distributed Mode

YARN consists of one ResourceManager service coordinating all the cluster compute resources, and a lot of NodeManager services that connect to the ResourceManager, receive computational tasks from it, and execute them on the worker nodes. YARN은 모든 클러스터 컴퓨팅 리소스를 조정하는 하나의 ResourceManager서비스와 ResourceManager에 연결하는 많은 NodeManager서비스로 구성되어 있으며, 각각의 worker노드에서 계산 작업을 수신하고 실행합니다. In order to minimize the network traffic and use data that is stored on the node itself, NodeManagers are typically running on the same servers that DataNodes are running on. 네트워크 비용을 최소화 하고 노드 자체에 저장된 데이터를 사용하기 위해 노드 관리자는 일반적으로 DataNodes가 실행되는 서버와 동일한 서버에서 실행된다. Since in the pseudo-distributed mode there is only one node to work with, we will be running the ResourceManager and NodeManager together, and also, in conjunction with both of the HDFS services. 작업할 노드가 하나뿐이므로 ResourceManager와 NodeManager를 함께 실행하고 두 HDFS 서비스와 함께 실행한다.

Unlike HDFS, strictly speaking, YARN does not require any configuration files to get going. HDFS와 달리 YARN은 구성파일을 필요로 하지 않는다. So, let's start both services with the help of the yarn command line utility (this is a third Hadoop command line utility, after hadoop and hdfs, which we have already seen):

$ yarn resourcemanager > resource_manager.txt 2>&1 &
$ yarn nodemanager > node_manager.txt 2>&1 &

$ YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop spark-shell --master yarn --deploy-mode client

scala> sc.textFile("input").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _). collect().foreach(println)

And there you have your first Spark computation running on a tiny cluster of one node. Interestingly enough, even though YARN itself did not require any configuration files in the pseudo-distributed mode, we still have to tell Spark where to find its configuration files via the YARN_CONF_DIR variable. The rest of the options to the Spark shell are telling it to deploy master using yarn and deploy it in a client mode.

MapReduce in Pseudo-Distributed Mode

Next up on our list of things to try with YARN is MapReduce. At this point, you may be wondering why did we try running Spark ahead of MapReduce. 다음으로 YARN으로 시도할 수 있는 것은 MR이다. Spark를 MapReduce보다 먼저 실행하려고 했던 이유가 궁금할 것이다. After all, MapReduce is the default parallel processing framework for Hadoop. MR은 Hadoop의 기본 병렬 처리 프레임워크이다. While this is true, a variety of historical reasons made MapReduce on YARN require additional configuration.

As you might have guessed, the configuration goes into $HADOOP_HOME/etc/hadoop/yarn-site.xml and looks like:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

This configuration option tells YARN to run an extra service that provides a sorting phase for MapReduce (remember that the output of Mappers needs to be sorted and aggregated before it gets passed to the Reducers). Once again, the reason you have to statically specify this option is mostly historic, and, if anything, it points towards a less than ideal experience developers have with MapReduce, and why Spark is getting so much popularity.

Since we made a configuration change that affects the NodeManagers' behavior, we have to restart it:

$ kill -9 %4
$ yarn nodemanager > node_manager.txt 2>&1 &

Making MapReduce Default to YARN

MR 작업을 실행할 때 마다 YARN에서 실행되도록 기본 설정을 지정해야 한다. MR을 기본으로 YARN의 실행 엔진으로 사용하는 추가 구성 비트를 하나 추가해 보자.

he configuration goes into $HADOOP_HOME/etc/hadoop/mapred-site.xml and looks like:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output

And since, by default, Hive uses MapReduce as its execution engine, running it on YARN is now trivial too. You just run:

$ hive
hive> CREATE EXTERNAL TABLE zips (zip int, city String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION ’/tutorial/input';
hive> select city, count(*) from zips group by city;
hive> quit;

Hadoop on a Cluster: Distributed Mode

You would have to decide where to run NameNode, ResourceManager, and where to run DataNodes and NodeManagers (and, if you have other services that need to run on a cluster, in addition to HDFS and YARN, you'll have to plan where to run those as well!).

The next step will be to install the binaries on all the servers and create configuration files, just like we did for our pseudo-distributed example. Your last step would be to start different parts of HDFS and YARN in the right order, and finally get a fully-distributed Hadoop cluster up and running. Of course, if you want your users to have a good time working with this Hadoop cluster, you will have to monitor it for problems (such as DataNodes running out of space, or whole nodes crashing) and correct those, as needed. This is not complicated, but it is pretty boring and time-consuming. That is why the Hadoop ecosystem has developed two main tools to manage the installation-configuration-monitoring cycle: Apache Ambari and Apache Bigtop (or the A&B of Hadoop management).

Apache Ambari is a web application. It provides an intuitive, easy-to-use Hadoop management web UI, backed by its RESTful APIs. Ambari relies on a centralized Web Application and an agent running on each node in the cluster to provide a full set of orchestration services. Amabri는 중앙 집중식 웹 응용 프로그램 집합을 제공한다. In its simplest case, the only thing you would need to do in order to bootstrap a fully-distributed cluster using Amabri is to provide passwordless ssh (1) access to all the nodes in your cluster. Don’t worry - this does not mean you need to compromise on security: the only entity that will have access to your private key will be the Ambari Web application.

Apache Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Instead of relying on its own agent, Bigtop uses industry standard Puppet tools for the installation-configuration-monitoring cycle.

nowol79 / MOOC