Chapter 3. Core Hadoop Architecture

copyright by edX PageLinuxFoundationX: LFS103x Introduction to Apache Hadoop

Introduction & Learning Objectives

Learning Objectives

The Hadoop Distributed File System(HDFS) and its components : NameNode, DateNode, and Clients
Yet Another Resource Negotiator(YARN) and its components : ResourceManager, NodeManager, and ApplicationManager
Additional YARN and HDFS features: HA, resource request model, schedulers
Topology of Hadoop cluster

What is HDFS?

The data set consists of files, but it is also large(단순 하드 디스크에 담아 낼 수 없는 정도의 대용량 파일)
multiple servers에 data file들을 나눠서 저장해야 한다.
핵심은 여러 서버에 나눠 저장되는 대용량 파일들이 매우 중요한 데이터 set이라는 것과 여러 서버에 저장되는 파일들은 단일 local서버 파일 시스템처럼 읽고, 쓰고가 가능해야 하며 한 서버가 다운이 되어도 이 기능은 유지가 되어야 한다.

HDFS의 핵심 디자인은 바로 두 가지이다.

user-friendly interface (feels like a regular filesystem)
it does that while maintaining a very familiar

This is precisely what HDFS (Hadoop distributed file system) has been designed for, and it does that while maintaining a very familiar, user-friendly interface.

HDFS

Divide files into big blocks and distribute across the cluster
Store multiple replicas of each block for reliability
Programs can ask "where do the pieces of my file live?"
It feels like a regular filesystem

HDFS Components

NameNode (one or two per cluster)

Represents a single filesystem namespace rooted at /
Is the master service of HDFS
Determines and maintains how the chunks of data are distributed across the DataNodes
Actual data never resides here, only metadata (e.g., maps of where blocks are distributed).

DataNode (as many as you want per cluster)

Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes
Default number of replicas on most clusters is 3 (but it can be changed on a per-file basis)
Default block size on most clusters is 128MB.

A default replication factor of 3 for every block is good for two reasons. First of all, this level of replication means you can afford to have one node failing, and still have fault tolerance with two nodes left. Secondly, it allows for the third copy to be placed on a different rack in the datacenter, thus maximizing the chances that at least one copy of the block will survive all major catastrophic events

3번째 복사본은 데이터센터 다른 rack에 저장이 되는 것은 내부적인 구현이 그런가 보네.. 이건 한 번 찾아 봐야 할 듯...

HDFS Architecture

Hadoop cluster you will have HDFS and YARN's daemons running.

worker nodes 1~7번까지 파란색 NodeManager가 바로 YARN이다.

NameNode

Metadata
Namespace
Block map

Once the NameNode is up and running, it makes itself available on the network. and a whole bunch of DateNodes initiate connnections to it. This is very important to remember:

DateNodes connect to a NameNode, and not the other way around.

매우 중요한 디자인이다. NameNode는 가만히 있고 DataNode들이 붙는 형태이다. 이런 디자인은 언제라도 DataNode를 확장하기 용이한 클러스터를 디자인의 핵심이다.

NameNode doesn't really have to care, it just keeps track of what DataNodes are online, and what DataNodes are providing storage services in form of block information.

DataNode 2가 다운이 되면 block 1-2-3에 대한 replication factor가 3->2로 바뀌면서 깨지게 된다. NameNode는 DateNode3에 block 1-2-3를 복사하도록 한다.

위에서 살펴봤듯이 HDFS 상호 동작은 NameNode를 통하고 모든 데이터 전송은 NameNode를 통하는 것처럼 보인다. 이런 동작은 성능면에서 그리 좋은 설계는 아니다. 하나의 NameNode당 수백 개의 DataNode가 이런식으로 동작하지 않을 것이다. HDFS 디자이너는 NameNode가 가능한 빠르게 데이터 전송 비스니스를 벗어날 수 있도록 고안해 냈다. 클라이언트가 필요한 정보를 모두 가져 오거나 DataNode에서 직접 가져 오거나 DataNode에 직접 보낼 수 있도록 했다.

핵심은 NameNode는 모든 메타 데이터 값만 제어를 하고 클라이언트는 NameNode를 통해 파일 메타정보를 전달받아 어떤 DataNode와 통신해야 하는지 파악한 후에 실제 데이터 유통을 클라이언트가 DataNode와 직접하게 된다. 아래 그림을 참고하자.

클라이언트가 NameNode에 파일을 HDFS에 추가하라는 요청을 보낸 다음 기본적으로 NameNode로 응답을 받는다. 그런 다음 클라이언트는 가지고 있는 모든 block들에 대한 block ID값과 해당 블럭이 저장되어야 할 DataNode리스트를 NameNode에 요청한다. 클라이언트는 NameNode로 부타 해당 정보를 얻게되면 이후부터 NameNode는 더이상 할 일이 없다. 이후부터는 클라이언트가 전달받은 목록을 참고해서 데이터를 직접 처리하게 된다. 위 그림에서 DataNode 1에 오는 첫 번째 블럭이 오고 DataNode가 디스크에 쓰기 시작합니다. 그러면 replication pipeline이 시작이 됩니다. DataNode 1번이 2번으로 2번이 3번으로 쓰게 됩니다. 중요한 것은 모든 복제는 첫 번째 DataNode에 의해 수행되므로 NameNode는 어떠한 것도 관여하지 않는다.

HDFS가 사용하는 복제 전략은 상당히 정교하다. HDFS는 디스크 시스템과 네트워크가 모두 다른 시간에 실패한다고 가정하기 위해 설계되었다. 결과적으로 HDFS는 이러한 장애를 자동으로 투명하게 처리하도록 설계되었다. 이를 통해 서로 다른 DataNode에 걸쳐 자동으로 데이터를 복제 할뿐만 아니라 블록이 있는 DataNode에서 데이터 손상 가능성을 고려한다.

아래는 이러한 HDFS의 block 복제 전략을 설명한다.

블록 1를 쓰는 경우 첫 번째 복제본은 클러스터의 다른 랙에 있는 DataNode로 이동한다. 이를 통해 추가 네트워크 트래픽의 단점은 있지만 블록 가용성을 극대화 한다. 서버가 다운되면 같은 정전 또는 이와 유사한 상황으로 인해 동일한 랙 내에서 작동할 확률이 훨씬 높아진다. 이제 블록이 실제로 랙2에 도달하면 세 번째 복제본은 동일한 랙에 있는 서버로 이동하여 쓰기 비용을 최소화 할 수 있다. 블록이 네트워크를 통해 이동하지 않아도 되며 랙의 외부에 이미 블록 사본을 보유하고 있기 때문이다.

HDFS Multi-Tenant Controls

multi-tenant 환경에서 많은 사용자가 공용자원을 함께 사용하기 위해서는 서로 다른 권한레벨이 필요하다. 이를 Hadoop in a multi-tenant environment라고 부른다. It allows data to be protected by security controls, and it allows per-user quotas, so that one user does not eat up all of the storage capacity;

For security
- HDFS supports the notion of users and groups of users. Please note that these are different accounts from what you may have provisioned as Linux or LDAP users on the servers running HDFS services
- HDFS offers classic POSIX filesystem permissions for controlling who can read and write ( e.g., -rwxr-xr--)
- HDFS also offers extended Access Control Lists (ACL) for richer scenarios
- Outside core HDFS, the Apache Ranger HDFS plugin offers centralized authorization policies and audit
For quotas
- HDFS provides easy-to-understand data size quotas
- HDFS also gives you additional options for controlling the number of files.

An Elephant in the Room: Isn't NameNode a SPOF?

In Hadoop, prior to version 2.0, the NameNode was a single point of failure (SPOF).
The entire cluster would become unavailable if the NameNode failed or became unreachable.
Even maintenance events such as software or hardware upgrades on the NameNode machine would result in periods of cluster downtime.

The HDFS NameNode High Availability (HA) feature eliminates the NameNode as a single point of failure. It enables a cluster to run redundant NameNodes in an Active/Standby configuration.

NameNode HA enables fast failover to the Standby NameNode in response to a failure, or a graceful administrator-initiated failover for planned maintenance.

There are two ways of configuring NameNode HA. Using the Web UI is the easiest way.
Manually editing the configuration files and starting or restarting the necessary daemons is also possible. However, manual configuration of the NameNode HA is not compatible with UI-based administration.

NameNode HA 어찌하고 있는지 .. 확인해 보자
RAID설정 이론상으로는 HDFS 상에서는 RAID설정을 미러링, 스프릿도 필요없는 RAID 0로 해도 무난하나 운영상 불편함으로 .. 가령 disk fault시 재설치가 필요하다던가.. 등등.. 이런 이유로 RAID 5를 사용했었다. 하지만, 10G 네트워크 망을 사용함에 따라서 RAID 5는 computing 비용이 커지다 보니 네트워크 가용범위를 전부 소진하지 못하는 병목이 되는 현상이 발생했다. 따라서 RAID 1으로 사용하고 있다.

Heterogeneous Storage

Hadoop은 동종의 스토리지 환경(Homogenous storage landscape)를 대상으로 시작되었다. In fact, part of the success of HDFS was due to how effective it was in pooling massive amounts of commodity hard drives across the cluster, without requiring any expensive hardware or software solutions such as RAIDs. HDFS의 구성 조건들..

DataNode is a single storage unit
Storage medium is uniform (저장매체가 균일하다)
Any other storage medium is not exposed to HDFS clients.

하지만 현실은, 데이터 센터의 스토리지 배포전략은 각 물리적 서버마다 사용 가능한 다른 저장 매체가 있을 수 있으며 파일 시스템이 해당 작업에 가장 적합한 파일 시스템을 선택해야 한다. 궁극적으로 사용자가 이러한 선택 사항을 미세 조정할 수 있게 합니다. 저장소의 종류는 아래와 같을 수 있습니다.

Disk
SSDs
Memory

remote storage options : EBS, SAN, etc.

HDFS Storage Before and Now

HDFS가 이러한 다양한 스토리지 문제를 어떻게 대처해 왔으며 모든 종류의 저장매체를 사용하고 있는지 살펴보자.

The old HDFS architecture assumed that all the storage is locally attached to the DataNode and there is no other way to access it, but to read it from the DataNode itself. The access model was rigid and it looked a lot like:

attach한 storage만 DataNode에서만 접근이 가능하고 읽는게 가능했다.

The new architecture still allows for that mode, but it also allows for essentially a pass-through where the client can read directly from certain types of storage mediums. This essentially allows for heterogenous storage federation data architecture that looks something like this:

This is, of course, a brand new frontier that is still not quite settled by the engineers working on HDFS. There is very active development happening that is sure to enable different Storage Types (RAM_DISK, SSD, DISK, ARCHIVE) and Storage Profiles (HOT, WARM, COLD, All_SSD, One_SSD, LAZY_PERSIST), all in support of seamless analytics.

What Is YARN?

First of all, this process is limited by running just one copy of the algorithm on a single server. 프로세스는 단일서버에서 알고리즘 복사본 하나만 실행하여 제한한다. Second of all, you are making all the required data be transferred over the networks (remember: you do not need all the data, just the top 1000 records). 필요한 모든 데이터를 네트워크를 통해 전송한다.

When Hadoop came onto the scene, it radically improved the scalability of these types of computation by:

Allowing computation to run on each node in the cluster
Using the same nodes for computation that HDFS was using for storing blocks
Making sure that pieces of computation got scheduled on the nodes that hosted blocks required for computation locally, thus avoiding the need to send data over the network.

The mantra of Hadoop has always been: "Don't bring data to compute, bring compute to data", and the part of Hadoop that made this mantra a reality is YARN (Yet Another Resource Negotiator).

Hadoop YARN의 핵심은 바로 "Don't bring data to compute, bring compute to data" !!

YARN Architectural Components

YARN is a distributed resource management and scheduling framework written in Java, running on the same nodes in the cluster that HDFS runs. It is composed of the following daemons:

Resource Manager (one or two per cluster) that provides
- Global resource scheduler
- Hierarchical queues
Node Manager (running next to the DataNode)
- Encapsulates RAM and CPU resources available on a worker node into units called YARN containers
- Manages the lifecycle of YARN containers
- Container resource monitoring
Application Master (created on-demand)
- Manages application scheduling and task execution
- Typically, specific to a higher-level framework (e.g. MapReduce Application Master).

you are running just one or two Resource Managers per cluster is your tolerance for YARN outages.

one Resource Manager - if it goes down, your cluster can no longer accept new jobs and provide some of the YARN capabilities for managing jobs that are running (even though the existing jobs running via Node Managers will be fine). two Resource Managers running in an Active/Standby configuration, the Standby can take over in cases where the Active one fails or needs to be brought down for maintenance.

The interaction of all these parts of YARN will look something like this, and we will spend the rest of this chapter explaining the internals of its architecture:

YARN Cluster Architecture

The NodeManager reacts to any valid request like that, by allocating the required amounts of CPU and RAM capacity, and spinning off a YARN container, that can now use that much CPU and RAM for running user-submitted code. This container should not be confused with Linux containers.

The first piece of user-submitted code that starts running inside of a freshly minted YARN container is an ApplicationMaster.

All requests for resource and containers must be granted and executed by the ResourceManager and NodeManager components.
Once the ApplicationMaster is set up and running, it is responsible for negotiating appropriate resources in the form of containers from the ResourceManager, working with NodeManagers to execute and monitor containers and their resource consumption, determining the resource allocation per container, and finally, providing fault tolerance for applications.

ApplicationMaster의 역할이 정확하게 뭔지 모르겠다...

Resource Requests: The Lifeblood of YARN

ResourceManager acts as a traffic cop directing Resource Requests where they need to go and making sure that the cluster, like the city.
YARN에서의 ResourceManager는 복잡한 도시의 교통경찰 역할을 한다.
In order for us to appreciate the mechanics of YARN resources scheduling, let's start by looking at the structure of a ResourceRequest:

priority
resourceName
capability
numContainers.

각 ResourceRequest 핵심은 YARN의 ResourceManager가 이해할 수 있는 정교한 리소스 예약 프로그램이라고 할 수 있다. 각 요청은 기능을 지정하여 특정 양의 자원(memory, cpu.. etc)을 요청할 수 있다. 해당 자원을 클러스터의 특정 서버 서브 세트의 응용 프로그램에 제공해야 하는 경우 resourceName을 지정하여 수행할 수 있다.

pecifying capability and resourceName tells YARN where to put the containers. How many of them to allocate is specified via the numContainers field of the Resource Request. Finally, an overall priority of each request is given as well, priority.

batch of 2 resource requests
batch is requesting a total of 4 containers
We are asking YARN to give us 3 containers of 2GB of RAM and one CPU core with a priority 0.
Among those 3 containers, we are requesting that one be placed exactly on host01, one be placed anywhere in rack0 and one be placed anywhere in a datacenter.
our 4th container is requested with priority 1 and with 4GB of RAM and one CPU core.

위 요청을 ResourceManager에 의해서 거부될 수 있다. (for example, if there is no memory or cores left on host01) 이럴 경우 Resource Requests는 실패한 것으로 간주되고 이를 복구하는 것은 ApplicationMaster가 책임진다.

Shared Infrastructure Challenges

realistic scenarios, even a single user will be running different workloads at the same time, and, on top of that, there could be different users accessing the cluster at the same time.
This is known as a shared, multi-tenant Hadoop cluster and YARN provides excellent capabilities for resource isolation, resource management and security. In short, YARN guarantees that different workloads will not be stepping on each other's toes.

Multi-Tenancy with Capacity Scheduler

멀티 테넌트 환경에서 리소스 요청이 승인되었다는 것은 무슨 의미일까?
YARN 관점에서 보면 ResourceManager 구성 요소인 Scheduler가 제약 조건에 부합한다는 것이고 클러스터에 리소스를 할당하기로 합의가 되었다는 의미이다.

There are at least two popular options for YARN scheduling:

CapacityScheduler
FairScheduler

The CapacityScheduler was created to address a perennial enterprise problem.
Sharing clusters across organizations necessitates strong support for multi-tenancy, since each organization must be guaranteed capacity and safeguards to ensure the shared cluster is impervious to single rogue application or user or combination thereof.
CapacityScheduler provides a stringent set of limits to ensure that a single application or user cannot consume a disproportionate amount of resources in the cluster. Also, the CapacityScheduler provides limits on initialized/pending applications from a single user to ensure fairness and stability of the cluster.

The primary abstraction provided by the CapacityScheduler is the concept of queues. These queues are typically set up by administrators to reflect the economics of the shared cluster. Then, cluster administrators, working with stake holders in different organizations of a single enterprise, allocate a range of available resources for each queue.

Mapping Resource Queues to Organizations

The hierarchy of queues you may want to create to reflect that organizational structure may look something like this:

Engineering : Dev, QA
Support : Training, Service
Marketing : Sales, Ad

각 하위 큐는 상위 큐에 연결된다. 최상위 그룹은 조직에 속하지 않고 임시 "root" 큐에 연결된다.

One important thing to note here is that Support, Engineering, and Marketing are what is known as Parent queues. Parent queues enable the management of resources across organizations and sub-organizations. They can contain more parent queues or leaf queues.

How Queues Work

All users of the cluster (regardless of what organization they belong to) submit jobs into the Leaf queues.

The root queue understands how the cluster capacity needs to be distributed among the first level of parent queues, and invokes scheduling on each of its child queues.
Every parent queue applies its capacity constraints to all of its child queues. 모든 상위 큐는 용량 제한을 모든 하위 큐에 적용한다.
Leaf queues hold the list of active applications (potentially from multiple users) and schedule resources in a FIFO (first-in, first-out) manner, while at the same time adhering to capacity limits specified for individual users.
Queues own a fraction of the capacity of the cluster, and this specified queue capacity can be fulfilled from any number of nodes in a dynamic fashion. 대기열은 클러스터의 용량의 일부를 소유하며 지정된 대기열 용량은 동적인 방식으로 임의의 수의 노드에서 수행된다.

Policy-Based Use of Cluster Resources

Scheduler Queues
- Capacity Scheduler allows for multiple tenants to share resources.
- Queues limit access to resources.
- Each queue has ACLs, Access Control Lists, associated with users and groups.
- Capacity guarantees can be set to provide minimum resource allocations, and finally, soft and hard limits can be placed on queues.
- soft and hard limits can be placed on queues.

이러한 큐 관리가 어려운 것처럼 보이지만 절망하지 말자. Apache Ambari를 통해 큐 관리를 쉽게 할 수 있다.

ResourceManager High Availability

Before Hadoop 2.4, the ResourceManager actually was a single point of failure.
The YARN ResourceManager High Availability (HA) feature eliminates this issue. It enables a cluster to run one or more ResourceManagers in an Active/Standby configuration.

ResourceManager HA:

Uses a redundant ResourceManager
Is configured in an Active/Standby configuration
Enables fast failover in response to ResourceManager failure
Permits administrator-initiated failover for maintenance
Is typically configured by Apache Ambari, but can also be configured by hand.

nowol79 / MOOC