Chapter 4. Parallel Processing

copyright by edX PageLinuxFoundationX: LFS103x Introduction to Apache Hadoop

Parallel Processing

This is the chapter that gives you a taste of the rest of the Hadoop ecosystem. 이번장은 하둡 에코시스템의 나머지들을 소개한다. Now, a lot of these ecosystem projects are providing unique parallel data processing capabilities on the data stored in HDFS. 이러한 에코 프로젝트들의 대부분은 HDFS상에 저장된 데이터들을 병렬로 데이터 처리하는것을 제공한다.

But, what makes a framework scalable? 하둡 프레임워크가 확장 가능한 이유가 뭔가? What makes it parallel? 왜 병렬로 만들었는가? How fast can it go? 어떻게 빠르게 처리하는가?

I will try to answer all of these questions, and give you a little bit of a theoretical background, 모든 질문에 대한 답을 찾을 것이고 이론적인 배경도 설명할 것이다. so you can evaluate the top 3 ones that we are going to review in this chapter: MapReduce, Spark, and Hive. 이번장에서 살펴볼 MapReduce, Spark, Hive에 대해서 당신은 평가를 할 수 있게 될 것이다.

MapReduce is the granddaddy of it all. MR은 다른 것들의 초기 버젼이며.. It is the parallel processing framework that started with Hadoop, and, even to this day, it's being developed as part of the Hadoopproject itself. 하둡의 병렬 데이터 처리는 MR로 부터 시작되었고 오늘날까지 많이 사용된다.

Spark came out to fix some of the limitations and shortcomings of the original MapReduce model, and, with its user-friendly API, and lightning-fast in-memory data processing, no wonder it became very popular with data scientists and business intelligence people alike. Spark은 MR의 한계와 결점을 해결하기 위해 나왔으며, 사용자에게 친숙한 API와 in-memory 데이터 프로세싱으로 최근 데이터과학자 및 비즈니스 인텔리전스들에게 인기가 높다. This actually may be the one you would want to pay the most attention to. 이 부분을 당신이 가장 주목하고 있는 것일 수 있다.

Now, Hive - Hive is something completely different. 하이브는 완전 다른 것이다. Hive was the project that put SQL back into Hadoop, allowing a SQL-driven interface to unstructure data. Hive는 SQL을 Hadoop에 다시 추가하는 프로젝트로, unstructure data(No SQL)를 SQL 기반 인터페이스를 통해 가져올 수 있도록 한다. It also allowed Hadoop to compete with some of the juggernauts of the data management industry, the enterprise data warehousing companies. 이를 통해 Hadoop은 엔터프라이즈 데이터웨어 하우징 회사 인 데이터 관리 업계의 거물과 경쟁 할 수있었습니다.

In this chapter you will learn about the following:

Basic principles of parallel data processing and what role Hadoop’s HDFS and YARN play in enabling different parallel data processing frameworks on the same cluster
병렬 데이터 처리의 기본 원칙과 Hadoop의 HDFS 및 YARN이 동일한 클러스터에서 다양한 병렬 데이터 처리 프레임 워크를 구현할 때 수행하는 역할

Key parallel data processing frameworks available on Hadoop:
- MapReduce parallel processing framework and its central Key Value Pair (KVP) paradigm
- Apache Spark in-memory parallel processing framework, its Spark Core RDD engine and various extensions making Spark a successor to MapReduce
- Apache Hive SQL-based parallel processing framework, how it builds on top of MapReduce and provides Enterprise Data Warehouse capabilities for Hadoop
  Enterprise Data Warehouse capabilities가 정확하게 어떤 기능일까?

-WordCount example and its implementation in MapReduce and Spark

Performance and usability considerations of various parallel processing frameworks.

Hadoop: A Parallel Computing Environment

So far, we have seen how Hadoop’s HDFS can be used to reliably and cheaply store vast amounts of data on a cluster of commodity hardware. 지금까지는 하둡 HDFS를 통해 대량 하드웨어 클러스터에 방대한 양의 데이터를 안정적으로 저렴하게 저장할 수 있는 방법을 살펴 보았다. This is a pretty neat trick, but it does not really set Hadoop apart from any traditional distributed filesystem. 이것은 매우 정교하지만 Hadoop을 기존의 분산 파일 시스템과 차별화하지는 않는다. What does the combination of HDFS and YARN do? HDFS와 YARN의 조합은 어떤 의미일까?

The combination of HDFS and YARN turns Hadoop from a storage platform into a parallel computing environment.

HDFS와 YARN의 결합으로 Hadoop은 스토리지 플랫폼에서 병렬 컴퓨팅 환경으로 전환되었다. 이것이 Hadoop이 기존 분산 파일 시스템과 차별화 할 수 있는 강점이다.

While YARN allows you to parallelize the execution of your workload, and even take advantage of data locality in HDFS (something that we will look into in greater detail later on), it still does not answer the question of how to implement those parallel processing algorithms. YARN은 작업 부하의 실행을 병렬화하고 HDFS에서 데이터 지역성을 활용할 수도 있지만 이러한 병렬 처리 알고리즘을 구현하는 방법에 대한 질문에는 대답하지 않는다. Being a general purpose framework, it leaves this to "YARN applications". 범용 프레임 워크이기 때문에 이것은 "YARN 어플리케이션"에 위임한다.

In this chapter, we will look into 3 of the most popular YARN applications that implement different kinds of parallel data processing. 이번 장에서는 다양한 종류의 병렬 데이터 처리를 구현한 가장 인기있는 YARN 응용프로그램 3가지를 알아보자. We will start from the oldest processing framework that defined Hadoop itself, MapReduce, and we will get all the way to SQL-driven processing that is more akin to the traditional Enterprise Data Warehouse MPP (Massively Parallel Processing) approach. 우리는 Hadoop 자체, MapReduce를 정의한 가장 오래된 처리 프레임 워크에서 시작하여 전통적인 Enterprise Data Warehouse MPP(Massively Parallel Processing) 접근 방식에 가까운 SQL 기반 처리에 대해 자세히 설명할 것이다. Before we do that, let us spend a few moments trying to understand what do all these processing frameworks have in common and what are the principles driving their design. 진행하기 앞서 모든 프로세싱 프레임 워크가 가지고 있는 공통점과 설계를 주도하는 원리가 무엇인지 생각해 보자.

A "Data Parallel" Approach

One common theme that you will find among most of the YARN-based parallel processing frameworks on Hadoop is the data parallel approach to problem solving. Hadoop의 YARN 기반 병렬 처리 프레임 워크에서 대부분 발견 할 수 있는 공통된 주제는 문제 해결에 대한 데이터 병렬 접근 방식이다. "What is the data parallel approach?" you may ask. 데이터 병렬 접근 방식이 무엇인가? Well, let me explain it to you, using a very simple analogy. 아주 간단한 비유를 사용하여 설명해 보자. Suppose you were given a task of going through a stack of quarters and counting every single one of those that has an even number printed on it. 쿼터 스택을 가지고 짝수가 인쇄 된 모든 것을 하나씩 계산하는 작업을 했다고 가정하자.

So, for example, with a year 1932 would count as one, since 1932 is an even number. 1932년은 짝수이기 때문에 1로 계산된다. This is a pretty easy task to do with something like a swear jar. 이것은 swear jar 같은 것을 사용해도 쉽게 해결된다. But, suppose you were working at a bank, with millions upon millions of coins to go through. Then, the pile of quarters in front of you may look something like this. 하지만 수백만 달러의 동전을 가지고 은행에서 일하고 있다면 ... 동전 더미는 아래와 같을 것이다.

Big enough to make you depressed just thinking about all of the sleepless nights ahead of you. But then, a thought crosses your mind. You remember all your friends who can help with this task and you invite them all over. 혼자 하기 버거운 작업을 도울 친구들을 소환한다. Then, what you do is, you divide the original pile into smaller chunks, you give a chunk to every single one of your friends, and you ask them to perform the original task that was assigned to you, which means go through their pile of quarters, and count the ones that have an even number printed on them. 그리고 원래 파일을 작은 덩어리로 나누고, 친구 한 명 한 명에게 청크를 주고 자신에게 할당 된 원래 작업을 수행하도록 요청한다. Then, your friends start doing their work and reporting results back to you. So, one of your friends may say "I've got forty", another one could say "I've got only two", and yet another one could say "I've got nothing". What you do then is, you just add all of those numbers up, and you arrive at the final answer, which actually happens to be an answer to the original task that was assigned to you. So, this quarter counting is an example of a problem that has an excellent data parallel solution, or, in other words, it has a solution that can get faster simply by you inviting more friends and giving them all work to do. 이는 데이터 병렬 솔루션을 제공하는 한 예이다. 더 많은 친구를 초대하고 모든 일을 할 수 있게하면 더 빠르고 간단하게 해결할 수 있는 솔루션을 제공한다.

In Hadoop's world, your friends would correspond to the parallel processing task managed by YARN, and the pile of quarters would correspond to blocks in HDFS. 하둡에서 친구들은 YARN이 관리하는 병렬 처리 작업에 해당하며, 친구들에게 할당된 더미 조각들은 HDFS의 블록에 해당한다. Any problem that has a data parallel solution can thus be solved as fast as you want, provided that you add more workers to the pool. 데이터 병렬 솔루션이 있는 모든 문제는 원하는 만큼 빠르게 해결할 수 있다.단 풀에 더 많은 인력을 추가해야 한다.

But, still, there are limits. So, let's talk about those. 하지만 여전히 한계가 있다. One of these limits is based on how finally you can slice the data. 이러한 한계 중 하나는 데이터를 어떻게 분할할 수 있는지이다. After all, if you have 1 million coins, and you invite 2 million friends, then, at least 1 million of your friends would have nothing to do. 1백만 동전이 있고 200만 명의 친구를 초대하면 적어도 1백만 명의 친구가 아무것도 할 수 없게 된다.

A slightly less obvious limit has to do with the fact that, even if you have more coins than friends, the total amount of time you spend on your task will be limited by how long it takes for your slowest friend to process his or her pile. 분명한 한계는 친구보다 동전이 많더라도 작업에 소비하는 총 시간이 가장 느린 친구가 처리를 완료하는데 걸리는 시간에 종속된다. So, again, there's a problem. The final limit is actually the least obvious of them all, and it has to do with how much of your overall task can actually be paralyzed, because, remember, we assume that the initial pile was a neatly stacked set of quarters, and you didn't have to do any upfront investment, like move them from one place to the other, or maybe sort them all out. 또 하나 초기 더미가 깔끔하게 정리된 것으로 가정했지만 .. 미리 정렬하고 옮기는 등의 선행 투자가 필요할 수 있다.

If you have to do anything like that, and that part of your job cannot really be paralyzed, then you are only benefiting from the part that can be, which is the quarter counting itself. The practical effects of this last limit are so vast and so important, that they actually have a special name in computer science. This limit is called Amdahl's law. 이러한 한도를 암달의 법이라고 한다. It is named after computer scientist Gene Amdahl, and it's probably one of the most significant theoretical laws governing limits of every parallel processing system. 모든 병렬 처리 시스템의 한계를 결정하는 가장 중요한 이론적 법칙 중 하나일 것이다. 암달의 법칙

Here is what Gene himself had to say about it. So, let me quote: "For over a decade, prophets have voiced the contention that the organization of a single computer has reached its limits, and the truly significant advances can only be made by interconnection of multiple computers, in such a manner as to permit co-operative solution ...

The nature of the overhead (in parallelism) appears to be sequential though, so that it is unlikely to be amenable to parallel processing techniques.오 오버 헤드의 성격(병렬성에서)은 순차적 인 것처럼 보이지만, 병렬 처리 기술을 적용하기는 쉽지 않다.

Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor ... At any point in time, it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome".

Let's unpack this statement. The formula itself actually does little justice to show how scary the consequences of the Amdahl's law really are.

So, let's plot this on a graph. On the X axis, we will have a number of workers that we can throw at the problem. On the Y axis, we'll plot the number of times we get to go faster if we throw that many workers at the problem. What we are plotting here are different lines corresponding to how much parallelism your overall task has. So, the green line, for example, at the very top, corresponds to your algorithm, or overall task, having a whopping 95 parallelism rate.

The blue line, at the very bottom, corresponds to a more modest, just 50% of the overall task is parallelizable. What you can see then is an amazing effect. Basically, even if you have an algorithm that can be 95% parallelizable, which means it can be executed in 95% of its entirety in a perfect parallel fashion, then you only get a speed up of 20, no matter how many workers you throw at the problem. 기본적으로 95%의 병렬 처리가 가능한 알고리즘이 있어도 완벽한 병렬 방식으로 전체의 95%에서 실행될 수 있다는 것을 의미한다. 많은 작업자를 투입해도 20배의 속도만 얻을 수 있다. 20 becomes your asymptotic limit of a speed-up. 20은 속도 향상의 점근선 한계가 된다. Always remember this when thinking about theoretical limits of various data parallel processing frameworks, Hadoop or not. Hadoop과 같은 다양한 데이터 병렬 처리 프레임 워크의 이론적인 한계를 생각할 때 항상 이것을 기억하자.

What Is MapReduce?

Now that we have spent quite a bit of time talking about how important YARN is to Hadoop, let us tell you a little secret: Hadoop did not begin with YARN. Hadoop began as HDFS and a very customized implementation of the first parallel processing framework, MapReduce.

That initial implementation was eventually refactored into the distributed management and scheduling part (YARN), and the processing part is still called MapReduce. That clean separation was introduced as part of the Hadoop 2 effort. 초기 MR 구현은 결국 분산 관리 및 스케줄링 부분(YARN)으로 리팩토링 되었으며, 처리 부분은 여전히 MapReduce라고 불린다.

This makes MapReduce the oldest parallel processing framework available on Hadoop. It is very simple to understand. MR은 Hadoop에서 사용할 수 있는 가장 오래된 병렬 처리 프레임 워크이다.

What Do Mappers and Reducers Do?

Mappers are the processes doing parallel computation of some sorts, just like your friends were counting quarters. Each Mapper produces a value or a stream of values that is then given to a Reducer (or Reducers), so that a Reducer can arrive at a final answer.

What Is A Mapper?

The Mapper reads data in the form of key-value pairs (KVPs), and it outputs zero or more KVPs. Mapper는 key-value 형태로 데이터를 읽고 0개 이상의 KVP를 출력한다. The Mapper may use or completely ignore the input key. Mapper는 입력키를 사용하거나 완전히 무시할 수 있다. For example, a standard pattern is to read a line of a file at a time: the key is the byte offset into the file at which the line starts; the value is the content of the line itself. Typically, the key is considered irrelevant with this pattern. 예를 들어, 표준 패턴은 한 번에 파일의 행을 읽는 것이다. key는 행이 시작되는 파일의 바이트 오프셋이다. 값은 행 자체의 내용이다.

If the Mapper writes anything out, it must do so in the form of KVPs. Mapper가 아무것도 쓰지 않는다면, KVP의 형태로 있게 된다. This stream of KVPs from a Mapper (intermediate data) is not stored in HDFS, but rather consumed over the network by one or more Reducers. Mapper의 KVP 스트림은 HDFS에 저장되지 않고 네트워크로 흘려서 하나 이상의 Reducers로 전달된다. The local attached storage on the nodes running Mappers is used to buffer data before it is being consumed by Reducers. Mappers를 실행하는 노드의 로컬 연결 스토리지는 Reducers가 데이터를 사용하지 전에 데이터를 버퍼링하는 데 사용된다.

MapReduce operates on a set of files (typically coming from HDFS) given to it as an input. MapReduce는 입력으로 주어진 일련의 파일(일반적으로 HDFS에서 제공)에서 작동한다. Based on the total number of blocks in that input data set, the MapReduce framework will calculate what is known as splits. MR 프레인 워크는 입력 데이터의 세트의 총 블록 수를 기반으로 스플릿이라고 알려진 것을 계산한다. Splits are contiguous slices of data given to a single Mapper to work on. Thus, the number of Mappers launched for a given MapReduce job equals the number of splits in the input data set. Splits는 하나의 Mapper가 작업할 인접한 데이터 조각이다. 따라서 주어진 MapReduce작업을 위해 시작된 매퍼의 수는 입력 데이터 세트의 스플릿 수와 같다.

The size of each split is typically the same as the HDFS block size. If you combine that with the "one Mapper per split" rule, you can see how MapReduce can ask YARN's scheduler to make sure that each Mapper is launched on the worker node where the block is physically stored. In fact, given that blocks are typically triple-replicated, there may be at least three worker nodes in a cluster where a Mapper for each split can be launched. 블럭이 실제로 3중 복제된다는 점을 감안할 때 각 분할에 대한 Mapper를 시작할 수 있는 클러스터에는 적어도 세 개의 작업자 노드가 있을 수 있다.

Keep in mind, though, that, while MapReduce tries very hard to make sure the processing happens on blocks stored on the same worker node where the Mapper is running, it is ok if a Mapper needs data that isn't locally available. MapReduce는 Mapper가 실행중인 동일한 작업자 노드에 저장된 블록에서 처리가 이루어 지도록 매우 열심히 노력하지만, Mapper가 로컬에서 사용할 수 없는 데이터를 필요로 하는 경우도 문제가 없습니다.

HDFS, being a distributed filesystem, makes that happen transparently to the Mapper. The only downside is the increased processing time, since the block has to travel over the network. 분산파일 시스템인 HDFS는 Mapper를 투명하게 동작시킵니다. (사용자는 블록이 로컬에 있는지 원격에 있는지 모릅니다.)하지만 블록이 네트워크를 통해 이동해야 하기 때문에 처리 시간이 증가한다는 단점은 있습니다.

Finally, even though the name of the framework is MapReduce, Mappers are the only mandatory part. 프레임워크 이림이 MapReduce이지만 Mappers가 유일한 필수 부분이다. To put it differently: it is very much possible to have a MapReduce job that only has Mappers and doesn't do any shuffling or use Reducers. This type of job is called a Map-only job, and is typically used to run long-lasting, distributed services on Hadoop clusters that have nothing to do with data processing or employ a data processing model that simply can't be expressed naturally in a MapReduce paradigm.

What Is A Reducer?

After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list. Map 단계가 끝나면 주어진 중간 키의 모든 값이 하나의 목록으로 결합된다. This list is given to a Reducer. 목록은 Reducer로 전달된다. We may have a single Reducer, or multiple Reducers. 하나 혹은 여러개의 Reducer가 있을 수 있다. Unlike the number of Mappers (which is based on the number of input splits), the number of Reducers is governed by a configuration option, and is 1 by default. Mapper수와 달리 Reducer수는 configuration option에 달려 있다. 기본값인 1이다. One thing to keep in mind here is that the number of Reducers will determine the number of the final output files for the MapReduce job. Reducer의 수가 MapReduce작업의 최종 출력 파일 수를 결정한다.

That is, if you have 10 Reducers, you will have 10 files in the output directory, each of which contain part of your results. Those 10 files will have to be concatenated in order to like an output of a single Reducer. In fact, this is such a common operation, that HDFS provides a command line option as a short hand to concatenate files in HDFS into the local destination file. We will use this option in Chapter 5.

All values associated with a particular intermediate key are guaranteed to go to the same Reducers. The intermediate keys, as well as their value lists, are passed in a sorted order. 특정 중간 key와 관련된 모든 값은 동일한 Reducer로 전달된다. 중간 값 Key-Value 목록은 정렬된 순서로 전달된다.

The invisible MapReduce phase that makes sure the Reducer gets the keys in a sorted order is called a Shuffle phase. Reducer가 정렬 된 순서로 Key를 가져오는 보이지 않는 MapReduce 단계를 Suffle 이라고 한다. The Reducer outputs zero or more KVPs. Reducer는 0개 이상의 KVP출력한다. KVP는 HDFS에 쓰인다. These KVPs are written to HDFS. In practice, the Reducer often emits a single KVP for each input key.

What Real Work Can Be Done with Mappers and Reducers?

Mappers and Reducers are all there is to a MapReduce framework. It may even seem surprising that you can do a lot of real world data analysis based on these two building blocks alone.

In order to demonstrate how this is done in practice, and also introduce you to the Java API of the MapReduce framework, let us approach the following problem: given a large body of human-readable text stored in HDFS, produce the list of words with the number of times each word was seen in all of the files. Simply put, let's count the words!

By the way: this task is to data processing what printing "Hello World" is to computer languages. In short, the code you're about to see is the "Hello World" of data processing frameworks. You should get used to it!

A WordCount Mapper and Reducer

아래는 Mapper에 대한 pseudo-code이다. 텍스트에서 string를 읽어서 word는 key로 값을 1로 셋팅한다.

Reducers는 mappers에서 생성한 key:value 조합을 가져와 merge하게 된다.

이로써 우리는 hdfs에 흩어져 있는 파일에서 단어가 얼마나 사용되는지 카운팅할 수 있다.

MapReduce

Let's now take a step-by-step look at what's happening with two lines of text right in the middle of a file being processed by MapReduce.

One of our mappers, and remember, there's a lot of them running in parallel, will be given two lines of text in the form of two key value pairs, the key being the byte offset in the file, and a value just the literal string at that offset. So, in our case, at offset 8675, there is a line that reads "I will not eat green eggs and ham", and the next line, at offset 8709, reads "I will not eat them Sam I am" Then, our mapper will look at both of these lines, throw away the initial keys, and produce the following stream of key value pairs, starting with the first pair being "I", just a letter "I", 1; then, the next pair being "will" and 1, and then, the next pair being "not" and the number 1, and so forth.

Note that no attempt is being made to optimize within a single record.

We're not really counting "I"s, even if we had a few of them in the single line. That is actually a job for a combiner - that is an extension to a pure MapReduce model available in Hadoop. But, let's go back to what happens next. Next up is the silent, user-invisible phase of MapReduce, where all the keys are sorted and values corresponding to the same key are combined together. In fact, if you're wondering what makes a simple paradigm like MapReduce so powerful, it is the secret sorting phase. and it is called... also called a "shuffle" phase.

Given how central sorting is to MapReduce, it comes as no surprise to know that Hadoop clusters have consistently set world records on how quickly they can sort large amounts of data. But ok, let's go back to our example. After the shuffle comes a "reduce" phase. Each reducer is getting the result of the sorting phase; in our case, the key "I" has a vector of 3 values. All these values are 1s, because we only produced 1s in the previous phase. The key for "eat", we'll have a vector of 2 values.

Same values of one, and everything else will have a vector, which is a single value. Again, value of 1. Once again, notice that the way we are constructing this, is by storing all the outputs of every single mapper and combining values corresponding to the same key into those vectors. This means that, even if you had three separate mappers producing the key "I", all of the values corresponding to that key, even coming from different machines in the cluster, will be aggregated into a single vector, and given to a reducer.

Then, all that the reducer does is sums up the keys and outputs a brand-new set of key value pairs, with each key being the same as the initial key, in our case, the word itself, and the value being the sum of all the values in the vector, or, in our case, the number of times the word was seen in a text. Pretty neat, huh? Let's actually now go through the end-to-end example of how the Map Reduce runs on a YARN-based cluster.

The client submits a MapReduce job and points at a file sitting in HDFS.
The NameNode can tell the client a map of all the data blocks in that file, and also calculate splits, basically chunks of data that will be given to each mapper to work on.
Then, YARN spins all those mappers up and, the first phase begins: the mapper reads the file blocks from HDFS, line-by-line. So, the lines would read like "I will not eat green eggs and ham". The lines of text are split into words, and output to the next phase, at which point we are basically dealing with those key value pairs you've already seen.
Then, the shuffle phase aggregates all of the values for this same key, and that is what is being given to our reducer.
Then, the reducer kicks in and all it has to do is to basically sum up all the ones corresponding to every single key. Arrived at the final result, and the last step is just simply writing that final result to HDFS. So, we started with the file in HDFS, and we ended with a different file in HDFS. That is it. That is how the MapReduce framework is implemented on Hadoop clusters.

Shortcomings of MapReduce

MR의 단점

The basic concepts behind MapReduce were introduced by Google's paper MapReduce: Simplified Data Processing on Large Clusters more than 10 years ago. While parallel data processing algorithms implemented using MapReduce frameworks are still widely used even today on Hadoop clusters, the MapReduce approach is now showing its age. 현재도 10년전 구글 논문 방식 그대로이다.

First, there is the minimalistic approach of MapReduce, where one has to express algorithms of arbitrary complexity using just two basic building blocks: Mappers and Reducers. In that, MapReduce can be compared to programming in assembly language. While it's doable and still used where needed, it is certainly not an experience programmers are looking forward to. Just like with higher-level languages, there is a desire to reason in higher-level data analysis abstractions, rather than in Mappers and Reducers. Mappers/Reducers가 두 개의 기본 빌딩 블록을 유지해야 한다.

The second area where MapReduce is showing its age is its inability to retain results in memory for further computations. Once Reducers are done, all the results have to be serialized back to HDFS. There is no way to keep them in memory for further processing. 더 많은 프로세싱 결과를 메모리에 유지할 수 없다. Reducer가 완료되면 모든 결과를 HDFS로 다시 직렬화해야 한다. 추가 처리를 위해 메모리에 유지할 방법이 없다. 매번 파일형태.. 즉 disk에 써야하는 비용이 크다.

Overcoming both of these challenges gave rise to a whole ecosystem of parallel data processing frameworks. One of those frameworks (Apache Spark) is now even considered to be a replacement for MapReduce. 이러한 결점을 극복하기 위해 병렬 데이터 처리 프레임 워크의 새로운 생태계가 생겨났다. 이러한 프레임워크 중 하나 인 Apache Spark은 현재 MR를 대체하는 것으로 간주된다.

Introducing Apache Spark

Spark was created to be a general purpose data processing engine, focused on in-memory distributed computing use cases and higher-level data manipulation primitives. Spark는 메모리 내 분산 컴퓨팅 사용 사례와 상위 수준의 데이터 조작 프리미티브에 초점을 맞춘 범용 데이터 처리 엔진으로 개발되었다. Spark took many concepts from MapReduce and implemented them in new ways. Spark는 MR에서 많은 개념을 도입하여 새로운 방식으로 구현했다.

Spark uses APIs written in Scala, that can be called with Scala, Python, Java, and, more recently, R. The core Spark APIs focus on using key value pairs for data manipulation.

Spark is built around the concept of an RDD. RDD stands for Resilient Distributed Dataset: resilient in the sense that data can be recreated on the fly in the event of a machine failure, and distributed in the sense that operations can happen in parallel on multiple machines working on blocks of data, allowing Spark to scale very easily in terms of size of data and the number of nodes in a cluster it can utilize for its parallel data processing tasks. 기계 고장시 즉시 데이터를 재현할 수 있고 데이터 블록에서 작동하는 여러 컴퓨터에서 작업이 병렬로 발생할 수 있다는 점에서 복원력이 뛰어나 Spark에서 데이터 크기와 클러스터의 노드 수 측면에서 매우 쉽게 확장할 수 있으므로 병렬 데이터 처리 작업에 활용할 수 있다.

And, of course, just like MapReduce, Spark is using YARN for scheduling and resource management. This makes Spark a YARN-native parallel processing engine. MR과 마찬가지로 Spark은 스케쥴링 및 자원 관리를 위해 YARN을 사용하고 있다.

Apache Spark in a Nutshell

Apache Spark is a data access engine for fast, large-scale data processing hosted on YARN. It is designed for iterative in-memory computations and interactive data mining, and provides expressive multi-language APIs for Scala, Java, and Python. Data scientists and business intelligence professionals can use built-in libraries to rapidly iterate over data for:

ETL
Machine learning
Graph workloads
Stream processing

Comparing Apache Spark to MapReduce

The initial motivation for Spark was that iterative applications did not work well with MapReduce. Spark이 초기 컨셉은 MR의 반복 작업이 불편해서 이다. Iterative applications in MapReduce require intermediate data to be written to HDFS, and then, read again for every iteration. MR에서는 반복작업시 매번 중간 데이터를 HDFS에 기록한 다음 매 반복마다 다시 읽어야 한다. Spark's goal was to keep more data in memory, to minimize the expensive read/writes that plagued developers. Spark의 목표는 메모리에 더 많은 데이터를 보관하여 개발자를 괴롭히는 비싼 읽기/쓰기를 최소화 하는 것이다. Reading from memory is measured in nanoseconds, while reading from disk is measured in milliseconds.

Now that all the intermediate results are stored in the memory of each Spark node in the cluster, the question becomes: what happens if the node crashes? 모든 중간 결과가 클러스터의 각 Spark 노드의 메모리에 저장되었으므로 노드가 down되면 어떻게 됩니까? The answer is Spark's innovative use of a very particular data structure used to store the data in-memory across multiple nodes in a cluster. This data structure is called RDD: Resilient Distributed Dataset. 그 대답은 Spark이 클러스터의 여러 노드에 데이터를 메모리에 저장하는 데 사용되는 매우 특정한 데이터 구조를 혁신적으로 사용한다는 것이다. 이 데이터 구조를 RDD라고 한다.

Apache Spark RDDs

In order to understand RDDs, let's first take a look at how they are being exposed to a Spark user. In fact, let's go back to our WordCount example and implement it using Spark:

JavaRDD<String> textFile = sc.textFile("hdfs://....");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a,b) -> a + b);
counts.saveAsTextFile("hdfs://....");

First of all, this looks much more natural to a human reader. Even if you don't know Spark, you can easily guess what is going on here. In fact, this is pretty much a shorter version of the Mapper and Reducer codes we have seen earlier.

Each RDDs is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, Amazon's S3, etc. In fact, this is exactly what is happening in the first line of our example. We are creating an RDD called textFile by pointing at a dataset in HDFS. This kind of RDD doesn't make Spark do anything just yet - it simply serves as an anchor point for further data operations.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the client after running a computation on the dataset. RDD는 두 가지 유형의 작업, 즉 기존 데이터 집합에서 새 데이터 집합을 만드는 변환과 데이터 집합에서 계산을 실행한 후 클라이언트에 값을 변환하는 작업을 지원한다.

You can chain transformation and keep deriving new RDDs from a current one. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the client. This design enables Spark to run more efficiently. By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in-memory, using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

In our case, we create a chain of 3 transformations. The first one is called flatMap and is applied to our original RDD derived from files in HDFS. This implicitly produces a second RDD. Then we apply mapToPair transformation to it and get our third RDD. The final transformation is reduceByKey and it gives us our fourth and final RDD, called counts. As our last action we apply an action that forces the content of this RDD to be written to HDFS and, in turn, makes the whole computation kick-in (remember - the transformations in Spark are "lazy").

Word Count RDD Chain

Below you can see the representation of a chain of RDDs that gets created by executing our code:

There are three important things to understand about this chain:

Every RDD can be derived from a previous RDD by applying a well-defined operation
The dataset represented by the RDD does not get computed right away, only when needed
RDDs are ideally suited for partitioning between different nodes in the cluster.

RDD Partitioning

Spark은 병렬 처리를 최대화하고 네트워크 트래픽을 최소화하기 위해 항상 데이터를 분할하려고 한다. 각 RDD는 파티션되지만, 파티션이 즉시 계산되지는 않는다. (remember: transforming RDDs in Spark is a "lazy" operation).

앞서 살펴보았던 WordCount 예제의 논리적 파티션 흐림이다.

However, only when the last element in the chain (collect) actually triggers the need for computing the RDDs will be triggered to produce some actual data. At that point, you can think of partitions mapping to physical nodes in the cluster and participating in an overall computation. 마지막 단계 주황색 체인은 실제 필요한 시점에 트리거된다. 이 시점에 클러스터의 물리적 노드에 파티션 매핑을 생각하고 전체 계산에 참여하게 된다.

Cluster with Two RDDs (4 Partitions Each)

Since both map and reduceByKey RDDs have four partitions each, here is how they may end up:

RDD Persistence Model

데이터 세트를 생성하는 게으른 접근법과 적극적인 메모리 사용이 있더라도 RDD를 외부 저장소에 보관해야 할 때가 있다. Spark은 아래와 같은 RDD지속성 옵션을 제공하낟.

Spark: An Evolution of MapReduce

Spark는 MR에 비해 몇 가지 이유로 더 빠르다.

Caching data to memory can avoid extra reads from disk
Scheduling of tasks is reduced from 15-20s to 15-20ms
Resources are dedicated the entire life of the application
It can link multiple maps and reduces together without having to write intermediate data to HDFS
Not every reduce requires a map.

Spark Streaming

Spark Streaming is a library built on top of the Spark Core framework. Spark Streaming allows for in-flight data processing with a latency as low as 1-2s. While this may not be considered real-time stream processing, in most practical applications it is good enough.

A streaming Spark application consists of the same components as a Core application (like WordCount), but it adds the concept of a receiver. The receivers listen to data streams, and create batches of data called Dstreams, that are then processed by the Spark Core engine. The nice thing with streaming is that once the data is ingested, we have the liberty to use all the Spark frameworks, and we are not limited to just using the Spark streaming functionality.

Beyond Spark RDDs: Datasets

Spark Datases : 데이터 집합은 관계형 스키마에 매칭되는 강력하게 형식화되고 불변인 개체 모음이다. he same core concept that has been powering more traditional Enterprise Data Warehouse (EDW) relational databases, even before Hadoop came into existence.

이것이 Hadoop이 이제 관계형 데이터베이스틔 대체품으로 사용될 수 있음을 의미하지 않는다. Hadoop클러스터에 저장되는 데이터는 여전히 대부분 구조화 되지 않았다. 그러나 처리와 관련하여 어떤 종류의 구조를 추론해야 한다. 그렇지 않으면 데이터 처리 단계가 의미가 없다.

Once you assume relational schema as your logical view, it is only a matter of time before you realize that you may as well use SQL as a high-level language to express various, common data processing ideas. Not surprisingly, Spark itself has an SQL implementation on top of Datasets, but, instead of reviewing that implementation, we will focus on the oldest SQL processing framework available for Hadoop: Apache Hive.

Enterprise Data Warehousing (EDW)

In order to truly fit into the EDW use case, Hadoop had to fit in existing skills and tools of EDW practitioners, not just offer effective ways to do data processing. EDW 사용 사례에 진정으로 적합하도록 Hadoop은 데이터 처리를 수행하는 효과적인 방법을 제공하는 것뿐만 아니라 EDW 실무자의 기존 기술과 도구에 적합해야 했다.

This realization created a vibrant ecosystem of SQL-based parallel data processing frameworks supported by YARN.

For the rest of this chapter, we will focus on the grand daddy of them all: Apache Hive. However, if you are looking to Hadoop as an EDW replacement, you should definitely check out all the alternatives as well: Spark SQL, Apache Drill, Apache HAWQ (incubating), Apache Impala (incubating), and Presto, just to name a few.

Introducing Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad hoc query, and analysis of large datasets.

Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data (with recent versions also supporting more sophisticated UPDATE and DELETE operations).

Within any particular database, the data in the tables is serialized, and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how the data is distributed within the sub-directories of the table directory. The data within partitions can be further broken down into buckets.

Hive Architecture

Hive는 MPP (Massively Parallel Processing) 관계형 분석 데이터베이스와 매우 흡사하다. Hadoop 클러스터에 두 가지 주요 구성요소를 추가하여 이러한 기능을 유지한다.

HiveServer2 (for managing clients connections, just like a regular relational database would)
MetaStore (for mapping unstructured data stored in HDFS into relational schema)

Behind the scenes, any client connecting to Hive goes through the following steps:

Issue an SQL query, either through a command line tool or a JDBC connection.
HiveServer2 parses the query and prepares an execution plan, based on the data in the MetaStore.
Hive executor converts the query into a series of MapReduce jobs or a Tez job.

Hive: A Layer on Top of MapReduce?

사용자가 query를 던지면 이를 MR작업 컬랙션으로 만들어 실제 클러스터에서 실행합니다.

앞에서 언급했듯이 MR의 단점들을 HIVE도 그대로 가져가게 됩니다. 하지만, 최근에는 MR이외에 Tez를 사용할 수 있다 These days, however, there are more choices. It started with Hive growing its own execution layer, called Tez. Tez is a YARN-native parallel processing framework, much better optimized for the kind of data processing needs that Hive has. On top of that, the Apache Hive community has explored using Spark as an alternative execution layer, further developing the view that Spark is a much more capable replacement to MapReduce.

In order to appreciate what difference an execution layer makes to Hive, and also to understand how Hive orchestrates the flow based on SQL queries, let's take a query that has a few JOINs in it. The reason we focus on JOINs (even if they make our example query a tad complex) is simple: JOINs are at the heart of making relational approach to data wrangling work. Understanding how different Hive execution layers handles JOINs will give you a solid basis for evaluating them.

실행계층(Tez, Spark..)이 Hive와 어떤 차이가 있는지 알아보고 Hive가 SQL 쿼리를 기반으로 흐름을 조정하는 방법을 이해하기 위해 몇 가지 JOIN이 포함된 쿼리를 살펴보자.

Understanding how different Hive execution layers handles JOINs will give you a solid basis for evaluating them.

Anatomy of a Hive Query

hive 쿼리 해부

Let's look at a toy example of a typical Business Intelligence (BI) query: we will be working with a database containing sales information for an online merchant selling to customers in different states. Table a is the main table with all the sales transactions. A column called state in that table has a US state in which a customer who bought an item resides. Another column called itemId links us to a primary key in a table c that contains a full catalog of all the items that we're selling (including a price of an item recorded in a column called price).

Suppose that for each state we are interested in the total number of sales and an average price of those sales for all the female customers. 각 주마다 우리는 총 판매 수와 모든 여성 고객에 대한 평균 가격에 관심이 있다고 가정하자. Now, since it is very unlikely that we will have that gender information available to us upfront, we may need to find it out by other means. 우리는 성별정보를 미리 알 수 없기 때문에 다른 방법으로 찾아야 할 수도 있다. For each row in the transaction table a we could then record id of that row in an index table b if we have reasons to believe that the transaction was done by a female customer. 트랜잭션 테이블의 각 행에 대해 거래가 여성 고객에 의해 수행되었다고 믿을만한 이유가 있는 경우 인덱스 테이블 b를 기록할 수 있다.

With all that in place, our query then will look something like this:

SELECT a.state, COUNT(*), AVG(c.price) FROM a
     JOIN b ON (a.id = b.id)
     JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

For a query like this, Hive will have to construct a pipeline with four different stages, corresponding to two of the JOINs, a GROUP BY and scans of tables. HIVE는 두 개의 JOIN, GROUP BY 및 테이블 스캔에 해당하는 네 가지 단계로 파이프 라인을 구성해야 한다. For example, an output of the first JOIN and an output of the table b scan will be all feeding into the final stage of the entire query execution pipeline, as shown in the following diagram (where M represent Mappers, and R represent Reducers): 예를 들어 다음 다이어그램에서와 같이 첫번째 JOIN의 출력과 테이블 b 스캔의 출력이 모두 전체 쿼리 실행 파이프 라인의 마지막 단계로 공급된다.

As you can see, when Hive is running on the MapReduce engine, the only way to exchange data between different stages of the pipeline is to write it all out into HDFS and read it back later. 보시다시피 HIVE가 MapReduce 엔진에서 실행중일 때 파이프 라인의 다른 단계에서 데이터를 교환하는 유일한 방법은 HDFS에 모든 것을 기록하고 나중에 다시 읽는 것이다. Spark and Tez (shown on the right) don't have that limitation. Spark와 Tez에는 이러한 제한이 없다. They can make different stages of the query execution pipeline feed off of each other simply by exchanging data in-memory. On top of that, neither Tez nor Spark force Hive to create Mappers when there is no need for them (like in the final stage of our query execution pipeline that is all about a single Reducer). 이들은 메모리 내에서 데이터를 교환하기 만하고 쿼리 실행 파이프 라인의 서로 다른 단계를 서로 중단 시킬 수 있다. 무엇보다 Tez와 Spark는 Mapper를 필요로 하지 않을 때 Hive가 Mappers를 만들도록 강요하지 않는다. (쿼리 실행 파이프 라인의 마지막 스테이지 에서 처럼 하나의 Reducer에 관한 것이다. )

The Client-Side of Hive: Tables

Now that you understand what happens under Hive's hood, the rest is easy. From the client's perspective, Hive looks like a virtual relational database: virtual because Hive does not really own any data - 클라이언트 관점에서 Hive는 가상 관계형 데이터베이스처럼 보입니다. it just performs the queries against data that is already in HDFS. But, as a client, you do not know that. Hive는 실제로 데이터를 소유하지 않기 때문에 이미 HDFS에 있는 데이터에 대해 쿼리를 수행하기만 합니다. You are operating on the familiar concept of tables. Here is what it takes to create a table in Hive: 다른 RDMS에 익숙한 개념의 테이블에서 작업하고 있습니다. Hive에 테이블을 만드는데 필요한 것은 다음과 같습니다.

CREATE TABLE customer (
      customerID INT,
      firstName STRING,
      lastName STRING
      birthday TIMESTAMP,
) ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ',';

This gives us what is known as a native Hive table. This table is still nothing more than a bunch of CSV files (this is what FIELDS TERMINATED BY ',' tells Hive) in HDFS. 이 테이블은 여전히 CSV파일의 모음이다. The only thing that makes this table native to Hive is that all of those files will be deleted when you drop the table. Otherwise, there is nothing special about them. 이 테이블을 하이브에 기본으로 만드는 유일한 것은 테이블을 삭제할 때 모든 파일이 삭제된다는 것이다. In fact, if you do not want files to be deleted, all you have to do is declare this table as EXTERNAL, like this: 사실 파일을 삭제하고 싶지 않으면 이 테이블을 EXTERNAL로 선언하면 된다.

CREATE EXTERNAL TABLE customer (
      customerID INT,
      firstName STRING,
      lastName STRING
      birthday TIMESTAMP,
) ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ',';

The Client-Side of Hive: Table Location

Regardless of whether a table is native or external, you may be curious about where exactly in the HDFS file tree does Hive manage these files. 테이블이 네이티브인지 외부인지 관계없이 Hive가 이러한 파일을 관리하는 HDFS 파일 트리가 정확히 어디에 있는지 궁금할 수 있다. The answer is: anywhere you want. By default, Hive uses a folder in HDFS specified as part of the Hive configuration file, but you can easily override that with a user-provided folder in HDFS via the LOCATION keyword: 대답은 원하는 곳 어디에서든지 default는 Hive configure파일에 지정된 HDFS폴더를 사용하지만 LOCATION 키워드를 통해 HDFS의 사용자 제공 폴더로 쉽게 덮어 쓸 수 있다.

CREATE TABLE customer (
      customerID INT,
      firstName STRING,
      lastName STRING
      birthday TIMESTAMP,
) ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
       LOCATION '/user/roman/customers';

And here is something that usually confuses everybody coming to Hive from a traditional relational database background: Hive를 사용하는 모든 사람들이 기존의 관계형 데이터베이스 배경을 혼란스럽게 만드는 경우가 있다. if you happen to specify the LOCATION with valid files already in that location, the table you are creating won’t be empty. It will contain data that can be de-serialized from those files. location에 유효한 파일이 있으면 LOCATION을 지정하면 생성중인 테이블이 비어 있지 않다. 해당 파일에서 역 직렬화 할 수 있는 데이터가 포함된다. Once again: Hive doesn't really own any data, it just uses whatever happens to be in HDFS. 굉장히 중요한 개념인데 Hive는 어떤 데이터도 소유하고 있지 않으며 단지 HDFS에 있는 모든 것을 사용한다.

Of course, if you want to have full control over how data is loaded into your tables, or, if you want to build them from scratch, it is pretty easy to do so with Hive as well. 물론 데이터를 테이블에 로드하는 방법을 완전히 제외하고 싶거나 처음부터 다시 빌드하려는 경우 하이브에서도 쉽게 수행할 수 있다.

The Client-Side of Hive: Loading Data

If you want to load data from a local filesystem, below is the command to do it:

LOAD DATA LOCAL INPATH '/tmp/customers.csv' OVERWRITE INTO TABLE customers;

Of course, data can also be loaded from a set of folders in HDFS (see the lack of the LOCAL keyword):

LOAD DATA INPATH '/user/train/customers.csv' OVERWRITE INTO TABLE customers;

A more SQL-like way of loading data from an existing query is supported as well:

INSERT INTO customers
               SELECT firstName, lastName, birthday
               FROM users
               WHERE birthday IS NOT NULL;

All these data loading examples look different, but they all have the same results: CSV files with data created in HDFS. All of the metadata telling Hive how these files map to tables are stored in Hive's MetaStore. 이러한 모든 데이터로드 예제는 다르게 보이지만 모두 동일한 결과를 얻습니다. HDFS에서 생성된 데이터가 포함 된 csv 파일 이 파일들이 테이블에 매핑되는 방식을 Hive에서 알려주는 모든 메타 데이터는 하이브의 MetaStore에 저장된다.

The Client-Side of Hive: Queries and Views

Once you infer a relational structure on unstructured data stored in HDFS by declaring tables, the rest is even easier. 테이블을 선언하여 HDFS에 저장된 구조화되지 않는 데이터에 대한 관계형 구조를 추측하면 나머지는 훨씬 쉬워진다. You can process and manipulate that data calling SQL directly from one of the Hive clients, or you can use various tools that know how to issue JDBC requests. Hive 클라이언트 중 하나에서 직접 SQL을 호출하는 데이터를 처리하고 조작하거나 JDBC 요청을 발행하는 방법을 알고 있는 다양한 도구를 사용할 수 있다. You can also create views to facilitate access to frequently issued queries, and structure your Hadoop-based EDW data mart. 자주 발생하는 쿼리에 대한 액세스를 용이하게 하는 View를 생성하고, Hadoop기반 EDW 데이터마트를 구조화 할 수 있다. A view is defined using the CREATE VIEW statement. A view is not a table in Hive with actual data, but a view can be treated like a table. View는 실제 데이터가 있는 Hive 테이블은 아니지만 View는 테이블처럼 취급될 수 있다. For example, you can run the DESCRIBE command on a view to see its schema, and also see it in the list of “show tables”.

Here is how, for example, a view detailing 2017 visitors to the White House can be created:

CREATE VIEW 2017_visitors AS
    SELECT fname, lname, time_of_arrival, info_comment
    FROM wh_visits
    WHERE
        cast(substring(time_of_arrival, 6, 4) AS int) >= 2017
    AND
        cast(substring(time_of_arrival, 6, 4) AS int) < 2018;

All of the information on how Tables map to the data stored in HDFS and how views map to queries is stored in Hive's metastore. 테이블이 HDFS에 저장된 데이터에 매핑되는 방법과 View가 쿼리에 매핑되는 방법에 대한 모든 정보는 Hive의 MetaStore에 저장된다.

Hive: Lightweight SQL Layer or EDW Platform?

When Hive first appeared, back in 2008, it was fighting a lonely battle of trying to keep SQL alive and well inside of an ecosystem that was largely rejecting the relational organization of data. Hive가 처음 등장했을 때, 2008 년에 SQL은 데이터의 관계형 조직을 크게 거부하는 생태계 내부에서 SQL을 생성하게 유지하려는 외로운 싸움을 하고 있었다. This position meant that, from day one, Hive had to strive for extreme flexibility in how it presented itself to the end users. Hive가 최종 사용자에게 자신을 소개하는 방식에서 극도의 유연성을 위해 노력해야 한다는 것을 의미했다. It was up to them to decide whether to use Hive as a fairly loose SQL-driven query layer over largely unstructured data residing in HDFS, or to approach it as a much more traditional Enterprise Data Warehouse (EDW) solution. Hive를 HDFS에 있는 대부분의 비 구조화 된 데이터에 대해 상당히 느슨한 SQL 기반 질의 계층으로 사용할지 또는 훨씬 더 전통적인 EDW 솔루션으로 접근할지 결정하는 것은 전적으로 사용자의 몫이었다.

If your recall our history lesson, you may remember that EDW systems are central repositories of integrated data from one or more disparate sources.

They are used to store current and historical data in one single place, and are primary mechanisms for creating analytical reports for knowledge workers throughout the enterprise. They also come with a vast ecosystem of their own tools and policies, and are typically maintained by a dedicated staff of Database Administration (DBA) professionals. 이들은 현재 및 과거 데이터를 단일 위치에 저장하는 데 사용되며 엔터프라이즈 전반의 지식 근로자에 대한 분석 보고서를 작성하기 위한 메커니즘 이다. 그들은 또 자체 툴과 정책으로 구성된 방대한 생태계를 가지고 있으며 일반적으로 DBA전문가의 전담 직원이 관리한다. In short, they are extremely useful, but also quite heavyweight vaults for all the business-critical data within the enterprise. 즉 매우 유용하지만 기업 내부의 모든 업무에 중요한 데이터를 보관할 수 있는 매우 중량이 큰 보관소이다.

The good news is that the modern versions of Hive can provide both capabilities. 좋은 소식은 Hive의 최신 버전이 두 가지 기능을 모두 제공할 수 있다는 것이다. Our Hive demos in Chapter 5 will focus on the lighter side of Hive (just to show you how easy it is to get going with it). 5장의 Hive 데모는 하이브의 밝은 면에 초점을 맞춘다. Hive, used as an alternative to traditional EDW systems, requires much more upfront configuration and setup and will not be covered in this course. 전통적인 EDW 시스템의 대안으로 사용되는 Hive는 훨씬 더 많은 초기 구성과 설정을 필요로 하며 이 과정에서는 다루지 않는다. You would have to lock off the access to data, and only provide it via a traditional EDW API of ODBC or JDBC, you will have to start storing all the data in Hive managed non-text-based tables, like ORC or Parquet, and you will have to make sure Hive leverages the LLAP capabilities of its execution engine. And, of course, since EDW requires more than a full-powered SQL engine, you will have to add security, governance, data movement, workload management, monitoring, and user tools to help Hive really shine as an EDW system (these additional functions are covered by companion Apache projects, some of which we will see in Chapter 6). Hive가 관리하는 비 텍스트 기반 테이블(ORC or Parquet)에 모든 데이터를 저장해야 한다. Hive가 실행 엔진의 LLAP 기능을 활용하는지 확인해야 한다. 물론 EDW는 완전한 성능의 SQL엔진 이상을 필요로하기 때문에 보안, 거버넌스, 데이터 이동, 작업 부하관리, 모니터링 및 사용자 도구를 추가하여 Hive가 실제로 EDW 시스템으로 빛나도록 할 수 있다.

If you are just starting with Hadoop, all of this may look like an overkill, which brings us back to the extreme flexibility that Hive offers. All you need to remember for now is that, even if you're starting by simply using it as a lightweight SQL-driven parallel data processing framework, it has you covered every single step on your eventual journey towards a full blown EDW system. 지금 당장 기억해야 할 것은 가벼운 SQL 기반 병렬 데이터 처리 프레임 워크로 사용하는 것부터 시작한다고해도 완전히 날아갈 EDW 시스템을 향한 궁극적인 여정에서 모든 단계를 다룹니다.

nowol79 / MOOC