
Chapter 6. Hadoop Security #8


Copyright edX, LinuxFoundationX: LFS103x Introduction to Apache Hadoop

In this lesson, I will teach you about data security, and how to protect users and their data in a fully secure, fully multi-tenant Hadoop deployment.

We'll start this lesson by quickly reviewing five basic principles of any enterprise security architecture, and how those principles are reflected in the design of Hadoop itself and its ecosystem projects.

We will then introduce two new members of the Hadoop ecosystem, Apache Atlas and Apache Ranger, and I will teach you how they play a very important role in providing a holistic approach to Hadoop cluster security.

Now, as with any comprehensive rollout of enterprise security policies, Hadoop has a lot of moving parts. In order for you to understand the interaction between all of those, I will end this chapter with an end-to-end example of a Hive query going from a client's laptop all the way into the Hadoop cluster, and the security considerations you have to keep in mind to make it all work together.

Finally, this chapter is not meant to be a fully comprehensive guide to an overall security architecture, but rather a primer on how Hadoop overlaps with it, and what bits and pieces you need to know about. So, with that, let's dive right into it.

Learning Objectives

In this chapter you will learn about the following:

- The enterprise approach to security and its five pillars
- Protecting data at rest and in motion with encryption
- Authentication with Kerberos and Apache Knox
- Authorization and audit with Apache Ranger and Apache Atlas
- How a fully secured Hive query flows through the cluster
- The internal architecture of Apache Ranger

Enterprise Approach to Security

For as long as you are running Hadoop just for yourself or a trusted group of users, it may be OK not to worry about security. However, any production Hadoop deployment in a serious enterprise setting will require you to implement a holistic security approach. In general, the enterprise approach to securing data in Hadoop revolves around five pillars, each answering its own fundamental question:

- Administration: how do I set and manage security policy across the cluster?
- Authentication: who are you, and how can you prove it?
- Authorization: what are you allowed to do?
- Audit: what did you do, and when?
- Data protection: how is data encrypted at rest and in motion?

Enterprise infrastructures must provide enterprise-grade coverage across each of these pillars since any single weakness will introduce additional threat vectors for Hadoop clusters.

Hadoop has grown very mature capabilities in each of these areas. This chapter will provide a whirlwind tour through those capabilities, but it is not meant to be a comprehensive guide to Hadoop security.

Data Protection

Before we get into how the Hadoop ecosystem protects itself from security threats, let's talk about when and how Hadoop hands data off to other layers in the system.

This mainly boils down to:

- data at-rest: blocks and files stored on the local disks of cluster nodes
- data in-motion: blocks and RPC traffic traveling over the network between Hadoop nodes and clients

Both of these can be used as attack vectors. For data at-rest, an attacker can try getting access to hard drives (for example, when they get decommissioned from a data center) and recover valuable data. For data in-motion, an attacker can try to penetrate a corporate network and snoop on all of the network traffic, which will include blocks of data sent between different Hadoop nodes.

A standard practice to protect from both of these is the use of strong cryptography. Hadoop is working with its encryption partners to integrate HDFS encryption with enterprise-grade key management frameworks.

Encryption in HDFS, combined with the KMS access policies maintained by Ranger, prevents attackers (including rogue Linux or Hadoop administrators) from accessing data. Data in-motion encryption capabilities are built right into Hadoop itself, while protecting data at-rest will require integration with various commercial and open source offerings.
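
As a rough illustration of data at-rest protection, the sketch below creates an HDFS encryption zone programmatically. This is a minimal sketch, assuming a KMS is already configured for the cluster and that a key named masterKey was created beforehand (for example, with the `hadoop key create` CLI); the NameNode URI, path, and key name are all placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class EncryptionZoneSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI nameNode = URI.create("hdfs://namenode.example.com:8020"); // placeholder

        // The target directory must exist and be empty before it can
        // become an encryption zone.
        try (FileSystem fs = FileSystem.get(nameNode, conf)) {
            fs.mkdirs(new Path("/secure"));
        }

        // Every file written under /secure is encrypted on disk with a
        // per-file key wrapped by "masterKey"; authorized reads decrypt
        // transparently, so applications need no code changes.
        HdfsAdmin admin = new HdfsAdmin(nameNode, conf);
        admin.createEncryptionZone(new Path("/secure"), "masterKey");
    }
}
```

In practice, administrators usually accomplish the same thing with the `hdfs crypto -createZone` command rather than from code.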

Authentication: Apache Knox and Kerberos

Using strong authentication to establish user identity is the basis for secure access in Hadoop. Once a user is identified, that identity is propagated throughout the Hadoop cluster and used to access resources (i.e. files and directories), and to perform tasks such as running jobs.

Hadoop uses Kerberos, a standard, enterprise-grade industry solution for strong authentication in distributed systems. Kerberos is not an Apache project, and there are multiple implementations of it available.

There is a standard open source implementation that is shipped by default with all Linux distributions, and there is a variety of popular commercial implementations, such as Microsoft's Active Directory.
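
To make this concrete, here is a minimal sketch of how a Java client might authenticate to a kerberized cluster using a keytab; the principal name, realm, and keytab path are assumptions for illustration, and the cluster's core-site.xml/hdfs-site.xml are expected on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Obtain a Kerberos ticket from the KDC; all subsequent Hadoop
        // RPC calls carry credentials derived from this identity.
        UserGroupInformation.loginUserFromKeytab(
                "joe@EXAMPLE.COM", "/etc/security/keytabs/joe.keytab");

        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println(fs.exists(new Path("/user/joe")));
        }
    }
}
```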

Kerberos takes care of authentication needs within the Hadoop cluster perimeter itself. A lot of times, though, users wish to access Hadoop cluster services from mobile or desktop environments that cannot be managed as part of the cluster. This is where Apache Knox comes in. The Apache Knox gateway ensures perimeter security for Hadoop. With Knox, enterprises can confidently extend the Hadoop REST API to new users, without Kerberos complexities, while also maintaining compliance with enterprise security policies. Knox provides a central gateway for Hadoop REST APIs that have varying degrees of authorization, authentication, SSL, and SSO capabilities to enable a single access point for Hadoop.
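
For example, a client outside the cluster might reach WebHDFS through Knox with nothing but HTTPS and a username/password, with no Kerberos libraries on the client at all. A minimal sketch (Java 11+), where the gateway host, topology name ("default"), and credentials are all assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class KnoxWebHdfsSketch {
    public static void main(String[] args) throws Exception {
        // Basic credentials; Knox validates them (e.g. against LDAP) and
        // obtains Kerberos credentials on the user's behalf.
        String creds = Base64.getEncoder()
                .encodeToString("joe:joe-password".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://knox.example.com:8443/gateway/default"
                        + "/webhdfs/v1/user/joe?op=LISTSTATUS"))
                .header("Authorization", "Basic " + creds)
                .GET()
                .build();

        // Assumes the JVM trusts the gateway's SSL certificate.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```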

With a combination of Kerberos and Knox, any Hadoop administrator can be sure that users are always who they say they are. User impersonation is only possible for authorized "super users".

We must note that properly deploying Kerberos from scratch and configuring various Hadoop services to use it can be a bit of a challenge, especially for junior Hadoop practitioners without prior exposure to enterprise security.

The easier way to get going is to delegate that task to either Apache Ambari or Apache Bigtop.

Authorization: Apache Ranger

While services like HDFS may provide some coarse-grained authorization capabilities (such as controlling permissions on files and folders), when it comes to fine-grained authorization controls, we turn to Apache Ranger.

Ranger manages fine-grained access control through a rich user interface that ensures consistent policy administration across Hadoop data access components.

Security administrators have the flexibility to define security policies for a database, table, and column, or a file, and administer permissions for specific LDAP-based groups or individual users. Rules based on dynamic conditions, such as time or geography, can also be added to an existing policy rule. The Ranger authorization model is highly pluggable and can be easily extended to any data source using a service-based definition.

Ranger works with standard authorization APIs in each Hadoop component, and is able to enforce centrally-administered policies for any method of accessing all kinds of Hadoop data. Think of Ranger as the central authority when it comes to allowing users to perform any kind of action on the data stored in Hadoop clusters. With Ranger enabled, a Hadoop service running as part of the cluster can only perform an action on behalf of a given user after explicitly checking in with Ranger.
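
Policies are usually managed through the Ranger web UI, but Ranger Admin also exposes a public REST API. As a hedged sketch, assuming a Ranger Admin at ranger.example.com:6080 and admin credentials (all placeholders), the following lists the currently defined policies via the public v2 API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class RangerPolicyListSketch {
    public static void main(String[] args) throws Exception {
        String creds = Base64.getEncoder()
                .encodeToString("admin:admin-password".getBytes());

        // Ranger Admin's public REST API; returns the policies as JSON.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(
                        "http://ranger.example.com:6080/service/public/v2/api/policy"))
                .header("Authorization", "Basic " + creds)
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```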

Audit: Apache Ranger and Apache Atlas

Ranger also provides a centralized framework for tracking users' actions and collecting audit history for easy reporting, including the ability to filter data based on various parameters.

Working together with Apache Atlas, Ranger makes it possible to gain a comprehensive view of data lineage and access audit, with an ability to query and filter audit data based on classification, users or groups, and other filters. Apache Atlas is a new open source project which consists of a set of core foundational governance services that enable enterprises to meet their compliance requirements within Hadoop and allow integration with the complete enterprise data ecosystem.

These services include:

- data classification, for tagging data with business taxonomies
- centralized auditing
- search and lineage (browse)
- a security and policy engine (integrated with Ranger for tag-based policies)

Ranger: A Single Pane of Glass for Security Administration

Given the central role that Ranger plays in Authorization and Audit, it should come as no surprise that it emerged as a “single pane of glass” for security administration in Hadoop clusters.

With Ranger, Hadoop administrators have a centralized user interface to define, administer, and manage security policies consistently across all the components of the Hadoop stack.

Ranger enhances the productivity of security administrators and reduces potential errors, by empowering them to define security policies once, and apply them to all the applicable components across the Hadoop stack from a central location.

To make the Hadoop administrator's job easier, Ranger comes with an extremely intuitive Web UI that, for a cluster with HDFS, YARN, Hive and a few other services, would look like this:

[Image: the Ranger Web UI, listing the cluster's HDFS, YARN, Hive, and other services]

An administrator can drill down into each service's permissions model by simply navigating through the service representation and its properties.

Hadoop Security: Hive Query

There are quite a few moving parts in a fully secure Hadoop cluster. In order to appreciate their interaction, let's start with a simple example of a Hive client issuing a query on behalf of user Joe. The first thing that a client would do is initiate a JDBC connection to HiveServer 2, and actually submit the query.
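
The client side of that first step might look like the minimal sketch below, which assumes the hive-jdbc driver is on the classpath and that HiveServer 2 is reachable at hive.example.com:10000 (hostname, table, and user are placeholders; no security is enabled yet).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Plain, unsecured JDBC connection: the username is taken at face
        // value, which is exactly the weakness discussed below.
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "joe", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```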

Then, HiveServer 2 will parse the query and submit it as a MapReduce job.

Now, for simplicity's sake, we are not showing YARN here, but suffice it to say that YARN will take care of creating a bunch of processes on different nodes, and make them access the required data in HDFS.

Once everything is said and done, and the MapReduce job completes, it will return its result back to HiveServer 2. HiveServer 2 will incorporate that result into the query answer that it will then return to the client.

This is a very simple scheme, and it's something that we've already seen in the previous chapter with our simplified deployments of Hadoop. While easy to deploy, this setup doesn't protect against the most basic exploit: a malicious Hive client pretending to be a superuser, trying to get access to all the data in the cluster that it shouldn't really be able to see. A more subtle problem here is that, even though we are implicitly trusting something like HiveServer 2 to be what it claims it is, there is absolutely no way to guarantee it. In fact, "HiveServer 2" could also be a malicious user, masquerading as a legitimate HiveServer 2, trying to pull off a man-in-the-middle attack.

So, how do we fix this? How do we make sure that any piece of what we see on screen can trust any other piece? Well, for starters, let's enable strong authentication, by introducing a cluster-wide Kerberos database (the blue database at the bottom of the diagram). Think of the Kerberos database as a centralized repository of all the user credentials. Any user wishing to prove that they are indeed who they say they are will prove that first to Kerberos, obtain a special token, and then use that token within the cluster to prove their identity.

Now, interestingly enough, this scheme works not only for human users, like our user Joe, but also for services like HiveServer 2, HDFS, YARN, or anything else that may be running on the cluster. Long story short, a cluster-wide Kerberos is like a Mafia boss who knows everyone and can vouch for the identity of various parts of his circle.

But how large is that circle, you may ask. Well, it definitely includes anything that's happening within the cluster itself. Human users, though, will a lot of times be known to a different repository. In our case, let's introduce Microsoft Active Directory (the blue database on the left) as the repository to which all of the human users are known, including our user Joe. So, as it happens, user Joe is known to Active Directory, but not to the Kerberos database. Do we have a problem? Fear not: just like Mafia bosses can sometimes trust each other and vouch for other members, we can establish a one-way trust between Active Directory and the Kerberos database, and user Joe can use that.

Once we have that one-way trust, the sequence of events on a cluster looks something like this. First, all of the services, like HiveServer 2, HDFS, YARN, or anything else like that, start up, prove their identity to Kerberos, and get back tokens that they can keep using for some period of time. What happens to our user Joe? He has to prove his identity to Active Directory first. But, since he also needs the same kind of token from the centralized Kerberos database, his identity is forwarded to Kerberos, and, because of the one-way trust we enabled between Active Directory and Kerberos, he receives the very same kind of token that all of the non-human services in the cluster received. So now, any piece of the cluster can trust any other piece of the cluster, including the human users. The rest of the flow with the query is exactly like what we've seen before, with one interesting addition: every single network request will actually include the identity of whoever is making that request, and that identity will be based on the token that was received from the Kerberos database.

The next question on our list, of course, is this: even if we can trust everything that happens here from an authentication standpoint, what happens if user Joe submits a query using data that he shouldn't really have access to? In order to solve that problem, let's introduce authorization. Once a user like Joe is authenticated, authorization will be performed by Apache Ranger. Apache Ranger is a service that has its tentacles in all of the cluster components, and, before any component can do any piece of work on behalf of a given user, it actually makes sure that Ranger would approve of this action. Ranger synchronizes its user and group repository with something like Microsoft Active Directory, as shown at the top, and it contains the detailed policies of what kind of data the user has access to: Hive tables, HDFS files, etc. HiveServer 2 will then pass the user ID and the request to the Ranger policy server, and will receive a reply back, saying whether that user is allowed to do what the user is intending to do.

This, of course, gives us very fine-grained access control, and also a single point of policy administration. But the cluster services are still exposed to any kind of malicious traffic that may be coming from the client, so the next layer of security to build up is strong perimeter security with Knox. First of all, we isolate all of the cluster components in a fortified perimeter, so that network traffic is only allowed within that perimeter, and the only component capable of talking to the outside world is Apache Knox. Knox provides a gateway that all clients have to pass through to get to the cluster; direct access to cluster services, like direct access to HiveServer 2, is not allowed. This helps protect the cluster, and it also creates a centralized authentication point. Basically, to simplify client access, all clients start by contacting Apache Knox, which then contacts Kerberos on their behalf, so individual clients don't need to worry about interfacing with Kerberos directly, because Knox does it for them. This also provides us with a central audit point for any request coming into the cluster. Clients such as our user Joe connect to Knox using a standard user ID and password combination, and let Knox create the necessary Kerberos tokens based on those credentials.

This already looks much more secure. But let's go one step further and introduce encryption for data in motion (anything that travels on the network) and data at rest (anything that is actually stored on disk). We do that by enabling SSL for all of the network traffic, and enabling encryption of data on disk in HDFS. Look at how each of the network connections on our diagram is now protected by a green SSL sleeve, and how blocks of data stored in HDFS are now protected by green locks. Those locks are in fact coming from a Hadoop feature known as transparent encryption.

At this point, we have a fully secured cluster and, better yet, we made it in a fashion that is fully transparent to our original Hive client, which can still issue the same query as it did on our first slide. Basically, you can see that with Hadoop, strong security doesn't really have to come at the expense of usability, and enabling it should be one of your top priorities in any real-world production deployment.
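
To illustrate that transparency, the only client-visible change in our earlier JDBC sketch would be the connection URL, which now goes to Knox over HTTPS instead of straight to HiveServer 2. The gateway host, httpPath, and truststore details below are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class SecureHiveConnectSketch {
    public static void main(String[] args) throws Exception {
        // Hive JDBC in HTTP transport mode, routed through the Knox
        // gateway over SSL; Knox handles Kerberos behind the scenes.
        String url = "jdbc:hive2://knox.example.com:8443/default;"
                + "transportMode=http;httpPath=gateway/default/hive;"
                + "ssl=true;sslTrustStore=/etc/knox/gateway.jks;"
                + "trustStorePassword=changeit";
        try (Connection conn =
                     DriverManager.getConnection(url, "joe", "joe-password")) {
            System.out.println("connected as joe through Knox");
        }
    }
}
```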

[Image: the fully secured Hive query flow, with Kerberos, Active Directory, Ranger, Knox, and SSL/encryption in place]

Internal Ranger Architecture

Given the central role Ranger plays in enabling comprehensive enterprise Hadoop security, let us recap its architecture, showing various integration points. Don't worry if you don't understand all of it. Just keep it in mind for future reference.

[Image: internal Ranger architecture, showing Ranger plugins embedded in cluster services and the central Ranger Policy and Audit servers]

In the diagram above, you can see five different services running as part of a shared Hadoop cluster: HDFS, HiveServer 2, Knox, HBase, and Storm.

Each contains a piece of code (called Ranger Plugin) that intercepts actions performed on behalf of different users in the enterprise, and reports that intent to perform an action back to the central Ranger server shown in blue at the top of the diagram.

That allows Ranger to enforce fine-grained policies (via the Ranger Policy Server), and also keep track of actions as they happen (via the Ranger Audit Server), thus enabling future audits and compliance reports. Given the distributed scale of even a small Hadoop cluster, Ranger is designed to get out of the way pretty quickly, and, most of the time, doesn't present a bottleneck for bulk data operations.