LinuxFoundationX: LFS103x Introduction to Apache Hadoop (content copyright edX)
Knowledge Check 4.1
0.0/1.0 point (graded)
What are the phases of MapReduce? Select all answers that apply.
A. Map
B. ETL
C. Combiner
D. Reduce
E. Sort/Shuffle
Got this one wrong. ETL was a new concept to me: it stands for Extract, Transform and Load, and seems to be a signature Data Warehouse term.
ETL is short for Extraction, Transformation, and Loading, and is one of the basic building blocks of a business intelligence (BI) implementation. An ETL tool gathers data from various sources, extracts it, transforms it into a single common format, and loads it into a data warehouse (DW) or data mart (DM).
Knowledge Check 4.2
1.0/1.0 point (graded)
What determines the number of Mappers? Select the correct answer.
A. The number of nodes in the cluster
B. The number of blocks of input data stored in HDFS
C. YARN
Knowledge Check 4.3
1.0/1.0 point (graded)
What determines the number of Reducers? Select the correct answer.
A. The number of nodes in the cluster
B. The number of blocks of input data stored in HDFS
C. YARN
D. User setting (with 1 being the default)
Knowledge Check 4.4
0.0/1.0 point (graded)
It is possible to have a MapReduce job that ____. Select all answers that apply.
A. Has no Mappers
B. Has no Reducers
C. Does not go through a Shuffle phase
Got this wrong.
I assumed a MapReduce job always has to run both Mapper and Reducer stages, but apparently not.
In practice, Streaming jobs are sometimes run with the SetReducer option set to none.
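For reference, a map-only job can be requested by setting the number of reducers to zero. A sketch of what such a Hadoop Streaming invocation might look like (the jar name, paths, and mapper script are placeholders, not taken from the course):

```
# Illustrative only: paths and script names are hypothetical.
# With 0 reducers this becomes a map-only job; Mapper output is
# written straight to HDFS and the Sort/Shuffle phase is skipped.
hadoop jar hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -input /data/input \
  -output /data/output \
  -mapper my_mapper.py
```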
Knowledge Check 4.5
0.0/1.0 point (graded)
Both Mappers and Reducers are _____. Select all answers that apply.
A. Called with a single key-value pair
B. Called with multiple key-value pairs
C. Return a single key-value pair
D. Return multiple key-value pairs
Got this one wrong too; I was confused by the single vs. multiple key-value pair distinction.
The input to each call is always a single key-value pair, while the output can be multiple key-value pairs... I still don't fully understand this part.
Knowledge Check 4.6
1.0/1.0 point (graded)
What does RDD stand for? Select all answers that apply.
A. Resilient
B. Rational
C. Distributed
D. Damaged
E. Dataset
Knowledge Check 4.7
1.0/1.0 point (graded)
Which of the following languages support Spark? Select all answers that apply.
A. R
B. Java
C. C++
D. Scala
E. Python
F. Ruby
Knowledge Check 4.8
0.0/1.0 point (graded)
RDDs are not evaluated or persisted by default. True or False?
A. True
B. False
Knowledge Check 4.9
1.0/1.0 point (graded)
Which Spark extension introduced DStreams? Select the correct answer.
A. MLlib
B. Spark Streaming
C. Datasets
D. Spark Core
Knowledge Check 4.10
1.0/1.0 point (graded)
Why is Spark faster than MapReduce? Select all answers that apply.
A. It does in-memory caching
B. It replaces HDFS
C. It can accommodate data pipelines of arbitrary size without writing intermediate data to HDFS
D. It forces you to use SQL
Knowledge Check 4.11
1.0/1.0 point (graded)
To which part of Hive do clients make a JDBC connection? Select the correct answer.
A. Tez
B. Spark SQL
C. MapReduce
D. HiveServer2
E. MetaStore
Knowledge Check 4.12
1.0/1.0 point (graded)
Which execution engines does Hive support? Select all answers that apply.
A. MapReduce
B. Tez
C. HiveServer2
D. Spark
Knowledge Check 4.13
1.0/1.0 point (graded)
What is the primary difference between external and native tables in Hive? Select the correct answer.
A. External tables are kept in HDFS, and native tables are kept in-memory
B. When a native table is dropped, all of the files in HDFS are deleted, but, when an external table is dropped, files remain
C. External tables are slower to query from compared to native tables
Knowledge Check 4.14
1.0/1.0 point (graded)
A native table can reside anywhere in HDFS. True or False?
A. True
B. False
Knowledge Check 4.15
1.0/1.0 point (graded)
What is the supported format of files that comprise Hive tables? Select the correct answer.
A. CSV (Comma Separated Values)
B. ORC
C. Parquet
D. Avro
Key Points to Remember I
The main ideas we discussed in this chapter are summarized below:
A key to any kind of parallel data processing is to be able to subdivide your job into as many independent parts as possible, and assign work to different nodes in the cluster, all running simultaneously
The biggest issue to watch out for when considering performance gains from parallel data processing is how much of your overall job can be parallelized. This is captured by Amdahl’s Law, which gives an upper bound on the achievable speedup
Hadoop’s HDFS and YARN enable a multitude of different kinds of parallel processing frameworks, and the key ones are:
MapReduce: the original Hadoop parallel processing framework
Apache Spark: the in-memory improvement over MapReduce
Apache Hive: an SQL-driven parallel processing framework with Enterprise Data Warehouse capabilities.
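Amdahl’s Law mentioned above can be computed directly: if a fraction p of a job is parallelizable across n workers, the speedup is bounded by 1 / ((1 - p) + p/n). A short sketch (the example fractions are illustrative, not from the course):

```python
# Amdahl's Law: upper bound on speedup when a fraction p of a job
# can be parallelized across n workers.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# A job that is 90% parallelizable gains a lot from 10 workers...
print(round(amdahl_speedup(0.9, 10), 2))    # 5.26
# ...but even with a huge cluster it can never exceed 10x,
# because 10% of the work remains serial.
print(round(amdahl_speedup(0.9, 10**6), 2))
```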
Key Points to Remember II
The main ideas we discussed in this chapter are summarized below (continued):
MapReduce is the foundational framework for parallel data processing at scale, because of its ability to break a large problem into many smaller ones
MapReduce is built on two key processing primitives: Mappers and Reducers working with data as Key Value Pairs (KVP)
Mappers read data in the form of KVPs, and each call to a Mapper is for a single KVP; it can return 0..m KVPs
The framework shuffles and sorts the KVPs emitted by the Mappers, with the guarantee that only one Reducer will be asked to process a given Key’s data
Reducers are given a list of Values for a specific Key; they can return 0..m KVPs
Due to the fine-grained nature of the framework, many use cases are better suited for higher-order tools.
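As a minimal illustration of these primitives (a plain-Python sketch of the flow, not the actual Hadoop API), a word count can be expressed as a Mapper called once per input KVP, a Shuffle/Sort that groups values by key, and a Reducer called once per key:

```python
from collections import defaultdict

# Plain-Python sketch of the MapReduce flow; not the Hadoop Java API.

def mapper(key, line):
    # Called with a single KVP; may emit 0..m KVPs.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Called with one Key and the list of all its Values; may emit 0..m KVPs.
    yield (key, sum(values))

def run_job(records):
    # Map phase: one Mapper call per input KVP
    intermediate = [kv for k, v in records for kv in mapper(k, v)]
    # Shuffle/Sort: group values by key; each key goes to exactly one Reducer call
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase
    out = {}
    for k in sorted(groups):
        for rk, rv in reducer(k, groups[k]):
            out[rk] = rv
    return out

print(run_job([(0, "to be or not to be")]))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```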
Key Points to Remember III
The main ideas we discussed in this chapter are summarized below (continued):
MapReduce suffers from limitations stemming from its rigid paradigm: it forces complex computations to put results back into HDFS before the next step of the computation can begin
MapReduce is also very low-level in that its programming API does not provide primitives like filtering and aggregation directly to the user
Even though MapReduce is still part of Hadoop, Apache Spark was introduced as a better alternative
Spark houses data in an RDD structure and allows re-parallelization as needed
The “sweet spot” for Spark is iterative in-memory computations and interactive data modeling in Python, Scala, Java, and R languages
Spark is built around Spark Core, but it also provides data processing, ETL, machine learning, stream processing, and SQL querying
Spark Streaming uses micro-batches that are much like RDDs loaded from disk
Lately, Spark has embraced a more relational API around Datasets, which brings it much closer to a classical SQL-driven MPP parallel data processing framework on Hadoop.
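The lazy-evaluation point can be illustrated without a cluster. The sketch below is a plain-Python analogy, not the actual PySpark API: transformations like map and filter only record work, and nothing executes until an action such as collect is called.

```python
# Plain-Python analogy for Spark's lazy RDD evaluation; not the PySpark API.
class LazyDataset:
    def __init__(self, data):
        self._data = data
        self._ops = []          # transformations are only recorded here

    def map(self, fn):          # transformation: nothing is computed yet
        self._ops.append(("map", fn))
        return self

    def filter(self, fn):       # transformation: nothing is computed yet
        self._ops.append(("filter", fn))
        return self

    def collect(self):          # action: the whole pipeline runs only now
        items = iter(self._data)
        for op, fn in self._ops:
            items = map(fn, items) if op == "map" else filter(fn, items)
        return list(items)

rdd = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; only the action triggers evaluation.
print(rdd.collect())  # [0, 4, 16, 36, 64]
```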
Key Points to Remember IV
The main ideas we discussed in this chapter are summarized below (continued):
Apache Hive is an SQL-driven parallel processing framework that brings Enterprise Data Warehouse (EDW) capabilities to Hadoop
Hive uses the familiar table and SQL metaphors that are used with classic EDW and Relational Databases.
The SQL nature makes Hive ideally suited to interoperate with existing EDW tools and leverage existing skills in the enterprise
To maintain an appearance of a relational database, Hive adds two components to a Hadoop cluster:
The MetaStore maintains the logical view of tables, as well as the physical characteristics, such as where the data is stored and in what format it is in
The HiveServer2 component provides a familiar JDBC interface to existing tools and is tasked with receiving queries from clients, parsing the queries, coming up with an execution plan, and, finally, submitting it into the worker nodes for processing
Just like any relational database, Hive can create, populate, and query tables
Views are supported, but they are not materialized
Hive’s two main execution engines are MapReduce and Tez
The Hive community is working on making Spark available as yet another execution engine for Hive queries
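To make the native vs. external distinction concrete, a hedged HiveQL sketch (table names, columns, and the HDFS path are illustrative, not from the course):

```sql
-- Native (managed) table: Hive owns the files; DROP TABLE deletes them.
CREATE TABLE page_views (url STRING, hits INT)
STORED AS ORC;

-- External table: the data stays at the given HDFS location;
-- DROP TABLE removes only the MetaStore entry, the files remain.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw_logs';
```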