
Notes on Processing Data On AWS #3

Open kurtzace opened 10 months ago

kurtzace commented 10 months ago

Lambda

example

import json

def lambda_handler(event, context):
    return {
        "statusCode": 200,
        "headers": { "Content-Type": "application/json" },
        "body": json.dumps({ ... })
    }
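
To call it once deployed, a hedged boto3 sketch (the function name is hypothetical):

import json
import boto3

client = boto3.client("lambda")
resp = client.invoke(
    FunctionName="my-handler",           # hypothetical function name
    Payload=json.dumps({"ping": True}),  # becomes the event argument
)
print(json.loads(resp["Payload"].read()))  # the dict returned above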

Glue: Catalog, ETL

Serverless ETL

MapReduce: intermediate data on disk / slower / cheaper / batch

Spark: in-memory, faster, more expensive, iterative algorithms. Livy to submit jobs to Spark via REST. Spark has MLlib: classification, recommendation, clustering. Mahout: classification, recommendation, clustering, distributed algorithms.
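
A minimal PySpark MLlib sketch of the clustering case (bucket path and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
df = spark.read.parquet("s3://my-bucket/features/")  # hypothetical input
vec = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=3, seed=42).fit(vec)  # cluster into 3 groups
model.transform(vec).select("features", "prediction").show()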

MXNet & TensorFlow frameworks on Hadoop

Jupyter (established) and Zeppelin (multi-user) provide notebooks.

Hue: Hadoop web experience. Execute SQL, manage files.

Spark SQL (Avro, ORC, JDBC, JSON, Parquet) vs Presto (prefer Athena; running Presto yourself is complicated)
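
Sketch of Spark SQL over one of those formats (Parquet here; the path and columns are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()
df = spark.read.format("parquet").load("s3://my-bucket/sales/")  # avro/orc/json/jdbc work the same way
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()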

Hive: queries compile to MapReduce and HDFS operations. Metastore can be the AWS Glue Data Catalog. HQL. Structured data.

HCatalog: connect to the Hive metastore

Pig: data processing in a high-level language that maps to MapReduce jobs. Semi-structured data.

HBase: key-value store, variable schema

Phoenix: OLTP over HBase, with JDBC

OLTP to Hadoop: Sqoop

Oozie: workflow scheduler, DAG of actions

Tez makes MapReduce faster. Tez UI for monitoring.

Flink: stream processing, faster than Spark Streaming

ZooKeeper: config info, tracks which nodes are alive

Ganglia: stats on cluster


EMR

AWS EMR: managed Hadoop cluster for big datasets. Use S3 (S3DistCp to copy between HDFS and S3).
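
S3DistCp can be submitted as an EMR step; a boto3 sketch (cluster ID and paths are placeholders):

import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "hdfs-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs the command on the master
            "Args": ["s3-dist-cp", "--src", "hdfs:///output", "--dest", "s3://my-bucket/output"],
        },
    }],
)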

Decouple storage (Hadoop 128 MB blocks, local file system, EMRFS exposes S3 through an HDFS-compatible interface) & compute

YARN manages resources on the master node, which also hosts Ganglia/Zeppelin. The master manages core nodes (HDFS).
Task nodes for CPU-intensive work (better with spot instances).

Transient (auto-terminates, ~15 min to initialize, data exploration, experiments) vs long-running (manual terminate, ML, always on)
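
A transient cluster corresponds to auto-termination in boto3's run_job_flow; a sketch with assumed names and sizes:

import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="transient-etl",       # hypothetical name
    ReleaseLabel="emr-6.15.0",  # example release
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when steps finish => transient
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)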

Cluster states: starting, bootstrapping, running, waiting, shutting down, then failed/completed/terminated, depending on long-running vs transient

Instance types: batch (M4), ML (C4), ML with GPU (P3), large HDFS (D2), interactive analysis (X1, memory-optimized)

In EMR:

hive
create database sales;
use sales;

Kinesis -> [Spark streaming + Flink + EMR]
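
On the ingest side, a minimal boto3 producer sketch for such a Kinesis stream (stream name and payload are hypothetical):

import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream
    Data=json.dumps({"user": "u1", "event": "click"}),
    PartitionKey="u1",         # controls shard routing
)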


AWS Data Pipeline

Available, notifications, management (scheduling, error handling), drag-and-drop, AWS integration.

Components: data nodes (DynamoDB, Redshift, S3), activities (copy, EMR, Hive, Redshift copy, SQL, shell), others (schedules, resources, preconditions, actions)

Parameters

Edit in Architect (the drag-and-drop editor).

Data Pipeline use cases: S3 transfer, RDS export/import, RDS copy, DynamoDB, EMR steps, on-premises resources.
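
Pipelines can also be driven from boto3; a sketch (names are placeholders, the definition step is elided):

import boto3

dp = boto3.client("datapipeline")
created = dp.create_pipeline(name="s3-transfer", uniqueId="s3-transfer-001")  # placeholders
pipeline_id = created["pipelineId"]
# put_pipeline_definition(...) would add data nodes and activities here
dp.activate_pipeline(pipelineId=pipeline_id)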

Alternatives: Step Functions, Simple Workflow Service, Oozie, Luigi (Python workflows)

kurtzace commented 8 months ago

Jithin Jude Paul - Distributed Serverless Architectures on AWS

example way to decode Kinesis events in Lambda

payload = base64.b64decode(event['Records'][0]['kinesis']['data']).decode('utf-8')
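
In context, a minimal handler sketch assuming the standard Kinesis event shape:

import base64

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        print(payload)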

typo in chapter 4 - AKB