kurtzace / diary-2024


Notes on Processing Data On AWS #3


kurtzace commented 5 months ago

Lambda

example

import json

def lambda_handler(event, context):
    # Return an API Gateway-style JSON response
    return {
        "statusCode": 200,
        "headers": { "Content-Type": "application/json" },
        "body": json.dumps({ ... })
    }

Glue: Catalog, ETL

Serverless ETL

MapReduce: writes intermediate data to disk; slower, cheaper, batch-oriented.

Spark: in-memory, faster, more expensive, suits iterative algorithms. Use Livy to submit jobs to Spark via REST. Spark has MLlib: classification, recommendation, clustering. Mahout: classification, recommendation, clustering, distributed algorithms.
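
A minimal sketch of submitting a batch job through Livy's REST API from Python; the master hostname, the default Livy port 8998, and the S3 script path are assumptions:

import requests

LIVY_URL = "http://emr-master:8998"  # assumed Livy endpoint on the EMR master

# Submit a PySpark script as a Livy batch job (path is a placeholder)
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={"file": "s3://my-bucket/jobs/etl_job.py"},
    headers={"Content-Type": "application/json"},
)
batch = resp.json()

# Poll the batch until Livy reports a terminal state
state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()
print(state)  # e.g. {'id': 0, 'state': 'running'}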

MXNet & TensorFlow frameworks run on Hadoop.

Jupyter (established) and Zeppelin (multi-user) provide notebooks.

Hue: Hadoop web experience. Execute SQL, manage files.

Spark SQL (Avro, ORC, JDBC, JSON, Parquet) vs Presto (prefer Athena; running Presto yourself is complicated).
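
A short PySpark sketch of Spark SQL reading one of those formats and querying it; the S3 path, view name, and columns are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Spark SQL also reads Avro, ORC, JSON, and JDBC sources the same way
df = spark.read.parquet("s3://my-bucket/sales/")  # placeholder path
df.createOrReplaceTempView("sales")

spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()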

Hive: queries compile to MapReduce and HDFS operations. Metastore can be the AWS Glue Data Catalog. HQL, structured data.
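
A hedged sketch of querying HiveServer2 from Python with the PyHive library (an assumption); the host and table are placeholders, and 10000 is HiveServer2's default port:

from pyhive import hive

# Connect to HiveServer2 on the EMR master node (host is a placeholder)
cursor = hive.Connection(host="emr-master", port=10000, username="hadoop").cursor()

cursor.execute("SELECT * FROM sales.orders LIMIT 10")  # placeholder table
for row in cursor.fetchall():
    print(row)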

HCatalog: lets other tools connect to the Hive metastore.

Pig: data processing in a high-level language (Pig Latin) that compiles to MapReduce jobs. Semi-structured data.

HBase: key-value store, variable schema.
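
A minimal key-value put/get sketch against HBase via its Thrift server, using the happybase library (an assumption); host, table, and column family are placeholders:

import happybase

conn = happybase.Connection(host="hbase-master")  # Thrift server, default port 9090
table = conn.table("users")  # placeholder table

# Row key plus family:qualifier columns; everything is bytes
table.put(b"user#1", {b"info:name": b"Ada"})
print(table.row(b"user#1"))  # {b'info:name': b'Ada'}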

Phoenix: OLTP (SQL) over HBase, with JDBC.

Sqoop: moves OLTP (relational) data into Hadoop.

Oozie: workflow scheduler, DAG of actions.

Tez makes MapReduce faster. Tez UI for monitoring.

Flink: faster than Spark Streaming.

ZooKeeper: configuration info, tracks which nodes are alive.

Ganglia: stats on cluster


EMR

AWS EMR: managed Hadoop cluster for big datasets. Use S3 (s3DistCp copies between HDFS and S3).
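
A hedged boto3 sketch of running s3DistCp as an EMR step; the cluster id and paths are placeholders (command-runner.jar is the stock way to run commands as steps):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Copy HDFS output to S3 with s3-dist-cp, submitted as an EMR step
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "hdfs-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", "hdfs:///output",
                     "--dest", "s3://my-bucket/output"],
        },
    }],
)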

Decouple storage & compute: storage is Hadoop HDFS (128 MB blocks), the local file system, or EMRFS (an HDFS-style interface over S3).

YARN manages resources from the master node, which also hosts Ganglia/Zeppelin. The master manages core nodes (which hold HDFS).
Task nodes handle CPU-intensive work (a good fit for spot instances).

Transient (auto-terminates; ~15 min to initialize; data exploration, experiments) vs long-running (manual terminate; ML, always on).
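
A hedged boto3 sketch of launching a transient cluster; KeepJobFlowAliveWhenNoSteps=False is what makes it auto-terminate once its steps finish. The name, release label, and instance counts are placeholder assumptions:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="transient-etl",        # placeholder name
    ReleaseLabel="emr-6.10.0",   # placeholder release
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after steps
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)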

Cluster lifecycle: starting, bootstrapping, running, waiting, shutting down, then completed/terminated/failed; the exact path depends on long-running vs transient.

Instance types: batch (M4), ML (C4), ML with GPU (P3), large HDFS (D2), interactive analysis (X1, memory-optimized).

In EMR:

hive
create database sales;
use sales;

Kinesis -> [Spark Streaming + Flink + EMR]
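
A hedged sketch of the Kinesis -> Spark Streaming leg using the Spark 2.x KinesisUtils DStream API (requires the spark-streaming-kinesis-asl package; newer Spark versions dropped this API). App name, stream name, and region are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-demo")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

lines = KinesisUtils.createStream(
    ssc, "kinesis-demo-app", "my-stream",               # placeholders
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, 10)                 # checkpoint every 10 s

lines.pprint()   # each record arrives as a decoded string payload
ssc.start()
ssc.awaitTermination()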


AWS data pipeline

Available, notifications, managed (scheduling, error handling), drag-and-drop, AWS integration.

Components: data nodes (DynamoDB, Redshift, S3), activities (copy, EMR, Hive, Redshift copy, SQL, shell), others (schedules, resources, preconditions, actions).

Parameters: edit in the Architect (the drag-and-drop pipeline editor).

Data Pipeline handles S3 transfers, RDS export/import, RDS copies, DynamoDB, EMR steps, and on-premises resources.
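
A hedged boto3 sketch of creating and activating a pipeline; the one-object definition below (on-demand schedule, names, ids) is a placeholder, real pipelines add data nodes and activities:

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline shell
pid = dp.create_pipeline(name="demo-pipeline", uniqueId="demo-001")["pipelineId"]

# Minimal on-demand definition
dp.put_pipeline_definition(
    pipelineId=pid,
    pipelineObjects=[{
        "id": "Default",
        "name": "Default",
        "fields": [{"key": "scheduleType", "stringValue": "ondemand"}],
    }],
)

dp.activate_pipeline(pipelineId=pid)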

Alternatives: Step Functions, Simple Workflow (SWF), Oozie, Luigi (Python workflows).
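
A minimal Luigi sketch showing what a Python workflow looks like (two tasks linked into a DAG); the file names are placeholders:

import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.txt")  # placeholder output file

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data\n")

class Transform(luigi.Task):
    def requires(self):
        return Extract()  # dependency edge in the DAG

    def output(self):
        return luigi.LocalTarget("clean.txt")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)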

kurtzace commented 3 months ago

Jithin Jude Paul - Distributed Serverless Architectures on AWS

example way to decode Kinesis stream events in a Lambda (the record payload arrives base64-encoded):

import base64
payload = base64.b64decode(event['Records'][0]['kinesis']['data']).decode('utf-8')

typo in chapter 4 - AKB