AWS AppSync is a fully managed service for building GraphQL APIs; it handles the scaling and connection management of the API for you.
Immutable architecture - components are replaced rather than patched in place. For instance, if a container in a distributed system goes down, a new container with the same configuration is immediately spun up.
The recovery point objective (RPO) is the maximum acceptable amount of data loss, measured as the time between the last recovery point (backup) and the failure.
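The RPO definition can be made concrete with a small sketch (the helper names here are illustrative, not an AWS API): given the time of the last recovery point and the time of failure, the data-loss window must not exceed the RPO.

```python
from datetime import datetime, timedelta

def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """The window of data lost if we restore from last_backup."""
    return failure - last_backup

def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if restoring from last_backup loses no more data than the RPO allows."""
    return data_loss_window(last_backup, failure) <= rpo

# Hourly backups give an RPO of at most 1 hour:
last = datetime(2024, 1, 1, 12, 0)
fail = datetime(2024, 1, 1, 12, 40)
print(meets_rpo(last, fail, timedelta(hours=1)))  # 40 min of loss, within a 1 h RPO
```

In practice the backup frequency is what you tune: back up at least as often as your RPO to stay inside it.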
SNS - delivers messages directly to applications (A2A) or to people (A2P).
Note on email bounces: if your account's bounce rate exceeds 10%, AWS (SES) may temporarily pause your account's ability to send email - so subscribe an SNS topic to bounce notifications and act on them.
Lambda: serverless, event-driven compute - run code without provisioning servers.
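A minimal Python Lambda handler sketch (the event shape and names are illustrative, not from the notes):

```python
import json

def lambda_handler(event, context):
    """Minimal AWS Lambda handler: echo the caller's name from the event.

    `event` is the JSON payload the trigger delivers; `context` carries
    runtime metadata (request id, remaining time) and is unused here.
    """
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally, the handler is just a function call:
print(lambda_handler({"name": "emr"}, None))
```

In AWS you would point the function's handler setting at `lambda_handler` and attach a trigger (API Gateway, S3 event, etc.).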
Glue: Data Catalog + serverless ETL.
MapReduce: writes intermediate data to disk / slower / cheaper / batch processing.
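The map → shuffle → reduce flow can be sketched in plain Python (a toy word count, not real Hadoop code): the map phase emits (word, 1) pairs, the shuffle groups them by key, and the reduce phase sums each group. In Hadoop the intermediate pairs are materialized to disk between phases, which is the round-trip that makes MapReduce slower than in-memory engines.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) for every word - Hadoop would write these pairs to disk."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum each group's values to get the final count per word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big clusters"])))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```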
Spark: in-memory, faster, more expensive, suited to iterative algorithms. Livy submits jobs to Spark via REST. Spark has MLlib: classification, recommendation, clustering. Mahout: classification, recommendation, clustering, distributed algorithms.
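Livy accepts batch jobs via `POST /batches`; a hedged sketch of building that request body (the endpoint URL and S3 paths below are made-up placeholders):

```python
def livy_batch_payload(file, class_name=None, args=None):
    """Build the JSON body for Livy's POST /batches endpoint.

    `file` is a jar or .py file on a path the cluster can read (e.g. HDFS/S3);
    `className` and `args` are optional fields of the Livy batch API.
    """
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = args
    return payload

payload = livy_batch_payload("s3://my-bucket/jobs/wordcount.py",
                             args=["s3://my-bucket/input/"])
# Submitting requires a live Livy endpoint, so it is only shown as a comment:
# requests.post("http://emr-master:8998/batches", json=payload)
print(payload)
```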
MXNet & TensorFlow: deep-learning frameworks that can run on Hadoop.
Jupyter (the established option) and Zeppelin (multi-user) both provide notebooks.
Hue (Hadoop User Experience): web UI for Hadoop - execute SQL, manage files.
Spark SQL: reads Avro, ORC, JDBC, JSON, Parquet. Presto: the engine behind Athena - prefer Athena, as running Presto yourself is complicated.
Hive: compiles SQL-like queries (HQL) into MapReduce jobs and HDFS operations. Its metastore can be backed by the AWS Glue Data Catalog. Best for structured data.
HCatalog: lets other tools connect to the Hive metastore.
Pig: data processing in a high-level language (Pig Latin) that is compiled down to MapReduce jobs. Handles semi-structured data.
HBase: key-value store with a flexible schema.
Phoenix: OLTP-style SQL over HBase, with JDBC support.
OLTP to Hadoop: Sqoop (bulk import/export between relational databases and HDFS).
Oozie: workflow scheduler; a workflow is a DAG of actions.
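The DAG-of-actions idea can be sketched with a topological sort in Python (Oozie itself is configured in XML; the action names below are made up for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical workflow: ingest, then two parallel transforms, then a report.
# Each key maps an action to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"ingest"},
    "report": {"clean", "aggregate"},
}

# A valid execution order respects every dependency edge:
order = list(TopologicalSorter(workflow).static_order())
print(order)  # e.g. ['ingest', 'clean', 'aggregate', 'report']
```

A scheduler like Oozie does the same ordering, plus triggering, retries, and time/data-availability conditions.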
Tez: DAG execution engine that speeds up MapReduce-style jobs; comes with a Tez UI.
Flink: stream processing with lower latency than Spark Streaming (true streaming rather than micro-batches).
ZooKeeper: holds configuration info and tracks which nodes are alive.
Ganglia: stats on cluster
EMR
AWS EMR: managed Hadoop cluster for big datasets. Use S3 for durable storage (s3DistCp copies between HDFS and S3).
Decouple storage and compute: HDFS stores 128 MB blocks on local disks; EMRFS exposes S3 as an HDFS-compatible file system.
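The 128 MB block size makes a file's block count (and hence its task parallelism) easy to estimate; a small sketch:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB

def hdfs_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file of this size occupies (minimum one)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 1 GB file spans 8 blocks, so a MapReduce job gets roughly 8 input splits:
print(hdfs_blocks(1024 * 1024 * 1024))  # 8
```

This is also why many small files hurt Hadoop: each occupies at least one block/split, inflating task overhead.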
YARN manages resources from the master node, which also hosts tools such as Ganglia and Zeppelin. The master node manages the core nodes (which run HDFS).
Task nodes add compute only (no HDFS) - good for CPU-intensive work, and a good fit for spot instances since losing one loses no data.
Transient clusters (auto-terminate after their steps; ~15 min to initialize; data exploration, experiments) vs long-running (manual terminate; ML workloads; always on).
Cluster lifecycle: starting → bootstrapping → running → waiting → running → shutting down → failed / completed / terminated; the exact path depends on long-running vs transient.
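The lifecycle can be modelled as a tiny state machine (a simplified illustration based on the states listed in these notes, not the full EMR state model):

```python
# Allowed transitions between cluster states (simplified).
TRANSITIONS = {
    "STARTING": {"BOOTSTRAPPING"},
    "BOOTSTRAPPING": {"RUNNING"},
    "RUNNING": {"WAITING", "TERMINATING"},
    "WAITING": {"RUNNING", "TERMINATING"},  # long-running clusters idle in WAITING
    "TERMINATING": {"TERMINATED", "TERMINATED_WITH_ERRORS"},
}

def is_valid_path(states):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(states, states[1:]))

# A transient cluster runs its steps once and auto-terminates:
print(is_valid_path(["STARTING", "BOOTSTRAPPING", "RUNNING",
                     "TERMINATING", "TERMINATED"]))  # True
```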
Instance types: batch (m4 general purpose), ML/compute (c4), deep learning (p3, GPU), large HDFS (d2, dense storage), interactive analysis (x1, high memory).
In EMR: Kinesis -> [Spark Streaming / Flink on EMR]
AWS data pipeline
Highly available; notifications; managed scheduling and error handling; drag-and-drop UI; AWS integration.
Components: data nodes (DynamoDB, Redshift, S3), activities (Copy, EMR, Hive, RedshiftCopy, SQL, Shell), others (schedules, resources, preconditions, actions).
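A pipeline definition is a list of objects, each with an `id`, a `name`, and `fields` key/value pairs (the shape `PutPipelineDefinition` accepts); a hedged sketch of building one such object, with placeholder names and paths:

```python
def pipeline_object(obj_id, name, **fields):
    """Build one Data Pipeline object in the shape PutPipelineDefinition expects:
    {'id': ..., 'name': ..., 'fields': [{'key': k, 'stringValue': v}, ...]}.
    """
    return {
        "id": obj_id,
        "name": name,
        "fields": [{"key": k, "stringValue": v} for k, v in fields.items()],
    }

# A hypothetical S3 data node that could feed a copy activity:
s3_node = pipeline_object("S3Input", "S3Input",
                          type="S3DataNode",
                          directoryPath="s3://my-bucket/input/")
print(s3_node)
```

With boto3 this list of objects would go into `datapipeline.put_pipeline_definition(...)`; that call needs live credentials, so it is not shown here.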
Parameters: editable in the Architect (console) view.
Data Pipeline use cases: S3 transfers, RDS export/import, RDS copy, DynamoDB, EMR steps, on-premises resources.
Alternatives: Step Functions, Simple Workflow Service (SWF), Oozie, Luigi (Python workflows).