jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

SOSP '13 | Discretized Streams: Fault-Tolerant Streaming Computation at Scale #229

Open jasperzhong opened 3 years ago

jasperzhong commented 3 years ago

https://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf

Spark Streaming

jasperzhong commented 3 years ago

See this blog post:

https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html

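The core idea is to batch the online input data into small time intervals and process each batch as a short batch job. That is why it is called a discretized stream (DStream).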
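A minimal sketch of the batching idea, in plain Python rather than the Spark Streaming API. The names `discretize` and `word_count`, the sample records, and the 1-second interval are all made up for illustration: timestamped records are grouped into fixed-length intervals, and an ordinary batch computation runs on each interval.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp, value) records into fixed-length time intervals.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    batches = defaultdict(list)
    for ts, value in records:
        # Records whose timestamps fall in the same interval share a batch.
        batches[int(ts // batch_interval)].append(value)
    # Return the batches in time order as (interval_index, values) pairs.
    return sorted(batches.items())

def word_count(values):
    """An ordinary batch computation applied to one interval's data."""
    counts = defaultdict(int)
    for word in values:
        counts[word] += 1
    return dict(counts)

# Made-up input stream: (arrival time in seconds, word).
records = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (2.5, "c")]

# The "stream" becomes a sequence of small batch jobs, one per interval.
results = [(idx, word_count(vals)) for idx, vals in discretize(records, 1.0)]
```

Each element of `results` is the output of one small batch job, so the streaming computation reduces to a series of deterministic batch computations.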

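The other advantages come from Spark itself: for example, a scheduler that can assign tasks dynamically (dynamic load balancing), and RDDs (fast failure recovery; if only a task is lost, it can even be recovered in parallel). DStream effectively unifies three cases: batch, streaming, and interactive analysis.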
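A toy sketch of why lineage-based recovery can run in parallel, assuming each output partition is a deterministic function of one input partition. This is plain Python, not Spark code; `lineage`, `input_partitions`, and the thread pool standing in for a cluster are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up replicated input data, split into partitions.
input_partitions = {0: [1, 2], 1: [3, 4], 2: [5, 6]}

def lineage(partition):
    # Deterministic transformation standing in for a chain of operators.
    return [x * x for x in partition]

# Normal execution: compute every partition once.
computed = {pid: lineage(part) for pid, part in input_partitions.items()}
expected = dict(computed)

# Simulate a failure that loses two of the computed partitions.
lost = [0, 2]
for pid in lost:
    del computed[pid]

# Each lost partition depends only on its own input, so the recomputation
# tasks are independent and can run on different workers at the same time.
with ThreadPoolExecutor(max_workers=len(lost)) as pool:
    recovered = pool.map(lambda pid: (pid, lineage(input_partitions[pid])), lost)
    computed.update(recovered)
```

Because the transformation is deterministic, the recomputed partitions are identical to the lost ones, and no single replacement node has to rebuild all of the state alone.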

jasperzhong commented 3 years ago

The introduction mentions the term _upstream backup_. The paper explains it as:

nodes buffer sent messages and replay them to a new copy of a failed node.

Isn't this just optimistic logging? The paper gives three citations for it; I'll read them carefully. One of them is #230 .
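A minimal sketch of the buffer-and-replay idea in the quote above, in plain Python. The classes `UpstreamNode` and `CountingOperator` are hypothetical and only illustrate the mechanism; they are not taken from the paper or from any of the cited systems.

```python
class UpstreamNode:
    """Sends messages downstream and keeps a buffer of everything it sent."""

    def __init__(self):
        self.sent_log = []  # buffered copies of sent messages

    def send(self, downstream, msg):
        self.sent_log.append(msg)
        downstream.process(msg)

    def replay_to(self, new_downstream):
        # Replay the buffered messages, one at a time and in order,
        # so the replacement node can rebuild the failed node's state.
        for msg in self.sent_log:
            new_downstream.process(msg)

class CountingOperator:
    """A stateful operator: its state is the running sum of its inputs."""

    def __init__(self):
        self.total = 0

    def process(self, msg):
        self.total += msg

upstream = UpstreamNode()
op = CountingOperator()
for m in [1, 2, 3]:
    upstream.send(op, m)

# Simulate a failure of `op`: start a fresh copy and replay the buffer into it.
replacement = CountingOperator()
upstream.replay_to(replacement)
```

In this sketch the replacement rebuilds its state by pushing every buffered message through a single operator, one after another.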

The paper also states its drawback:

upstream backup takes a long time to recover, as the whole system must wait for a new node to serially rebuild the failed node's state by rerunning data through an operator.

That is not always true, though: sometimes recovery can also be done in parallel.

I don't understand this sentence:

in upstream backup, a straggler must be treated as a failure, incurring a costly recovery step.

Why does using upstream backup mean that a straggler has to be treated as a failure?