jasperzhong / cs-notes

CS knowledge framework

Cambridge 20/21 | Concurrent and Distributed Systems (partIB) #1

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB

notes: https://www.cl.cam.ac.uk/teaching/2021/ConcDisSys/dist-sys-notes.pdf

The lecturer is Martin Kleppmann, author of DDIA. Fantastic.

Also, the British accent is really pleasant to listen to.

jasperzhong commented 3 years ago

Lecture 1

Mostly a review of networking basics.

RPC: (image) The serialization step in the middle is called marshal/unmarshal, and the component that performs it is the stub.

REST is in fact also a form of RPC; the messages travel as JSON.

Microservices generally communicate with each other via RPC.

Interface definition language (IDL): a language-independent API specification, e.g. protobuf.
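As a rough illustration of the stub's marshal/unmarshal role, here is a toy sketch in Python; the JSON wire format and the local `handlers` dispatch table are stand-ins for a real transport and remote server:

```python
import json

def marshal(method, args):
    # Client-side stub: serialize the call into a wire format (JSON here).
    return json.dumps({"method": method, "args": args}).encode()

def unmarshal(payload):
    # Server-side stub: deserialize the request back into method + args.
    msg = json.loads(payload.decode())
    return msg["method"], msg["args"]

# A toy "server" dispatch table standing in for the remote service.
handlers = {"add": lambda a, b: a + b}

def rpc_call(method, *args):
    payload = marshal(method, list(args))  # the request crosses the network here
    name, got = unmarshal(payload)
    return handlers[name](*got)

print(rpc_call("add", 2, 3))  # 5
```

The application code just calls `rpc_call("add", 2, 3)` as if it were a local function; that illusion is exactly what the stub provides.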

jasperzhong commented 3 years ago

Lecture 2: Models of distributed systems

The Two Generals Problem

Because the channel is unreliable, the two parties cannot reach agreement.

For example, A sends a message to B; there are two possible cases:

  1. B received it, but the ack was lost.
  2. B never received it.

From A's point of view the two cases look identical: no response either way. But for B they are two different situations.

A has two options:

  1. Always attack: send the same message many times. If any one of them gets through, it succeeds, but it can still fail.
  2. Attack only after receiving an ack: but then B is in the same position A was in, since B does not know whether A received the ack. This recurses forever (a chain of mutual suspicion).

The problem is provably unsolvable.

TCP works like option 1: if no ack is received, it retransmits, until a timeout fires.
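The retransmit-until-ack behaviour can be sketched as follows; the loss rate and retry limit are made-up parameters for illustration, not anything TCP specifies:

```python
import random

random.seed(1)  # deterministic for the example

def unreliable_send(msg, loss_rate=0.5):
    # Stand-in for an unreliable channel: returns True iff an ack came back.
    return random.random() > loss_rate

def send_with_retries(msg, max_retries=10):
    # Option 1 / TCP-style: keep retransmitting until acked or we time out.
    for attempt in range(1, max_retries + 1):
        if unreliable_send(msg):
            return attempt  # number of transmissions it took
    raise TimeoutError("no ack after %d attempts" % max_retries)

attempts = send_with_retries("attack at dawn")
print(attempts)
```

Note that even with retries, success is only probabilistic; that is exactly the residual failure mode of option 1.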

The Byzantine Generals Problem

A more general problem: some nodes may behave maliciously, e.g. defect or lie.

Theorem: 3f + 1 nodes are needed to tolerate f Byzantine faults (e.g. 4 nodes to tolerate 1 traitor).

System models

This summary is spot-on.

In real systems, both nodes and the network can fail.

Three aspects to consider:

network behaviour

Network partition: some of the links are down.

Transitions between these assumptions:

node behaviour

synchrony (timing) assumptions

The notions of "synchronous" and "asynchronous" here differ from how those terms are used elsewhere in computing.

Why can the network exhibit very high latency?

Why might a node occasionally pause execution?

Indeed.

Choosing a system model appropriate to the scenario matters. For an on-prem cluster, for example, there is no need to assume an arbitrary network or Byzantine faults.

Fault Tolerance

Fault and failure mean different things: a fault is a component deviating from its spec, while a failure is the system as a whole being unable to provide its service.

Fault tolerance: the system keeps working even after faults occur.

Failure detector: decides whether a node has failed, typically by checking whether heartbeats time out.

I recall Redis has a nice design here: a majority vote decides whether a node is down. https://medium.com/kokster/fault-tolerant-redis-architecture-35334ef5de0b

jasonliu747 commented 3 years ago

Props to Zhong :thumbsup:

jasperzhong commented 3 years ago

Lecture 3: Time, clocks and ordering of events

Physical time

GPS is now the usual physical time source: the satellites carry atomic clocks and broadcast the time, which is why datacenters often have antennas on the roof.

Leap second: the legendary "+1s" and "-1s", inserted at the end of 30 June or 31 December. Software nowadays generally ignores it, but sometimes it really does make a difference...

(image)

NTP: Network Time Protocol, for synchronizing physical time. Nodes are either NTP clients or NTP servers. How does the synchronization work? (image)

There is a programming pitfall hidden here...

(image)

I genuinely did not know about this pitfall before... cppreference has this to say about std::chrono::system_clock:

The system_clock's time value can be internally adjusted at any time by the operating system, for example due to NTP synchronization or the user changing the system's clock. Daylight Savings Time and time zone changes, however, do not affect it since it is based on the UTC time-zone.

So from now on, intervals should be measured with std::chrono::steady_clock...

Class std::chrono::steady_clock represents a monotonic clock. The time points of this clock cannot decrease as physical time moves forward and the time between ticks of this clock is constant. This clock is not related to wall clock time (for example, it can be time since last reboot), and is most suitable for measuring intervals.

And do not use std::chrono::high_resolution_clock: it is merely an alias for one of the two clocks above, and which one differs between implementations. So avoid it.
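The same wall-clock vs monotonic distinction exists in other languages; a small Python sketch of why intervals should use the monotonic clock:

```python
import time

# time.time() is wall-clock time: the OS (e.g. via NTP) may step it at any
# moment, so differences between two readings can be wrong or even negative.
# time.monotonic() never goes backwards, so it is safe for measuring intervals.
start = time.monotonic()
time.sleep(0.05)               # the work being measured
elapsed = time.monotonic() - start

print(f"elapsed: {elapsed:.3f}s")  # roughly 0.05s, regardless of NTP steps
```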

Causality and happens-before

The happens-before relation is a partial order: it is possible that neither a -> b nor b -> a holds, in which case a and b are concurrent, written a || b.

Causality (cause and effect):

Happens-before captures potential causality.

Causal order is a partial order.

jasperzhong commented 3 years ago

Lecture 4: Broadcast protocols and logical time

logical time

A logical clock is really just a counter, designed to capture causal dependencies, i.e. a -> b => T(a) < T(b).

Lamport clocks

(image)

(L(e), N(e)) uniquely identifies event e, where N(e) is an ordering on node names. This defines a total order.
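A minimal Lamport clock sketch; the node names and the two-event scenario are made up for illustration:

```python
class LamportClock:
    # Lamport clock: one counter per node; tick on local events and sends,
    # and on receive take max(local, received) + 1.
    def __init__(self, node_name):
        self.node = node_name
        self.t = 0

    def tick(self):          # local event or send
        self.t += 1
        return self.t

    def recv(self, t_msg):   # merge the sender's timestamp on receive
        self.t = max(self.t, t_msg) + 1
        return self.t

a, b = LamportClock("A"), LamportClock("B")
ta = a.tick()                # A sends a message with L = 1
tb = b.recv(ta)              # B receives it and advances to L = 2
print((ta, "A"), (tb, "B"))  # (L(e), N(e)) pairs, comparable as a total order
```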

Drawbacks:

Vector clocks

Store the counters in a vector rather than a scalar.

(image)

Use the generalized (element-wise) inequality to determine the order.

properties of this order:

This is neat.
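A two-node vector clock sketch of the generalized inequality (the scenario is made up):

```python
# Each node increments its own entry on local events, and on receive merges
# element-wise maxima before incrementing its own entry.
def vc_leq(u, v):
    # Generalized inequality: u <= v iff every component of u is <= v's.
    return all(x <= y for x, y in zip(u, v))

def vc_merge(u, v):
    return [max(x, y) for x, y in zip(u, v)]

va = [0, 0]
va[0] += 1                          # event at A: va = [1, 0]
vb = vc_merge([0, 0], va)
vb[1] += 1                          # B receives A's message: vb = [1, 1]

print(vc_leq(va, vb))               # True: A's event happened-before B's
# Neither direction holds for [1, 0] and [0, 1]: the events are concurrent.
print(vc_leq([1, 0], [0, 1]) or vc_leq([0, 1], [1, 0]))  # False
```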

Broadcast protocols

Everything so far has been point-to-point communication; now consider group communication.

Broadcast acts as a middleware layer: applications above, the network below. (image)

Reliable broadcast, classified by ordering guarantees:

  1. FIFO broadcast: messages sent by the same node must be delivered in the order they were sent. (note: messages sent by different nodes can be delivered in any order)
  2. Causal broadcast: causally related messages must be delivered in causal order. (note: concurrent messages can be delivered in any order)
  3. Total order broadcast: all nodes must deliver messages in the same order. (note: concurrent messages need some agreed order; any order is fine as long as every node delivers them in the same one)

Relationships between them: (image)

Somewhat curious which of these kinds of broadcast middleware like NCCL or OpenMPI implements...

Broadcast algorithms

Split into two problems:

  1. How to implement reliable broadcast
  2. How to implement the various orders

Some approaches to problem 1:

Some approaches to problem 2 (all of them need a buffer):

  1. FIFO broadcast:

(image)

Quite simple: delivery must follow the sequence numbers, counting up from 0.

  2. Causal broadcast algorithm: essentially uses vector clocks to enforce causality; the method is similar.
  3. Total order broadcast: trickier, because the nodes need to agree on what the order actually is.
    • Single-leader approach: the leader's order is authoritative. A node that wants to send a message sends it to the leader, and the leader does the broadcast. The problem is leader crashes, etc.
    • Lamport clocks approach: use the total order formed by (L(e), N(e)).

Neither of these approaches achieves fault tolerance.
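The FIFO delivery rule above (buffer out-of-order messages, then deliver each sender's messages in sequence-number order) can be sketched as:

```python
from collections import defaultdict

class FifoBroadcastReceiver:
    # FIFO broadcast delivery: hold back messages that arrive early and
    # deliver each sender's messages strictly in sequence-number order.
    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected seq per sender
        self.buffer = defaultdict(dict)    # sender -> {seq: msg}
        self.delivered = []

    def on_receive(self, sender, seq, msg):
        self.buffer[sender][seq] = msg
        # Deliver as many in-order messages from this sender as possible.
        while self.next_seq[sender] in self.buffer[sender]:
            n = self.next_seq[sender]
            self.delivered.append(self.buffer[sender].pop(n))
            self.next_seq[sender] += 1

r = FifoBroadcastReceiver()
r.on_receive("A", 1, "m1")   # arrives early: buffered, not delivered
r.on_receive("A", 0, "m0")   # fills the gap: both delivered, in order
print(r.delivered)           # ['m0', 'm1']
```

The causal broadcast algorithm is the same shape, except the hold-back test compares vector clocks instead of a per-sender counter.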

jasperzhong commented 3 years ago

Replication

Finally the fault-tolerance topic begins. Replication is probably the most widely used technique.

Benefits:

The biggest problem: consistency.

The lecture gives a "like button" example. Very interesting.

Idempotence: f(x) = f(f(x)).

For instance, the inconsistency below can occur, and the cases cannot even be told apart: from the server's perspective the results look the same. (image)

Solution: use timestamps (they must be logical time). In the figure below, false does not mean truly removed but temporarily invisible, called a tombstone (because another node may still deliver a newer true message). Tombstones can be garbage-collected later. This makes it possible to tell which message is newer. (image)

Replicas then sync periodically, check for inconsistencies, and resolve them in favour of the latest write.
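A sketch of such a timestamped register with tombstones (the class name and the like/unlike values are made up; timestamps are assumed to come from a logical clock):

```python
class LWWRegister:
    # Last-writer-wins register: keep (timestamp, value); an update applies
    # only if its timestamp is newer. Deletes store a tombstone instead of
    # removing state, so later syncs can still compare timestamps.
    def __init__(self):
        self.t = 0
        self.value = None
        self.tombstone = False

    def update(self, t, value, tombstone=False):
        if t > self.t:                     # ignore stale updates
            self.t, self.value, self.tombstone = t, value, tombstone

    def merge(self, other):
        # Periodic anti-entropy sync between replicas: latest write wins.
        self.update(other.t, other.value, other.tombstone)

a, b = LWWRegister(), LWWRegister()
a.update(1, "liked")
b.update(2, None, tombstone=True)   # a newer delete on another replica
a.merge(b)
print(a.tombstone)  # True: the delete wins because its timestamp is newer
```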

Either Lamport or vector clocks will do as the logical clock here:

Quorums

Importance of fault tolerance: achieving high availability.

Problem: read-after-write consistency.

(image)

Solution: quorums, e.g. 2 out of 3, or 3 out of 4.

It is really the pigeonhole principle: at least one of the replies to a read must come from a replica that saw the earlier write, and the timestamps identify which value is the latest.
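The overlap argument can be sketched with n = 3 replicas and quorums of 2 (the replica layout and value format are made up):

```python
# Values are (timestamp, value) pairs. A write quorum of 2 and a read quorum
# of 2 out of n = 3 must overlap in at least one replica (pigeonhole), so
# every read sees at least one copy of the latest write.
n = 3
replicas = [(0, None)] * n

def quorum_write(t, value, reachable):
    global replicas
    assert len(reachable) >= 2, "need a write quorum of 2 out of 3"
    replicas = [(t, value) if i in reachable else replicas[i] for i in range(n)]

def quorum_read(reachable):
    assert len(reachable) >= 2, "need a read quorum of 2 out of 3"
    # The reply with the highest timestamp is the latest value.
    return max(replicas[i] for i in reachable)

quorum_write(1, "x=42", reachable={0, 1})   # replica 2 missed the write
print(quorum_read(reachable={1, 2}))        # (1, 'x=42'): overlap at replica 1
```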

State machine replication (SMR)

Consider using a broadcast protocol to implement replication.

total order broadcast: every node delivers the same messages in the same order.

State machine replication (SMR): use (FIFO-)total order broadcast to propagate updates to all replicas. Each replica is a state machine; they start in the same state, and since every replica delivers the same updates in the same order, they are guaranteed to end up in the same final state.
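The SMR invariant can be sketched as follows (the update format and the dictionary state are made up; what matters is that updates are deterministic and applied in the same order everywhere):

```python
# Two replicas start in the same state and apply the same deterministic
# updates in the same totally ordered sequence, so they end in the same state.
def apply_update(state, update):
    op, key, val = update
    new = dict(state)          # replicas are pure state machines
    if op == "set":
        new[key] = val
    return new

total_order = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

replica_a = replica_b = {}
for u in total_order:          # every replica delivers the same order
    replica_a = apply_update(replica_a, u)
    replica_b = apply_update(replica_b, u)

print(replica_a == replica_b)  # True
print(replica_a)               # {'x': 3, 'y': 2}
```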

limitations:

Can a weaker broadcast, such as causal broadcast, be used instead? It depends: yes, if the replica state updates are commutative, as the table shows:

(image)

jasperzhong commented 3 years ago

Consensus

The most important chapter. The main thing to master is Raft.

First, the problem: how do we make total order broadcast fault-tolerant? For example, in a leader-based implementation, what happens if the leader crashes or becomes unavailable?

Manual failover means human intervention, and humans are too slow: recovery takes a long time. We need to automatically choose a new leader, and that is consensus.

Consensus: several nodes need to agree on a single value. In the context of total order broadcast, that value is the next message to deliver.

Common consensus algorithms:

The focus here is Raft.

Raft

Consensus system models: Paxos and Raft assume a partially synchronous, crash-recovery system model.

Why not an asynchronous model? (The FLP result: deterministic consensus is impossible in a fully asynchronous model if even one node may crash.)

P.S. Consensus algorithms for the partially synchronous Byzantine system model: blockchains...

Leader election

Raft uses a leader to sequence messages.

Each term has at most 1 leader.

How do we guarantee there is only one leader? Example: the leader becomes unavailable due to a network problem, and the other nodes elect a new leader in term t+1, but the old leader does not know a new one has been elected.

So before making each decision (delivering a message), the leader asks its followers: am I still your leader? The decision is valid only if a quorum of them confirms it.
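That quorum check boils down to a strict-majority test (a simplification: real Raft piggybacks the confirmation on AppendEntries heartbeats rather than asking explicitly):

```python
# The leader counts itself plus the follower acks for the current term;
# the decision stands only with a strict majority of the cluster.
def can_commit(acks_for_term, cluster_size):
    votes = acks_for_term + 1          # the leader counts itself
    return votes * 2 > cluster_size    # strict majority

print(can_commit(acks_for_term=2, cluster_size=5))  # True: 3 of 5
print(can_commit(acks_for_term=1, cluster_size=5))  # False: 2 of 5
```

Requiring a strict majority is what prevents a deposed leader in an old term from committing: two majorities must overlap, and the overlapping node knows about the newer term.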

Node state transitions in Raft

A node's status is itself a state machine. (image)

Implementation

This lecture is long; just watch the video walkthrough...

https://www.youtube.com/watch?v=IPnesACYRck

jasperzhong commented 3 years ago

Replica consistency

First, a definition of consistency.

Two-phase commit

First, look at atomicity in ACID transactions.

This is not quite the same as consensus; the comparison is as follows: (image)

Atomicity in ACID is generally implemented with two-phase commit (2PC). (image)

What if the coordinator crashes?

Problem: if the coordinator crashes after prepare but before broadcasting the decision, the replicas block until the coordinator recovers.

Solution: implement fault-tolerant two-phase commit using total order broadcast.
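A sketch of the basic (non-fault-tolerant) 2PC decision rule, with in-process callables standing in for the participants' prepare votes:

```python
# Two-phase commit: the coordinator collects prepare votes, and commits
# only if every participant voted yes; a single "no" (or silence treated
# as "no") aborts the whole transaction.
def two_phase_commit(participants):
    # Phase 1: prepare — ask every participant whether it can commit.
    votes = [p() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: the coordinator broadcasts the decision to everyone.
    return decision

print(two_phase_commit([lambda: True, lambda: True]))   # commit
print(two_phase_commit([lambda: True, lambda: False]))  # abort
```

The blocking problem in the text lives between the two phases: once a participant has voted yes, it must hold its locks until it learns the decision, and in this basic form only the coordinator can tell it.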

Linearizability

Equivalent to strong consistency. Not the same thing as serializability.

Linearizability: all operations behave as if executed on a single copy of the data. (focus on client-observable behaviour, ignore how the replication system is implemented internally)

Informally: every operation takes effect atomically sometime after it started and before it finished.

Roughly: for read-after-write, the read must return the latest value; if a read overlaps a write, the read may return either the old or the new value.

(image)

(image)

Also, a quorum alone does not guarantee linearizability. To guarantee it, a client that receives inconsistent replies must write the newer value back to the replicas still holding the old one.

(image)

Linearizability can also be implemented with total order broadcast + CAS. (image)

Linearizability advantages:

Downsides:

Eventual consistency

CAP theorem

Eventual consistency: replicas handle operations using only their local state.

strong eventual consistency:

Benefits:

Summary

| Problem | Must wait for communication | Requires synchrony |
| --- | --- | --- |
| atomic commit | all participating nodes | partially synchronous |
| consensus, total order broadcast, linearizable CAS | quorum | partially synchronous |
| linearizable get/set | quorum | asynchronous |
| eventual consistency, causal broadcast, FIFO broadcast | local replica only | asynchronous |
jasperzhong commented 3 years ago

Concurrency control in applications

collaboration software

E.g. Google Docs: every user effectively has a local replica, which they can update at any time (even offline) and sync with the others when the network is available.

Algorithms:

Not very interested in this; I skipped the algorithm details.

Google Spanner

The culmination of all the techniques above.

a data system with millions of nodes, petabytes of data, distributed across datacenters worldwide.

consistency properties:

Techniques used:

An interesting point: read-only transactions require no locks.

MVCC: multi-version concurrency control. (image)

Commit timestamps must respect causality, i.e. T1 -> T2 => t1 < t2. Normally logical clocks would solve this, but linearizability depends on real-time order, which logical time cannot capture (the two replicas might never communicate at all, while logical clocks rely on attaching timestamps to messages).

So they use TrueTime... The exact physical time is unknowable, so TrueTime instead returns an interval [T_earliest, T_latest] that contains the true physical timestamp. The width of the interval is the uncertainty. On every commit, the transaction waits out the uncertainty, which guarantees the real-time dependency.

TrueTime is implemented by putting atomic clocks or GPS receivers in every datacenter... time is synchronized every 30s.
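The commit-wait rule can be sketched as follows (`EPSILON` and the helper names are made up for illustration; real TrueTime derives its uncertainty from measured clock-synchronization error):

```python
import time

EPSILON = 0.004  # ~4 ms of uncertainty, matching the figure in the lecture

def tt_now():
    # TrueTime-style clock: returns an interval [earliest, latest] that is
    # guaranteed to contain the true physical time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait():
    # Commit wait: pick the commit timestamp s = latest bound at commit
    # time, then wait until even the earliest bound has passed s, i.e. s is
    # definitely in the past before the commit becomes visible anywhere.
    _, s = tt_now()
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 4)
    return s

s = commit_wait()
print(time.time() > s)  # True: the commit timestamp is safely in the past
```

The wait costs about twice the uncertainty per commit, which is why keeping the uncertainty small (the ~4 ms below) matters so much.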

(image)

As the figure shows, the average uncertainty is actually quite short, around 4 ms, far below cross-continent latency (~100 ms).

jasperzhong commented 3 years ago

Finished the whole course before I knew it... Let me venture an evaluation.

This course feels beginner-friendly, well suited to a novice like me. Teaching a technique in isolation is pointless; this course explains the context in which each technique arose.

This table is important.

| Problem | Must wait for communication | Requires synchrony |
| --- | --- | --- |
| atomic commit | all participating nodes | partially synchronous |
| consensus, total order broadcast, linearizable CAS | quorum | partially synchronous |
| linearizable get/set | quorum | asynchronous |
| eventual consistency, causal broadcast, FIFO broadcast | local replica only | asynchronous |

The last row is AP: it sacrifices strong consistency for availability, but it does not abandon consistency entirely; once communication recovers, replicas can still sync and reach eventual consistency, so in the long run the data is consistent.