jasperzhong / cs-notes

CS knowledge framework

Cambridge 20/21 | Concurrent and Distributed Systems (partIB) #1

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB

notes: https://www.cl.cam.ac.uk/teaching/2021/ConcDisSys/dist-sys-notes.pdf

The lecturer is Martin Kleppmann, author of DDIA. Fantastic.

Also, the British accent is really pleasant to listen to.

jasperzhong commented 3 years ago

Lecture 1

Mostly a review of networking basics.

RPC: (image) The serialization step in the middle is called marshal/unmarshal, and the component that performs it is the stub.

REST is in fact also a form of RPC; the messages travel as JSON.

Microservices generally communicate with each other via RPC.

Interface definition language (IDL): a language-independent API specification, e.g. protobuf.
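As a rough illustration of the stub's marshal/unmarshal role, here is a toy sketch in Python; the JSON wire format and the local `handlers` dispatch table are stand-ins for a real transport and remote server:

```python
import json

def marshal(method, args):
    # Client-side stub: serialize the call into a wire format (JSON here).
    return json.dumps({"method": method, "args": args}).encode()

def unmarshal(payload):
    # Server-side stub: deserialize the request back into method + args.
    msg = json.loads(payload.decode())
    return msg["method"], msg["args"]

# A toy "server" dispatch table standing in for the remote service.
handlers = {"add": lambda a, b: a + b}

def rpc_call(method, *args):
    payload = marshal(method, list(args))  # the request crosses the network here
    name, got = unmarshal(payload)
    return handlers[name](*got)

print(rpc_call("add", 2, 3))  # 5
```

The application code just calls `rpc_call("add", 2, 3)` as if it were a local function; that illusion is exactly what the stub provides.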

jasperzhong commented 3 years ago

Lecture 2: Models of distributed systems

The Two Generals Problem

Because the channel is unreliable, the two parties cannot reach agreement.

For example, A sends a message to B; there are two possible cases:

  1. B received it, but the ack was lost.
  2. B never received it.

From A's point of view the two cases look identical: no response either way. But for B they are two different situations.

A has two options:

  1. Always attack: send the same message many times. If any one of them gets through, it succeeds, but it can still fail.
  2. Attack only after receiving an ack: but then B is in the same position A was in, since B does not know whether A received the ack. This recurses forever (a chain of mutual suspicion).

The problem is provably unsolvable.

TCP works like option 1: if no ack is received, it retransmits, until a timeout fires.
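The retransmit-until-ack behaviour can be sketched as follows; the loss rate and retry limit are made-up parameters for illustration, not anything TCP specifies:

```python
import random

random.seed(1)  # deterministic for the example

def unreliable_send(msg, loss_rate=0.5):
    # Stand-in for an unreliable channel: returns True iff an ack came back.
    return random.random() > loss_rate

def send_with_retries(msg, max_retries=10):
    # Option 1 / TCP-style: keep retransmitting until acked or we time out.
    for attempt in range(1, max_retries + 1):
        if unreliable_send(msg):
            return attempt  # number of transmissions it took
    raise TimeoutError("no ack after %d attempts" % max_retries)

attempts = send_with_retries("attack at dawn")
print(attempts)
```

Note that even with retries, success is only probabilistic; that is exactly the residual failure mode of option 1.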

The Byzantine Generals Problem

A more general problem: some nodes may behave maliciously, e.g. defect or lie.

Theorem: 3f + 1 nodes are needed to tolerate f Byzantine faults (e.g. 4 nodes to tolerate 1 traitor).

System models

This summary is spot-on.

In real systems, both nodes and the network can fail.

Three aspects to consider:

network behaviour

Network partition: some of the links are down.

Transitions between these assumptions:

node behaviour

synchrony (timing) assumptions

The notions of "synchronous" and "asynchronous" here differ from how those terms are used elsewhere in computing.

Why can the network exhibit very high latency?

Why might a node occasionally pause execution?

Indeed.

Choosing a system model appropriate to the scenario matters. For an on-prem cluster, for example, there is no need to assume an arbitrary network or Byzantine faults.

Fault Tolerance

Fault and failure mean different things: a fault is a component deviating from its spec, while a failure is the system as a whole being unable to provide its service.

Fault tolerance: the system keeps working even after faults occur.

Failure detector: decides whether a node has failed, typically by checking whether heartbeats time out.

I recall Redis has a nice design here: a majority vote decides whether a node is down. https://medium.com/kokster/fault-tolerant-redis-architecture-35334ef5de0b

jasonliu747 commented 3 years ago

Props to Zhong :thumbsup:

jasperzhong commented 3 years ago

Lecture 3: Time, clocks and ordering of events

Physical time

GPS is now the usual physical time source: the satellites carry atomic clocks and broadcast the time, which is why datacenters often have antennas on the roof.

Leap second: the legendary "+1s" and "-1s", inserted at the end of 30 June or 31 December. Software nowadays generally ignores it, but sometimes it really does make a difference...

(image)

NTP: Network Time Protocol, for synchronizing physical time. Nodes are either NTP clients or NTP servers. How does the synchronization work? (image)

There is a programming pitfall hidden here...

(image)

I genuinely did not know about this pitfall before... cppreference has this to say about std::chrono::system_clock:

The system_clock's time value can be internally adjusted at any time by the operating system, for example due to NTP synchronization or the user changing the system's clock. Daylight Savings Time and time zone changes, however, do not affect it since it is based on the UTC time-zone.

So from now on, intervals should be measured with std::chrono::steady_clock...

Class std::chrono::steady_clock represents a monotonic clock. The time points of this clock cannot decrease as physical time moves forward and the time between ticks of this clock is constant. This clock is not related to wall clock time (for example, it can be time since last reboot), and is most suitable for measuring intervals.

And do not use std::chrono::high_resolution_clock: it is merely an alias for one of the two clocks above, and which one differs between implementations. So avoid it.
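The same wall-clock vs monotonic distinction exists in other languages; a small Python sketch of why intervals should use the monotonic clock:

```python
import time

# time.time() is wall-clock time: the OS (e.g. via NTP) may step it at any
# moment, so differences between two readings can be wrong or even negative.
# time.monotonic() never goes backwards, so it is safe for measuring intervals.
start = time.monotonic()
time.sleep(0.05)               # the work being measured
elapsed = time.monotonic() - start

print(f"elapsed: {elapsed:.3f}s")  # roughly 0.05s, regardless of NTP steps
```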

Causality and happens-before

The happens-before relation is a partial order: it is possible that neither a -> b nor b -> a holds, in which case a and b are concurrent, written a || b.

Causality (cause and effect):

Happens-before captures potential causality.

Causal order is a partial order.

jasperzhong commented 3 years ago

Lecture 4: Broadcast protocols and logical time

logical time

A logical clock is really just a counter, designed to capture causal dependencies, i.e. a -> b => T(a) < T(b).

Lamport clocks

(image)

(L(e), N(e)) uniquely identifies event e, where N(e) is an ordering on node names. This defines a total order.
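A minimal Lamport clock sketch; the node names and the two-event scenario are made up for illustration:

```python
class LamportClock:
    # Lamport clock: one counter per node; tick on local events and sends,
    # and on receive take max(local, received) + 1.
    def __init__(self, node_name):
        self.node = node_name
        self.t = 0

    def tick(self):          # local event or send
        self.t += 1
        return self.t

    def recv(self, t_msg):   # merge the sender's timestamp on receive
        self.t = max(self.t, t_msg) + 1
        return self.t

a, b = LamportClock("A"), LamportClock("B")
ta = a.tick()                # A sends a message with L = 1
tb = b.recv(ta)              # B receives it and advances to L = 2
print((ta, "A"), (tb, "B"))  # (L(e), N(e)) pairs, comparable as a total order
```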

Drawbacks:

Vector clocks

Store the counters in a vector rather than a scalar.

(image)

Use the generalized (element-wise) inequality to determine the order.

properties of this order:

This is neat.
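A two-node vector clock sketch of the generalized inequality (the scenario is made up):

```python
# Each node increments its own entry on local events, and on receive merges
# element-wise maxima before incrementing its own entry.
def vc_leq(u, v):
    # Generalized inequality: u <= v iff every component of u is <= v's.
    return all(x <= y for x, y in zip(u, v))

def vc_merge(u, v):
    return [max(x, y) for x, y in zip(u, v)]

va = [0, 0]
va[0] += 1                          # event at A: va = [1, 0]
vb = vc_merge([0, 0], va)
vb[1] += 1                          # B receives A's message: vb = [1, 1]

print(vc_leq(va, vb))               # True: A's event happened-before B's
# Neither direction holds for [1, 0] and [0, 1]: the events are concurrent.
print(vc_leq([1, 0], [0, 1]) or vc_leq([0, 1], [1, 0]))  # False
```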

Broadcast protocols

Everything so far has been point-to-point communication; now consider group communication.

Broadcast acts as a middleware layer: applications above, the network below. (image)

Reliable broadcast, classified by ordering guarantees:

  1. FIFO broadcast: messages sent by the same node must be delivered in the order they were sent. (note: messages sent by different nodes can be delivered in any order)
  2. Causal broadcast: causally related messages must be delivered in causal order. (note: concurrent messages can be delivered in any order)
  3. Total order broadcast: all nodes must deliver messages in the same order. (note: concurrent messages need some agreed order; any order is fine as long as every node delivers them in the same one)

Relationships between them: (image)

Somewhat curious which of these kinds of broadcast middleware like NCCL or OpenMPI implements...

Broadcast algorithms

Split into two problems:

  1. How to implement reliable broadcast
  2. How to implement the various orders

Some approaches to problem 1:

Some approaches to problem 2 (all of them need a buffer):

  1. FIFO broadcast:

(image)

Quite simple: delivery must follow the sequence numbers, counting up from 0.

  2. Causal broadcast algorithm: essentially uses vector clocks to enforce causality; the method is similar.
  3. Total order broadcast: trickier, because the nodes need to agree on what the order actually is.
    • Single-leader approach: the leader's order is authoritative. A node that wants to send a message sends it to the leader, and the leader does the broadcast. The problem is leader crashes, etc.
    • Lamport clocks approach: use the total order formed by (L(e), N(e)).

Neither of these approaches achieves fault tolerance.
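The FIFO delivery rule above (buffer out-of-order messages, then deliver each sender's messages in sequence-number order) can be sketched as:

```python
from collections import defaultdict

class FifoBroadcastReceiver:
    # FIFO broadcast delivery: hold back messages that arrive early and
    # deliver each sender's messages strictly in sequence-number order.
    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected seq per sender
        self.buffer = defaultdict(dict)    # sender -> {seq: msg}
        self.delivered = []

    def on_receive(self, sender, seq, msg):
        self.buffer[sender][seq] = msg
        # Deliver as many in-order messages from this sender as possible.
        while self.next_seq[sender] in self.buffer[sender]:
            n = self.next_seq[sender]
            self.delivered.append(self.buffer[sender].pop(n))
            self.next_seq[sender] += 1

r = FifoBroadcastReceiver()
r.on_receive("A", 1, "m1")   # arrives early: buffered, not delivered
r.on_receive("A", 0, "m0")   # fills the gap: both delivered, in order
print(r.delivered)           # ['m0', 'm1']
```

The causal broadcast algorithm is the same shape, except the hold-back test compares vector clocks instead of a per-sender counter.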

jasperzhong commented 3 years ago

Replication

Finally the fault-tolerance topic begins. Replication is probably the most widely used technique.

Benefits:

The biggest problem: consistency.

The lecture gives a "like button" example. Very interesting.

Idempotence: f(x) = f(f(x)).

For instance, the inconsistency below can occur, and the cases cannot even be told apart: from the server's perspective the results look the same. (image)

Solution: use timestamps (they must be logical time). In the figure below, false does not mean truly removed but temporarily invisible, called a tombstone (because another node may still deliver a newer true message). Tombstones can be garbage-collected later. This makes it possible to tell which message is newer. (image)

Replicas then sync periodically, check for inconsistencies, and resolve them in favour of the latest write.
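A sketch of such a timestamped register with tombstones (the class name and the like/unlike values are made up; timestamps are assumed to come from a logical clock):

```python
class LWWRegister:
    # Last-writer-wins register: keep (timestamp, value); an update applies
    # only if its timestamp is newer. Deletes store a tombstone instead of
    # removing state, so later syncs can still compare timestamps.
    def __init__(self):
        self.t = 0
        self.value = None
        self.tombstone = False

    def update(self, t, value, tombstone=False):
        if t > self.t:                     # ignore stale updates
            self.t, self.value, self.tombstone = t, value, tombstone

    def merge(self, other):
        # Periodic anti-entropy sync between replicas: latest write wins.
        self.update(other.t, other.value, other.tombstone)

a, b = LWWRegister(), LWWRegister()
a.update(1, "liked")
b.update(2, None, tombstone=True)   # a newer delete on another replica
a.merge(b)
print(a.tombstone)  # True: the delete wins because its timestamp is newer
```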

Either Lamport or vector clocks will do as the logical clock here:

Quorums

Importance of fault tolerance: achieving high availability.

Problem: read-after-write consistency.

(image)

Solution: quorums, e.g. 2 out of 3, or 3 out of 4.

It is really the pigeonhole principle: at least one of the replies to a read must come from a replica that saw the earlier write, and the timestamps identify which value is the latest.
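The overlap argument can be sketched with n = 3 replicas and quorums of 2 (the replica layout and value format are made up):

```python
# Values are (timestamp, value) pairs. A write quorum of 2 and a read quorum
# of 2 out of n = 3 must overlap in at least one replica (pigeonhole), so
# every read sees at least one copy of the latest write.
n = 3
replicas = [(0, None)] * n

def quorum_write(t, value, reachable):
    global replicas
    assert len(reachable) >= 2, "need a write quorum of 2 out of 3"
    replicas = [(t, value) if i in reachable else replicas[i] for i in range(n)]

def quorum_read(reachable):
    assert len(reachable) >= 2, "need a read quorum of 2 out of 3"
    # The reply with the highest timestamp is the latest value.
    return max(replicas[i] for i in reachable)

quorum_write(1, "x=42", reachable={0, 1})   # replica 2 missed the write
print(quorum_read(reachable={1, 2}))        # (1, 'x=42'): overlap at replica 1
```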

State machine replication (SMR)

Consider using a broadcast protocol to implement replication.

total order broadcast: every node delivers the same messages in the same order.

State machine replication (SMR): use (FIFO-)total order broadcast to propagate updates to all replicas. Each replica is a state machine; they start in the same state, and since every replica delivers the same updates in the same order, they are guaranteed to end up in the same final state.
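The SMR invariant can be sketched as follows (the update format and the dictionary state are made up; what matters is that updates are deterministic and applied in the same order everywhere):

```python
# Two replicas start in the same state and apply the same deterministic
# updates in the same totally ordered sequence, so they end in the same state.
def apply_update(state, update):
    op, key, val = update
    new = dict(state)          # replicas are pure state machines
    if op == "set":
        new[key] = val
    return new

total_order = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

replica_a = replica_b = {}
for u in total_order:          # every replica delivers the same order
    replica_a = apply_update(replica_a, u)
    replica_b = apply_update(replica_b, u)

print(replica_a == replica_b)  # True
print(replica_a)               # {'x': 3, 'y': 2}
```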

limitations:

Can a weaker broadcast, such as causal broadcast, be used instead? It depends: yes, if the replica state updates are commutative, as the table shows:

(image)

jasperzhong commented 3 years ago

Consensus

The most important chapter. The main thing to master is Raft.

First, the problem: how do we make total order broadcast fault-tolerant? For example, in a leader-based implementation, what happens if the leader crashes or becomes unavailable?

Manual failover means human intervention, and humans are too slow: recovery takes a long time. We need to automatically choose a new leader, and that is consensus.

Consensus: several nodes need to agree on a single value. In the context of total order broadcast, that value is the next message to deliver.

Common consensus algorithms:

The focus here is Raft.

Raft

Consensus system models: Paxos and Raft assume a partially synchronous, crash-recovery system model.

Why not an asynchronous model? (The FLP result: deterministic consensus is impossible in a fully asynchronous model if even one node may crash.)

P.S. Consensus algorithms for the partially synchronous Byzantine system model: blockchains...

Leader election

Raft uses a leader to sequence messages.

Each term has at most 1 leader.

How do we guarantee there is only one leader? Example: the leader becomes unavailable due to a network problem, and the other nodes elect a new leader in term t+1, but the old leader does not know a new one has been elected.

So before making each decision (delivering a message), the leader asks its followers: am I still your leader? The decision is valid only if a quorum of them confirms it.
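That quorum check boils down to a strict-majority test (a simplification: real Raft piggybacks the confirmation on AppendEntries heartbeats rather than asking explicitly):

```python
# The leader counts itself plus the follower acks for the current term;
# the decision stands only with a strict majority of the cluster.
def can_commit(acks_for_term, cluster_size):
    votes = acks_for_term + 1          # the leader counts itself
    return votes * 2 > cluster_size    # strict majority

print(can_commit(acks_for_term=2, cluster_size=5))  # True: 3 of 5
print(can_commit(acks_for_term=1, cluster_size=5))  # False: 2 of 5
```

Requiring a strict majority is what prevents a deposed leader in an old term from committing: two majorities must overlap, and the overlapping node knows about the newer term.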

Node state transitions in Raft

A node's status is itself a state machine. (image)

Implementation

This lecture is long; just watch the video walkthrough...

https://www.youtube.com/watch?v=IPnesACYRck

jasperzhong commented 3 years ago

Replica consistency

First, a definition of consistency.

Two-phase commit

First, look at atomicity in ACID transactions.

This is not quite the same as consensus; the comparison is as follows: (image)

Atomicity in ACID is generally implemented with two-phase commit (2PC). (image)

What if the coordinator crashes?

Problem: if the coordinator crashes after prepare but before broadcasting the decision, the replicas block until the coordinator recovers.

Solution: implement fault-tolerant two-phase commit using total order broadcast.
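A sketch of the basic (non-fault-tolerant) 2PC decision rule, with in-process callables standing in for the participants' prepare votes:

```python
# Two-phase commit: the coordinator collects prepare votes, and commits
# only if every participant voted yes; a single "no" (or silence treated
# as "no") aborts the whole transaction.
def two_phase_commit(participants):
    # Phase 1: prepare — ask every participant whether it can commit.
    votes = [p() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: the coordinator broadcasts the decision to everyone.
    return decision

print(two_phase_commit([lambda: True, lambda: True]))   # commit
print(two_phase_commit([lambda: True, lambda: False]))  # abort
```

The blocking problem in the text lives between the two phases: once a participant has voted yes, it must hold its locks until it learns the decision, and in this basic form only the coordinator can tell it.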

Linearizability

Equivalent to strong consistency. Not the same thing as serializability.

Linearizability: all operations behave as if executed on a single copy of the data. (focus on client-observable behaviour, ignore how the replication system is implemented internally)

Informally: every operation takes effect atomically sometime after it started and before it finished.

Roughly: for read-after-write, the read must return the latest value; if a read overlaps a write, the read may return either the old or the new value.

(image)

(image)

Also, a quorum alone does not guarantee linearizability. To guarantee it, a client that receives inconsistent replies must write the newer value back to the replicas still holding the old one.

(image)

Linearizability can also be implemented with total order broadcast + CAS. (image)

Linearizability advantages:

Downsides:

Eventual consistency

CAP theorem

Eventual consistency: replicas handle operations using only their local state.

strong eventual consistency:

Benefits:

Summary

| Problem | Must wait for communication | Requires synchrony |
| --- | --- | --- |
| atomic commit | all participating nodes | partially synchronous |
| consensus, total order broadcast, linearizable CAS | quorum | partially synchronous |
| linearizable get/set | quorum | asynchronous |
| eventual consistency, causal broadcast, FIFO broadcast | local replica only | asynchronous |
jasperzhong commented 3 years ago

Concurrency control in applications

collaboration software

E.g. Google Docs: every user effectively has a local replica, which they can update at any time (even offline) and sync with the others when the network is available.

Algorithms:

Not very interested in this; I skipped the algorithm details.

Google Spanner

The culmination of all the techniques above.

a data system with millions of nodes, petabytes of data, distributed across datacenters worldwide.

consistency properties:

Techniques used:

An interesting point: read-only transactions require no locks.

MVCC: multi-version concurrency control. (image)

Commit timestamps must respect causality, i.e. T1 -> T2 => t1 < t2. Normally logical clocks would solve this, but linearizability depends on real-time order, which logical time cannot capture (the two replicas might never communicate at all, while logical clocks rely on attaching timestamps to messages).

So they use TrueTime... The exact physical time is unknowable, so TrueTime instead returns an interval [T_earliest, T_latest] that contains the true physical timestamp. The width of the interval is the uncertainty. On every commit, the transaction waits out the uncertainty, which guarantees the real-time dependency.

TrueTime is implemented by putting atomic clocks or GPS receivers in every datacenter... time is synchronized every 30s.
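The commit-wait rule can be sketched as follows (`EPSILON` and the helper names are made up for illustration; real TrueTime derives its uncertainty from measured clock-synchronization error):

```python
import time

EPSILON = 0.004  # ~4 ms of uncertainty, matching the figure in the lecture

def tt_now():
    # TrueTime-style clock: returns an interval [earliest, latest] that is
    # guaranteed to contain the true physical time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait():
    # Commit wait: pick the commit timestamp s = latest bound at commit
    # time, then wait until even the earliest bound has passed s, i.e. s is
    # definitely in the past before the commit becomes visible anywhere.
    _, s = tt_now()
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 4)
    return s

s = commit_wait()
print(time.time() > s)  # True: the commit timestamp is safely in the past
```

The wait costs about twice the uncertainty per commit, which is why keeping the uncertainty small (the ~4 ms below) matters so much.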

(image)

As the figure shows, the average uncertainty is actually quite short, around 4 ms, far below cross-continent latency (~100 ms).

jasperzhong commented 3 years ago

Finished the whole course before I knew it... Let me venture an evaluation.

This course feels beginner-friendly, well suited to a novice like me. Teaching a technique in isolation is pointless; this course explains the context in which each technique arose.

This table is important.

| Problem | Must wait for communication | Requires synchrony |
| --- | --- | --- |
| atomic commit | all participating nodes | partially synchronous |
| consensus, total order broadcast, linearizable CAS | quorum | partially synchronous |
| linearizable get/set | quorum | asynchronous |
| eventual consistency, causal broadcast, FIFO broadcast | local replica only | asynchronous |

The last row is AP: it sacrifices strong consistency for availability, but it does not abandon consistency entirely; once communication recovers, replicas can still sync and reach eventual consistency, so in the long run the data is consistent.