Closed jasperzhong closed 3 years ago
Mainly a review of networking basics.
RPC: the serialization step in the middle is called marshal / unmarshal, and the component that does it is the stub.
REST is actually a form of RPC too; the messages just travel as JSON.
Microservices typically communicate with each other via RPC.
interface definition language (IDL): a language-independent API specification, e.g. protobuf.
Because the channel is unreliable, the two parties can never reach agreement.
For example, A sends a message to B; there are two possible cases.
The two cases look identical to A (both are "no response"), but to B they are two different things.
A has two options:
This problem has been proven to have no solution.
TCP takes option 1: retransmit whenever no ack arrives, giving up after a timeout.
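Option 1 (retransmit until acknowledged) can be sketched as a loop. A minimal sketch, assuming a made-up lossy channel `unreliable_send`; note the receiver may process the same message more than once, which is exactly the ambiguity described above:

```python
import random

random.seed(42)  # deterministic for the example

def unreliable_send(msg, loss_rate=0.5):
    """Hypothetical lossy channel: returns an ack only if the message (or
    its ack) was not dropped."""
    return None if random.random() < loss_rate else f"ack:{msg}"

def send_with_retry(msg, max_attempts=10):
    """Option 1: keep retransmitting until an ack arrives, give up on timeout.
    The sender still cannot tell a lost message from a lost ack."""
    for attempt in range(1, max_attempts + 1):
        if unreliable_send(msg) is not None:
            return attempt  # number of transmissions it took
    raise TimeoutError(f"no ack for {msg!r} after {max_attempts} attempts")

attempts = send_with_retry("m1")
print(attempts)
```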
A more general problem: some nodes may behave maliciously, e.g. deliberately betray the others.
Theorem: tolerating f Byzantine faults requires 3f + 1 nodes.
This summary is spot on.
In real systems, both nodes and the network can fail.
Consider three aspects:
network behaviour
network partition: some of the links are down.
Conversions between the models:
node behaviour
synchrony(timing) assumptions
"Synchronous" and "asynchronous" here mean something different from what those words mean elsewhere in computing.
Why can the network exhibit very high latency?
Why might a node occasionally pause execution?
True.
Choosing a system model appropriate to the scenario matters; e.g. for an on-prem cluster there is no need to worry about arbitrary networks or Byzantine faults.
fault and failure do not mean the same thing.
Fault tolerance: the system keeps working after a fault occurs.
failure detector: decides whether a node has failed, usually by checking whether its heartbeats have timed out.
I remember Redis has a nice design for this: declaring a node down requires a majority vote. https://medium.com/kokster/fault-tolerant-redis-architecture-35334ef5de0b
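A timeout-based failure detector can be sketched in a few lines. A minimal sketch with made-up names; `now` is injectable so the example is deterministic:

```python
import time

class FailureDetector:
    """Suspect a node if no heartbeat arrived within `timeout` seconds.
    This is inherently imperfect: a slow network is indistinguishable
    from a crashed node."""
    def __init__(self, timeout=0.1):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, node, now=None):
        self.last_heartbeat[node] = time.monotonic() if now is None else now

    def suspected(self, node, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_heartbeat.get(node)
        return last is None or now - last > self.timeout

fd = FailureDetector(timeout=0.1)
fd.heartbeat("A", now=0.0)
fresh = fd.suspected("A", now=0.05)   # heartbeat still fresh
stale = fd.suspected("A", now=0.25)   # heartbeat timed out
print(fresh, stale)  # False True
```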
A thumbs-up for Zhong :thumbsup:
GPS is now the usual physical time source: the satellites carry atomic clocks and broadcast the time, which is why datacenter roofs often have antennas.
leap second: the legendary "+1s" and "-1s". They occur at the end of 30 June or 31 Dec. Most software ignores them nowadays, but occasionally they really do make a difference...
NTP: Network Time Protocol, for synchronizing physical time. There are two kinds of nodes, NTP clients and NTP servers. How does the synchronization work?
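The core of NTP synchronization is the standard offset/delay estimate from four timestamps: client sends at t1, server receives at t2, server replies at t3, client receives at t4. A minimal sketch of that calculation:

```python
def ntp_offset(t1, t2, t3, t4):
    """Standard NTP estimate, assuming symmetric network paths:
    round-trip delay = (t4 - t1) - (t3 - t2)
    clock offset     = ((t2 - t1) + (t3 - t4)) / 2"""
    delay = (t4 - t1) - (t3 - t2)
    offset = ((t2 - t1) + (t3 - t4)) / 2
    return offset, delay

# Example: server clock 100 units ahead, 10 units one-way delay each
# direction, server holds the request for 5 units before replying.
offset, delay = ntp_offset(t1=0, t2=110, t3=115, t4=25)
print(offset, delay)  # 100.0 20
```

The symmetric-path assumption is why NTP accuracy degrades when the two directions of the network have different latencies.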
There is a programming pitfall lurking here...
I honestly did not know about this pitfall before... cppreference has this to say about std::chrono::system_clock:
The system_clock's time value can be internally adjusted at any time by the operating system, for example due to NTP synchronization or the user changing the system's clock. Daylight Savings Time and time zone changes, however, do not affect it since it is based on the UTC time-zone.
Looks like intervals should be measured with std::chrono::steady_clock from now on...
Class std::chrono::steady_clock represents a monotonic clock. The time points of this clock cannot decrease as physical time moves forward and the time between ticks of this clock is constant. This clock is not related to wall clock time (for example, it can be time since last reboot), and is most suitable for measuring intervals.
Do not use std::chrono::high_resolution_clock: it is merely an alias for one of the two clocks above, and which one varies between implementations. So avoid it.
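Python's standard library draws the same distinction: time.time() is the adjustable wall clock (like system_clock), while time.monotonic() is the steady clock. A minimal sketch of measuring an interval safely:

```python
import time

# time.time(): wall clock, can jump backwards/forwards on NTP adjustment,
#              so it is wrong for measuring intervals.
# time.monotonic(): never decreases, so intervals are always non-negative.
start = time.monotonic()
time.sleep(0.01)
elapsed = time.monotonic() - start
print(elapsed > 0)  # True: monotonic guarantees a positive interval here
```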
The happens-before relation is a partial order: it is possible that neither a -> b nor b -> a holds, in which case a and b are concurrent, written a || b.
causality:
happens-before captures potential causality.
Causal order is a partial order.
A logical clock is really just a counter designed to capture causal dependencies, i.e. a -> b => T(a) < T(b).
The pair (L(e), N(e)) uniquely identifies event e, where N(e) is the node name used as a tie-breaker. This defines a total order.
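A Lamport clock is just a counter bumped on every event and merged on receive. A minimal sketch (class and method names are made up):

```python
class LamportClock:
    """Lamport clock: a -> b implies T(a) < T(b); the converse does not hold.
    Events are identified by (L(e), N(e)), giving a total order."""
    def __init__(self, node):
        self.node = node
        self.t = 0

    def local_event(self):
        self.t += 1
        return (self.t, self.node)

    def send(self):
        """Increment and attach the timestamp to the outgoing message."""
        self.t += 1
        return self.t

    def receive(self, msg_t):
        """Merge the sender's timestamp: take the max, then increment."""
        self.t = max(self.t, msg_t) + 1
        return (self.t, self.node)

a, b = LamportClock("A"), LamportClock("B")
a.local_event()      # A: t = 1
t = a.send()         # A: t = 2, message carries 2
e = b.receive(t)     # B: t = max(0, 2) + 1 = 3
print(e)  # (3, 'B')
```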
Drawbacks:
Store the counter as a vector rather than a scalar.
Use a generalized (element-wise) inequality to determine the order.
properties of this order:
This is neat.
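The generalized inequality on vector clocks, and how it exposes concurrency, can be sketched as follows (`vc_leq` and `compare` are made-up helper names):

```python
def vc_leq(u, v):
    """Generalized inequality: u <= v iff u[i] <= v[i] for every i."""
    return all(ui <= vi for ui, vi in zip(u, v))

def compare(u, v):
    """Classify two vector timestamps: 'before', 'after', 'equal',
    or 'concurrent' (neither dominates, i.e. a || b)."""
    if u == v:
        return "equal"
    if vc_leq(u, v):
        return "before"
    if vc_leq(v, u):
        return "after"
    return "concurrent"

print(compare([1, 0, 0], [2, 1, 0]))  # before: causally ordered
print(compare([1, 2, 0], [2, 1, 0]))  # concurrent: neither dominates
```

Unlike Lamport clocks, vector clocks let us detect concurrency, at the cost of one counter per node.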
Everything so far has been p2p communication. Now consider group communication.
It acts as middleware: applications above, the network below.
Reliable broadcast: classified by delivery order.
How they relate to each other.
A bit curious which kind of broadcast middleware like NCCL and OpenMPI implement...
Split into two sub-problems:
Some approaches to problem 1:
Some approaches to problem 2: all of them need a hold-back buffer.
Quite simple: deliver in sequence-number order, counting from 0 up to N.
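The count-from-0-to-N delivery rule can be sketched with a hold-back buffer (a minimal sketch for a single sender; class and method names are made up):

```python
class FifoReceiver:
    """Hold-back buffer: deliver messages from one sender strictly in
    sequence-number order 0, 1, 2, ..., buffering any that arrive early."""
    def __init__(self):
        self.next_seq = 0
        self.buffer = {}
        self.delivered = []

    def on_receive(self, seq, msg):
        self.buffer[seq] = msg
        # Deliver every consecutive message starting at next_seq.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

r = FifoReceiver()
r.on_receive(1, "b")   # arrived early: held back
r.on_receive(0, "a")   # unblocks both 0 and 1
r.on_receive(2, "c")
print(r.delivered)  # ['a', 'b', 'c']
```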
Neither approach achieves fault tolerance.
At last, the fault tolerance topic. Replication is probably the most widely used technique.
Benefits:
The biggest problem: consistency.
Uses a "like button" example. Very interesting.
idempotence: f(x) = f(f(x)).
For example, the inconsistency below can occur, and the cases cannot be told apart: from the server's point of view they look identical.
Solution: use timestamps (logical time is required). In the figure below, false is not a real removal but marks the entry temporarily invisible, called a tombstone (some other node may still deliver a newer true). Tombstones can be garbage-collected later. This makes it possible to tell which message is newer.
Replicas then sync periodically, check for all inconsistencies, and let the latest write win.
Either Lamport or vector clocks work as the logical clock here:
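A last-writer-wins register with tombstones can be sketched as follows (a minimal sketch; names are made up, and (counter, node) pairs stand in for Lamport timestamps so ties break deterministically):

```python
class LWWRegister:
    """Last-writer-wins store with tombstones. A delete does not remove the
    entry; it writes a tombstone so later syncs can still compare timestamps."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value, is_tombstone)

    def apply(self, key, timestamp, value, tombstone=False):
        current = self.store.get(key)
        if current is None or timestamp > current[0]:  # newer write wins
            self.store[key] = (timestamp, value, tombstone)

    def get(self, key):
        entry = self.store.get(key)
        return None if entry is None or entry[2] else entry[1]

r = LWWRegister()
r.apply("likes", (1, "A"), 1)
r.apply("likes", (3, "B"), None, tombstone=True)  # delete = tombstone
r.apply("likes", (2, "A"), 2)  # stale write arrives late: ignored
print(r.get("likes"))  # None: the tombstone is the newest write
```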
importance of fault tolerance: achieve high-availability
problem: read-after-write consistency
solution: quorums, e.g. 2 out of 3, 3 out of 4.
It is really the pigeonhole principle: at least one of the replicas read must hold the previously written value, and the timestamps identify the latest one.
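The pigeonhole argument in code: with w + r > n, any read quorum must overlap the latest write quorum, and the timestamp picks the winner. A minimal sketch with made-up names:

```python
def quorum_read(responses):
    """Given (timestamp, value) pairs from a read quorum of r replicas,
    return the value with the newest timestamp. If w + r > n, at least one
    response must carry the latest write (pigeonhole principle)."""
    return max(responses, key=lambda pair: pair[0])[1]

# n = 3 replicas; the latest write (timestamp 2) reached only two of them.
# A read quorum of r = 2 happens to include one stale replica:
read_quorum = [(1, "old"), (2, "new")]
print(quorum_read(read_quorum))  # 'new'
```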
Consider using a broadcast protocol to do replication.
total order broadcast: every node delivers the same messages in the same order.
State machine replication (SMR): use (FIFO-)total order broadcast to apply updates to all replicas. Each replica is a state machine; they all start in the same state and receive the updates in the same order, so they are guaranteed to end in the same final state.
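The SMR guarantee is easy to demonstrate: same initial state plus same update sequence yields the same final state. A minimal sketch (names are made up; the list stands in for the totally ordered broadcast log):

```python
class Replica:
    """A replica is a deterministic state machine driven by an update log."""
    def __init__(self):
        self.state = 0

    def apply(self, update):
        self.state = update(self.state)

# Total order broadcast guarantees every replica sees this exact sequence.
log = [lambda s: s + 5, lambda s: s * 2, lambda s: s - 3]

r1, r2 = Replica(), Replica()
for update in log:
    r1.apply(update)
    r2.apply(update)
print(r1.state, r2.state)  # 7 7: identical, as SMR requires
```

Note the updates here are not commutative ((0 + 5) * 2 != 0 * 2 + 5), which is exactly why the total order matters.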
limitations:
Can a weaker broadcast, such as causal broadcast, be used instead? It depends: it works if the replica state updates are commutative, as the table below shows:
The most important lecture. The key thing to master is Raft.
The opening question: how do we make total order broadcast fault tolerant? For example, in a leader-based implementation, what happens if the leader crashes or becomes unavailable?
Manual failover means human intervention, and humans are slow: recovery takes a long time. We must automatically choose a new leader, and that is consensus.
consensus: several nodes need to reach agreement on a single value. In the context of total order broadcast, that value is the next message to deliver.
Common consensus algorithms:
We focus on Raft.
consensus system models: Paxos and Raft assume a partially synchronous, crash-recovery system model.
Why not an asynchronous model?
Because the FLP result proves that consensus is impossible in that setting:
There is no deterministic consensus algorithm that is guaranteed to terminate in an asynchronous crash-stop system model.
Paxos and Raft use clocks to detect failures (e.g. via timeouts), but their correctness does not depend on timing.
P.S. consensus algorithms for the partially synchronous Byzantine system model: blockchains...
Raft uses a leader to sequence messages.
Each term has at most 1 leader.
How is a single leader guaranteed? Example: the leader becomes unavailable due to a network problem, so the other nodes elect a new leader in term t+1, but the old leader does not know that a new one has been elected.
So before every decision (delivering a message), the leader asks the followers whether it is still their leader; the decision is valid only if a quorum confirms.
The node states themselves form a state machine.
This lecture is too long. Just watch the video walkthrough...
First, a definition of consistency.
First look at atomicity in ACID transactions.
But that is not quite the same as consensus. The comparison:
Atomicity in ACID is usually implemented with two-phase commit (2PC).
What if the coordinator crashes?
problem: if the coordinator crashes after prepare but before broadcasting the decision, the replicas block until the coordinator recovers.
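The two phases can be sketched in a few lines (a minimal sketch with made-up names; it deliberately has no fault tolerance, so a coordinator crash between the two phases would leave participants blocked, exactly the problem above):

```python
class Participant:
    def __init__(self, vote_yes):
        self.vote_yes = vote_yes
        self.outcome = None

    def prepare(self):
        """Phase 1: vote yes/no on whether this node can commit."""
        return self.vote_yes

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision."""
        self.outcome = decision

def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]          # phase 1
    decision = "commit" if all(votes) else "abort"
    for p in participants:                               # phase 2
        p.finish(decision)
    return decision

ps = [Participant(True), Participant(True), Participant(False)]
decision = two_phase_commit(ps)
print(decision)  # 'abort': one participant voted no
```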
solution: implement fault-tolerant two-phase commit on top of total order broadcast.
Equivalent to strong consistency. Not the same thing as serializability.
Linearizability: all operations behave as if executed on a single copy of the data. (focus on client-observable behaviour, ignore how the replication system is implemented internally)
Informally: every operation takes effect atomically sometime after it started and before it finished.
Roughly: for a read that starts after a write finished, the read must return the latest value. If the read overlaps the write, it is not guaranteed to return the latest value.
Also, quorums by themselves do not guarantee linearizability. To get it, a client that observes inconsistent data must write the new value back to the replicas still holding the old one.
It can also be implemented with total order broadcast + CAS.
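The total order broadcast + CAS idea can be sketched as follows (a minimal sketch with made-up names; the call order below stands in for the agreed delivery order of the broadcast):

```python
class CASRegister:
    """Every replica receives the same CAS operations in the same total
    order and applies them deterministically, so all replicas agree on
    which CAS succeeded: this yields linearizable compare-and-swap."""
    def __init__(self):
        self.value = None

    def deliver_cas(self, expected, new):
        # Applied identically on every replica when delivered.
        if self.value == expected:
            self.value = new
            return True
        return False

# Two clients race to set the register; the broadcast serializes them.
r = CASRegister()
first = r.deliver_cas(None, "x")   # delivered first: succeeds
second = r.deliver_cas(None, "y")  # delivered second: fails, value is 'x'
print(first, second, r.value)  # True False x
```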
Linearizability advantages:
Downsides:
CAP theorem
eventual consistency: replicas serve operations from their local state.
strong eventual consistency:
Benefits:
Problem | Must wait for communication | Requires synchrony |
---|---|---|
atomic commit | all participating nodes | partially synchronous |
consensus, total order broadcast, linearizable CAS | quorum | partially synchronous |
linearizable get/set | quorum | asynchronous |
eventual consistency, causal broadcast, FIFO broadcast | local replica only | asynchronous |
E.g. Google Docs: each user effectively has a local replica that can be updated anytime (even offline), syncing with the others when the network is available.
algorithms:
Not that interested; skipped the algorithm details.
A grand synthesis of everything in the course.
a data system with millions of nodes, petabytes of data, distributed across datacenters worldwide.
consistency properties:
Techniques used:
An interesting point: read-only transactions require no locks.
MVCC: multi-version concurrency control
To assign commit timestamps, causality must be satisfied: T1 -> T2 => t1 < t2. Normally logical clocks solve this, but linearizability depends on real-time order, which logical time cannot capture (the two replicas might never communicate at all, and logical clocks work by attaching timestamps to messages).
So they use TrueTime... the exact physical time is unknowable; TrueTime instead returns an interval [t_earliest, t_latest] that is guaranteed to contain the true physical timestamp. The width of the interval is the uncertainty. On every commit, the node waits out the uncertainty, which guarantees the real-time dependency.
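The commit-wait rule can be sketched as: pick s = TT.now().latest as the commit timestamp, then wait until TT.now().earliest > s, so s is definitely in the past before the commit becomes visible. A minimal sketch with a fake injectable clock (all names are made up):

```python
def commit_wait(tt_now, sleep):
    """Spanner-style commit wait sketch. After the wait, any transaction
    that starts later gets an interval lying wholly after s, so real-time
    order is preserved in the timestamps."""
    _earliest, latest = tt_now()
    s = latest                   # commit timestamp
    while tt_now()[0] <= s:      # wait out the uncertainty
        sleep()
    return s

# Toy TrueTime: uncertainty of +/- 4 time units around a fake clock.
clock = {"t": 100}
def tt_now():
    return (clock["t"] - 4, clock["t"] + 4)
def sleep():
    clock["t"] += 1  # advancing the fake clock stands in for real waiting

s = commit_wait(tt_now, sleep)
print(s, clock["t"])  # 104 109: waited until earliest (t - 4) exceeded s
```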
TrueTime is implemented by installing an atomic clock or GPS receiver in every datacenter... clocks are synchronized every 30s.
It turns out the average uncertainty is quite short, around 4ms, which is far below cross-continent latencies (~100ms).
Finished the whole course before I knew it... allow me a humble verdict.
This course feels more beginner-friendly, well suited to a novice like me. Teaching a technique in isolation is pointless; this course explains the context in which each technique arose.
This table is important.
Problem | Must wait for communication | Requires synchrony |
---|---|---|
atomic commit | all participating nodes | partially synchronous |
consensus, total order broadcast, linearizable CAS | quorum | partially synchronous |
linearizable get/set | quorum | asynchronous |
eventual consistency, causal broadcast, FIFO broadcast | local replica only | asynchronous |
The last row is AP: strong consistency is sacrificed for availability, but consistency is not abandoned entirely. Once communication recovers, the replicas can still sync and reach eventual consistency, so the data is consistent in the long run.
https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB
notes: https://www.cl.cam.ac.uk/teaching/2021/ConcDisSys/dist-sys-notes.pdf
The lecturer is Martin Kleppmann, the author of DDIA. Fantastic.
And the British accent is lovely (runs away)