Lec4: Primary/Backup Replication

kmansei commented 1 year ago

Lecture Note: https://pdos.csail.mit.edu/6.824/notes/l-vm-ft.txt
Paper: https://pdos.csail.mit.edu/6.824/papers/vm-ft.pdf
Paper Question: https://github.com/kmansei/6.5840/issues/4#issue-1783673545
Paper FAQ: https://pdos.csail.mit.edu/6.824/papers/vm-ft-faq.txt
Lecture: https://www.youtube.com/watch?v=M_teob23ZzY&list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&index=4

vSphere FT: https://qiita.com/takahashi-kazuki/items/a4c63ad41146b1fecffc#vsphere-ft%E3%81%AE%E7%89%B9%E5%BE%B4

kmansei commented 1 year ago

Abstract: The authors implemented a fault-tolerant system for virtual machines (VMs) using a replication approach. Their system forms the core of VMware vSphere 4.0.

kmansei commented 1 year ago

Intro

A common approach to implementing fault-tolerant servers is the primary/backup approach [1], where a backup server is always available to take over if the primary server fails.

A different method for replicating servers that can use much less bandwidth is sometimes referred to as the state-machine approach [13]. The idea is to model the servers as deterministic state machines that are kept in sync by starting them from the same initial state and ensuring that they receive the same input requests in the same order.

kmansei commented 1 year ago

Basic Design

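The primary and backup VMs run on different physical servers.
The two VMs run in virtual lockstep, with only a small time lag.
All network input is handled by the primary and forwarded to the backup over the logging channel.
The backup's output is dropped; only the primary's output reaches the client.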

kmansei commented 1 year ago

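2.1 Two DFAs that start from the same initial state and are given the same deterministic inputs in the same order go through the same state transitions and produce the same outputs. However, non-deterministic events (virtual interrupts) and operations (reading the clock cycle counter) make it hard to keep them in sync.

VMware deterministic replay drives the backup by reading the incoming log entries and replaying them. For non-deterministic operations, more detailed information is sent than for deterministic ones so that the two VMs stay in sync. For example, for non-deterministic events such as timer interrupts or IO completions, the exact instruction at which the event occurred is also recorded.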

kmansei commented 1 year ago

2.2

Output Requirement: if the backup VM ever takes over after a failure of the primary, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to the external world.

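In other words, whenever the backup takes over after a primary failure, it must continue executing in a way that is fully consistent with every output the primary has already sent to the outside world.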

kmansei commented 1 year ago

failover: a mechanism that automatically switches over to a standby system when a problem causes the running system or server to stop.

kmansei commented 1 year ago

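If the primary fails partway through producing output, the backup cannot satisfy the Output Requirement unless it knows how far the primary got with its output. The backup replays up to the point where the primary failed, then stops replaying and starts sending output to the outside world as the new primary.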

kmansei commented 1 year ago

Output Rule: the primary VM may not send an output to the external world, until the backup VM has received and acknowledged the log entry associated with the operation producing the output.
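The Output Rule can be illustrated with a minimal sketch. This is not VMware's implementation: the `Primary` class, its sequence numbers, and the `on_ack` callback are hypothetical names introduced here, and the ack is assumed to be cumulative (it covers all log entries up to the given number). The point is only that an output is held back until the log entry for the operation that produced it has been acknowledged.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy primary that holds outputs until the backup acks their log entries."""
    next_seq: int = 0
    acked_seq: int = -1                          # highest log entry acked so far
    pending: list = field(default_factory=list)  # (seq, output) waiting for an ack
    sent: list = field(default_factory=list)     # outputs released to the outside world

    def produce_output(self, output):
        """Log the output-producing operation and hold the output back."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append((seq, output))
        return seq

    def on_ack(self, seq):
        """Backup acked all log entries up to seq: release the outputs they cover."""
        self.acked_seq = max(self.acked_seq, seq)
        still_waiting = []
        for s, out in self.pending:
            if s <= self.acked_seq:
                self.sent.append(out)
            else:
                still_waiting.append((s, out))
        self.pending = still_waiting


primary = Primary()
first = primary.produce_output("reply-1")
second = primary.produce_output("reply-2")
print(primary.sent)   # [] : nothing may leave before an ack
primary.on_ack(first)
print(primary.sent)   # ['reply-1']
```

Note that only the output is delayed here; the sketch does not block `produce_output`, which mirrors the idea that the primary can keep executing while an output waits for its ack.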

kmansei commented 1 year ago

2.3 VMware FT uses UDP heartbeating to detect failures of the primary and the backup.

kmansei commented 1 year ago

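Split-brain problem: if, because of a network problem or similar, the primary is judged to have failed while it is actually still running and the backup is promoted, there end up being multiple primaries. VMware FT relies on shared storage: a VM that wants to go live performs an atomic test-and-set operation on the shared storage, and if the operation fails it concludes that another primary is already running and halts itself.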
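The test-and-set arbitration can be sketched as follows. This is a toy model, not the real mechanism: `SharedDiskFlag` and `try_go_live` are hypothetical names, and an in-process `threading.Lock` stands in for whatever atomicity the shared storage actually provides.

```python
import threading


class SharedDiskFlag:
    """Stand-in for a flag on shared storage; the lock makes test-and-set atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._taken = False

    def test_and_set(self):
        """Return True only for the first caller; every later caller gets False."""
        with self._lock:
            if self._taken:
                return False
            self._taken = True
            return True


def try_go_live(name, flag):
    """A VM that believes its peer is dead asks the shared disk before going live."""
    if flag.test_and_set():
        return f"{name}: go live"
    return f"{name}: halt"


flag = SharedDiskFlag()
print(try_go_live("backup", flag))    # backup: go live
print(try_go_live("primary", flag))   # primary: halt
```

Because exactly one caller can win the test-and-set, at most one VM goes live even if both believe the other has failed.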

kmansei commented 1 year ago

When the backup is promoted to primary, a new backup is started on another host.

kmansei commented 1 year ago

3.1 To replicate a VM, VMware VMotion functionality is used to create a clone of the VM on a remote host and connect the two with a logging channel.

kmansei commented 1 year ago

3.2

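primary VM -> primary log buffer -> channel -> backup log buffer -> backup VM

If the backup runs slowly, the primary log buffer can fill up, in which case the primary has to stall until entries drain. Normally, though, the backup replays at about the same speed as the primary executes.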
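The back-pressure on the primary log buffer can be sketched with a bounded FIFO. This is an illustration only, assuming (as the paper describes) that the primary must wait when its log buffer is full; `LogBuffer`, `try_append`, and `drain` are hypothetical names.

```python
from collections import deque


class LogBuffer:
    """Bounded FIFO standing in for the primary's log buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def try_append(self, entry):
        """Return False when the buffer is full, meaning the primary would stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(entry)
        return True

    def drain(self, n):
        """The backup consumed up to n entries; return them in FIFO order."""
        count = min(n, len(self.entries))
        return [self.entries.popleft() for _ in range(count)]


buf = LogBuffer(capacity=2)
print(buf.try_append("e1"), buf.try_append("e2"), buf.try_append("e3"))  # True True False
print(buf.drain(1))            # ['e1']
print(buf.try_append("e3"))    # True
```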

kmansei commented 1 year ago

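3.4 Concurrent operations on the same disk location, or on the same memory, are non-deterministic. The primary and backup therefore make sure such concurrent operations are processed in the same order on both sides. This could be handled by setting up page protection in the MMU, but that is costly. VMware FT uses bounce buffers instead.

Bounce buffers act as intermediate memory that sits between the disk and the VM.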

kmansei commented 1 year ago

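Paper Question: In VMware FT the primary and backup are separated at the network level, but both use a shared disk. If a VM's atomic test-and-set on the shared disk fails, it concludes that another primary is currently running and halts itself, so the split-brain problem does not occur.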

kmansei commented 1 year ago

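What is DMA

DMA stands for Direct Memory Access. It refers to techniques and features for speeding up data transfer in a computer system.

Normally, a data transfer is carried out by the processor (CPU) reading data from memory and sending it to a device (for example, a hard disk or a network card). With this approach the processor is involved in the transfer, so it competes with other tasks and processing speed can drop.

DMA was developed to overcome this limitation. With DMA, the processor does not handle the transfer itself; the device accesses memory directly to read and write data. The processor is therefore free to concentrate on other tasks, which improves transfer speed and overall system performance.

DMA can be implemented at the hardware or software level. Hardware DMA handles transfers using dedicated circuits or controllers built into the computer architecture. Software DMA is the way programs and drivers use the DMA facility to manage transfers.

DMA is widely used to move large amounts of data quickly and efficiently. Examples include playing video and audio streams, writing files to disk, and sending and receiving network packets.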

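Uses of DMA

Storage devices: hard disk drives (HDDs), solid-state drives (SSDs), and other storage devices use DMA to read and write data. DMA speeds up transfers between the storage device and the processor and improves performance.

Graphics cards: graphics cards use DMA when processing high-resolution video and 3D graphics. Data is transferred from memory to the graphics card via DMA, which enables fast rendering and processing.

Network interface cards: network cards use DMA to send and receive network packets. Network data is transferred directly between memory and the network card, which makes fast, efficient network communication possible.

Audio devices: audio devices and sound cards use DMA to play and record audio data. DMA reduces the load on the processor and makes real-time audio processing possible.

These are only some examples; DMA is used in many other devices and systems. Its advantages, fast data transfer and efficient use of resources, play an important role in modern computer systems.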

kmansei commented 1 year ago

Synchronization flow, with concrete examples

The logging channel
  primary sends all events to backup over network
    "logging channel", carrying log entries
    interrupts, incoming network packets, data read from shared disk
  FT provides backup's input (interrupts &c) from log entries
  FT suppresses backup's network output
  if either stops being able to talk to the other over the network
    "goes live" and provides sole service
    if primary goes live, it stops sending log entries to the backup

Each log entry: instruction #, type, data.

FT's handling of timer interrupts
  Goal: primary and backup should see interrupt at exactly
        the same point in the instruction stream
  Primary:
    FT fields the timer interrupt
    FT reads instruction number from CPU
    FT sends "timer interrupt at instruction # X" on logging channel
    FT delivers interrupt to primary, and resumes it
    (relies on CPU support to direct interrupts to FT software)
  Backup:
    ignores its own timer hardware
    FT sees log entry *before* backup gets to instruction # X
    FT tells CPU to transfer control to FT at instruction # X
    FT mimics a timer interrupt that backup guest sees
    (relies on CPU support to jump to FT after the X'th instruction)
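The timer-interrupt handling above can be sketched with a toy guest. This is an illustration under stated assumptions, not FT's mechanism: `run_guest` is a hypothetical function, the "guest" is just an integer state updated once per instruction, and the log is a plain list of `("timer", X)` entries.

```python
def run_guest(num_instructions, interrupts_at):
    """Run a toy deterministic guest. interrupts_at is the set of instruction
    numbers at which a timer interrupt is delivered. Returns the final state."""
    state = 0
    for pc in range(num_instructions):
        if pc in interrupts_at:
            state = state * 31 + 7   # interrupt handler: effect depends on current state
        state = state + pc           # ordinary deterministic instruction
    return state


# Primary: the timer fires at arbitrary points; FT logs each instruction number.
primary_interrupts = {3, 17, 42}
log = [("timer", x) for x in sorted(primary_interrupts)]
primary_state = run_guest(100, primary_interrupts)

# Backup: ignores its own timer and delivers interrupts only where the log says.
backup_state = run_guest(100, {x for kind, x in log if kind == "timer"})
print(primary_state == backup_state)   # True

# Delivering one interrupt a single instruction late makes the guests diverge.
skewed_state = run_guest(100, {4, 17, 42})
print(primary_state == skewed_state)   # False
```

The last two lines show why the exact instruction number matters: the same interrupts delivered at a slightly different point leave the guest in a different state.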

FT's handling of network packet arrival (input)
  Primary:
    FT configures NIC to write packet data into FT's private "bounce buffer"
    At some point a packet arrives, NIC does DMA, then interrupts
    FT gets the interrupt, reads instruction # from CPU
    FT pauses the primary
    FT copies the bounce buffer into the primary's memory
    FT simulates a NIC interrupt in primary
    FT sends the packet data and the instruction # to the backup
  Backup:
    FT gets data and instruction # from log stream
    FT tells CPU to interrupt (to FT) at instruction # X
    FT copies the data to guest memory, simulates NIC interrupt in backup

kmansei commented 1 year ago

bounce bufferが必要な理由

Q: How do Section 3.4's bounce buffers help avoid races?

A: The problem arises when a network packet or requested disk block arrives at the primary and needs to be copied into the primary's memory. Without FT, the relevant hardware copies the data into memory while software is executing. Guest instructions could read that memory during the DMA; depending on exact timing, the guest might see or not see the DMA'd data (this is the race). It would be bad if the primary and backup both did this, since due to slight timing differences one might read just after the DMA and the other just before. In that case they would diverge.

FT avoids this problem by not copying into guest memory while the primary or backup is executing. FT first copies the network packet or disk block into a private "bounce buffer" that the primary cannot access. When this first copy completes, the FT hypervisor interrupts the primary so that it is not executing. FT records the point at which it interrupted the primary (as with any interrupt). Then FT copies the bounce buffer into the primary's memory, and after that allows the primary to continue executing. FT sends the data to the backup on the log channel. The backup's FT interrupts the backup at the same instruction as the primary was interrupted, copies the data into the backup's memory while the backup is not executing, and then resumes the backup.

The effect is that the network packet or disk block appears at exactly the same time in the primary and backup, so that no matter when they read the memory, both see the same data.
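A toy model of the effect described in this answer, with hypothetical names (`run`, `read_at`, `deliver_at`) and a guest reduced to a loop over instruction numbers. The DMA is assumed to have already landed in the bounce buffer; the only thing the guest can observe is the copy into guest memory, which happens at a logged instruction number while the guest is paused.

```python
def run(read_at, deliver_at, packet=b"data"):
    """Toy guest that reads its packet buffer at instruction read_at.
    The hypervisor copies the bounce buffer into guest memory at instruction
    deliver_at, between guest instructions. Returns what the guest read."""
    guest_memory = bytes(len(packet))   # zero-filled until the packet is delivered
    bounce_buffer = packet              # NIC DMA landed here, invisible to the guest
    seen = None
    for pc in range(max(read_at, deliver_at) + 1):
        if pc == deliver_at:
            guest_memory = bounce_buffer   # copy happens while the guest is paused
        if pc == read_at:
            seen = guest_memory
    return seen


logged_instruction = 5   # recorded on the primary, replayed on the backup
primary_view = run(read_at=7, deliver_at=logged_instruction)
backup_view = run(read_at=7, deliver_at=logged_instruction)
print(primary_view, backup_view)                    # b'data' b'data'
print(run(read_at=3, deliver_at=logged_instruction))  # b'\x00\x00\x00\x00'
```

Whether the guest reads before or after delivery, the primary and backup agree, because the delivery point is the same logged instruction on both rather than depending on DMA timing.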

kmansei commented 1 year ago

State Transfer vs Replicated State Machine https://youtu.be/M_teob23ZzY?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&t=548

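State Transfer sends a copy of the primary's entire state to the backup. The idea is that the backup can be restored as the primary from that copy of the state. Examples of state: memory, a database, etc.

Replicated State Machine does not send state. It forwards external inputs to the backup. The idea is that servers in the same state, given the same inputs at the same points, return the same results. Because only external operations are sent to the backup, it is lighter than State Transfer but more complex.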
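The two approaches can be contrasted with a minimal sketch. `KVServer` is a hypothetical toy server whose state is a dict and whose inputs are `(key, value)` writes; it is deterministic, which is the assumption the replicated state machine approach depends on.

```python
import copy


class KVServer:
    """Toy deterministic server: state is a dict, inputs are (key, value) writes."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        key, value = op
        self.state[key] = value


ops = [("x", 1), ("y", 2), ("x", 3)]

primary = KVServer()
for op in ops:
    primary.apply(op)

# State transfer: ship a copy of the primary's whole state.
backup_st = KVServer()
backup_st.state = copy.deepcopy(primary.state)

# Replicated state machine: ship only the inputs, in order, and re-execute them.
backup_rsm = KVServer()
for op in ops:
    backup_rsm.apply(op)

print(backup_st.state == primary.state == backup_rsm.state)   # True
```

Both backups end up identical to the primary. What differs is what crosses the network: the whole state in the first case, only the ordered inputs in the second.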

kmansei commented 1 year ago

State Transfer is better suited to replicating multi-core processors https://youtu.be/M_teob23ZzY?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&t=956

Even when the hardware is multi-core, VMware FT emulates a single core and runs the guest OS on it. Multi-core VMs are not discussed in the paper https://youtu.be/M_teob23ZzY?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&t=2385

kmansei commented 1 year ago

Most of the traffic on the logging channel is presumably incoming network packets, with only a small part being non-deterministic events (interrupts, random number generation, reading the time) https://youtu.be/M_teob23ZzY?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&t=3193

kmansei commented 1 year ago

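What happens if the primary has sent output to the client but fails while the corresponding log entry is still in transit to the backup? https://youtu.be/M_teob23ZzY?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&t=3375

In VMware FT, the primary does not emit the output until the backup has returned an ack.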

kmansei commented 1 year ago

When designing a fault-tolerant system based on replication, it is hard to avoid producing duplicate output https://youtu.be/M_teob23ZzY?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&t=4183
