hexresearch / hschain


Rewrite gossip to use UDP instead of TCP #225

Open anton-k opened 5 years ago

anton-k commented 5 years ago

We need to rewrite gossip to use UDP, test with xenochain that the transition is reliable, and estimate the performance gains.

thesz commented 5 years ago

I think I'll take this on.

I've found the places to look at and copy from. The code is mostly connection-oriented: NetworkAPI is connection-oriented (connect, listenOn, etc.), but P2PConnection is not; it provides just send, receive and close actions, which are relatively easy to implement on top of UDP. I will look closely at the uses of NetworkAPI to decide how to deal with it.

The main concern here is that a message may exceed the UDP datagram limit (about 64K for IPv4, which is itself split into Ethernet frames), and we have to reassemble things on the receiving side. I would like to approach that by adding two fields to the UDP payload: a "front index" field and an offset within the front. The "front index" is equal for all UDP packets of a bigger message and lets us identify what we have to reassemble; it increases monotonically (modulo, say, 256). The offset within the front is where this particular UDP payload (minus the front-index and offset fields) must be placed.

If a message has to be split into several UDP payloads, all payloads before the last one must be bigger than some size limit (say, 65000, or closer to the Ethernet frame size if we don't trust the underlying protocol suite), and the last one must be less than or equal to the limit.

As a sanity check, the offset of a UDP payload within a message must equal the sum of the sizes of the payloads with smaller offsets. This way we can randomize the sizes of the parts to some extent and (probabilistically) prevent parts from older fronts from ending up in the current message we receive.
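To make the scheme concrete, here is a minimal sketch of such a chunk header in Haskell. The Chunk type, the field widths, and the use of cereal's Serialize are my assumptions for illustration, not the project's actual wire format:

```haskell
import qualified Data.ByteString as BS
import           Data.Serialize  (Serialize (..))
import           Data.Word       (Word32, Word8)

-- One UDP payload of a (possibly) larger message. frontIndex is shared
-- by all chunks of the same message and wraps modulo 256; chunkOffset
-- says where chunkData goes inside the reassembled message.
data Chunk = Chunk
  { frontIndex  :: !Word8
  , chunkOffset :: !Word32
  , chunkData   :: !BS.ByteString
  }

instance Serialize Chunk where
  put (Chunk fi off dat) = put fi >> put off >> put dat
  get = Chunk <$> get <*> get <*> get

-- The sanity check described above: given chunks sorted by offset,
-- each chunk's offset must equal the total size of the chunks before it.
offsetsConsistent :: [Chunk] -> Bool
offsetsConsistent = go 0
  where
    go _ [] = True
    go expected (c:cs) =
      chunkOffset c == expected
        && go (expected + fromIntegral (BS.length (chunkData c))) cs
```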

thesz commented 5 years ago

While I am at it, please take a look at https://github.com/jepsen-io/jepsen - this tool tests distributed systems by injecting faults. It also reports the issues found and lets you test performance in the presence of faults at different rates.

Shimuuar commented 5 years ago

I caught a cold and cannot think straight. I'll reply when I feel better

voidlizard commented 5 years ago

We should not literally rewrite gossip to use UDP instead of TCP; we should keep the possibility of using either TCP or UDP, depending on the application and situation.

Shimuuar commented 5 years ago

Actually, TCP is baked very deeply into the current network implementation, and switching to UDP will take a lot of work.

Small datagrams

One problem is that the practical limit on UDP datagram size is rather small, 512 B or so; anything larger could be fragmented, dropped, or fragmented and then dropped. This is a rather tight limit.

Looking at the list of messages that we need to exchange, we have two problems with large messages: 1) blocks/proposals and 2) transactions. They are gossiped in different ways and should probably be dealt with in different ways as well.

Blocks/proposals

The general idea for blocks is that we want to transmit them in chunks, so that we could get different parts from different peers. This is not implemented at the moment, although it's been planned for a long time (#21). Also, a node should only accept blocks it knows about, either because it has a correct proposal or b…

We may have malicious nodes as well: they can transmit invalid parts of a block. So it's highly desirable to be able to check whether a part of a block is valid without assembling the whole block. One approach is to build a Merkle tree from the block and send it as the proposal. That, however, brings another problem: we probably don't want to store the full Merkle tree, since it's rather large, and this creates problems for node catch-up. We don't store proposals, so a lagging node will only get the commit, which contains only the top-level hash, and it's unclear how to transmit the full Merkle tree to a lagging node.
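A minimal sketch of per-chunk validation against a Merkle root, using SHA-256 from the cryptonite package; the function names and the proof layout are illustrative assumptions, not project code:

```haskell
import           Crypto.Hash     (Digest, SHA256, hashWith)
import qualified Data.ByteArray  as BA
import qualified Data.ByteString as BS

type Hash = Digest SHA256

leafHash :: BS.ByteString -> Hash
leafHash = hashWith SHA256

nodeHash :: Hash -> Hash -> Hash
nodeHash l r = hashWith SHA256 (BS.append (BA.convert l) (BA.convert r))

-- Merkle root of a non-empty list of block chunks; an odd node at any
-- level is carried up to the next level unchanged.
merkleRoot :: [BS.ByteString] -> Hash
merkleRoot = go . map leafHash
  where
    go [h] = h
    go hs  = go (pairUp hs)
    pairUp (a:b:rest) = nodeHash a b : pairUp rest
    pairUp rest       = rest

-- A proof lists the sibling hash at each level, from leaf to root,
-- tagged with the side the sibling sits on.
data Side = SibLeft | SibRight

-- Check a single chunk against the root from the proposal, without
-- having the rest of the block.
verifyChunk :: Hash -> BS.ByteString -> [(Side, Hash)] -> Bool
verifyChunk root chunk path = root == foldl step (leafHash chunk) path
  where
    step h (SibLeft,  sib) = nodeHash sib h
    step h (SibRight, sib) = nodeHash h sib
```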

Transactions

They are smaller, and we don't assemble them from different peers, so a simpler setup will suffice.

Other considerations

The small datagram size is not the only consideration. We need to do other things:

  1. Track whether a peer is still alive (heartbeat?).

  2. Implement some measure for controlling message floods. Peers can send messages faster than we can process them. At the moment we have a queue for incoming messages; when it's full, the network thread blocks as well. With TCP the other side then blocks too; with UDP we will probably just lose packets (see the sketch after this list).
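On the second point, a minimal sketch of the UDP-side policy with a bounded STM queue; offerOrDrop is a hypothetical helper, not existing project code:

```haskell
import Control.Concurrent.STM

-- With TCP, a full incoming queue blocks our reader thread and the
-- backpressure eventually blocks the sender. With UDP nobody blocks
-- for us, so on a full queue we must drop explicitly (the sender would
-- not have noticed a lost datagram anyway).
offerOrDrop :: TBQueue msg -> msg -> IO Bool
offerOrDrop q msg = atomically $ do
  full <- isFullTBQueue q
  if full
    then pure False                    -- queue full: drop the message
    else True <$ writeTBQueue q msg    -- room available: enqueue it
```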

Shimuuar commented 5 years ago

Also, one of the reasons given for the rewrite to UDP is efficiency, but there is much lower-hanging fruit for improving gossip efficiency:

  1. Thundermint.Crypto.Containers.ValidatorISet is a set of numbers from 0 to N. At the moment it's an IntSet; since N is relatively small, it could be replaced with something more efficient (see the sketch after this list).

  2. The selection of votes to gossip in Thundermint.P2P is written in a really inefficient way.
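On the first point, a minimal sketch of what "more efficient" could look like, assuming N <= 64 so the whole validator set fits in one machine word; ValidatorBitSet and these function names are hypothetical, not the project's API:

```haskell
import Data.Bits (popCount, setBit, testBit)
import Data.Word (Word64)

-- Insert, membership, and size become single machine instructions
-- instead of IntSet's tree traversal.
newtype ValidatorBitSet = ValidatorBitSet Word64

empty :: ValidatorBitSet
empty = ValidatorBitSet 0

insert :: Int -> ValidatorBitSet -> ValidatorBitSet
insert i (ValidatorBitSet w) = ValidatorBitSet (setBit w i)

member :: Int -> ValidatorBitSet -> Bool
member i (ValidatorBitSet w) = testBit w i

size :: ValidatorBitSet -> Int
size (ValidatorBitSet w) = popCount w
```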

thesz commented 5 years ago

The packet size for UDP is, at the very least, as big as the TCP packet size; most often it is about 1500 bytes or so. There is a way to transmit bigger UDP datagrams, up to 65,535 bytes for IPv4 and up to 4 GB for IPv6 (jumbograms). They will be split into smaller packets and reassembled, if possible, at the network protocol level; in that case it is the kernel's job to wait for all the parts, assemble them, etc.

My current plan is to have a UDP implementation of what the TCP implementation does right now. Basically, you will have (almost) the same functionality with a different transport, except there will be lost messages and other corruption. ;)

Given that, we can experiment with, say, typed messages: e.g., extend P2PConnection with the ability to extract, possibly out of order, messages of different types. This is, basically, the functionality of Cloud Haskell, from which we can borrow parts.
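A rough sketch of what typed, out-of-order extraction could look like, in the spirit of Cloud Haskell's expect; Mailbox, deliver, and expectTyped are hypothetical names, and a real version would first decode incoming bytes into Dynamic values:

```haskell
import Control.Concurrent.STM
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Typeable (Typeable)

newtype Mailbox = Mailbox (TVar [Dynamic])

-- Put a decoded message of any type into the mailbox.
deliver :: Typeable a => Mailbox -> a -> STM ()
deliver (Mailbox var) x = modifyTVar' var (++ [toDyn x])

-- Take the first message of the requested type, leaving messages of
-- other types in place (hence "possibly out of order"); block if no
-- message of that type has arrived yet.
expectTyped :: Typeable a => Mailbox -> STM a
expectTyped (Mailbox var) = do
  msgs <- readTVar var
  case extract [] msgs of
    Nothing        -> retry
    Just (x, rest) -> writeTVar var rest >> pure x
  where
    extract _    []     = Nothing
    extract seen (d:ds) = case fromDynamic d of
      Just x  -> Just (x, reverse seen ++ ds)
      Nothing -> extract (d:seen) ds
```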

Shimuuar commented 5 years ago

I'm no expert on network matters, but a quick search suggests that UDP fragmentation doesn't work reliably.

Also, I'm afraid that trying to implement the current abstract API over UDP will result in a suboptimal reimplementation of TCP on top of UDP. Anyway, it's probably better to discuss this on tomorrow's call.

thesz commented 5 years ago

There is a thing called the Lustre cluster file system; it works across hundreds of machines. The author once told me that they had a bug lurking in their code for several years: they had never exercised the UDP resend path because, as it turned out, they hadn't lost a single UDP packet in more than seven years of active use of Lustre in various environments.

Lustre implements Paxos for highly loaded environments, so it is quite close to what we are doing here.

No, it will not result in a reimplementation of TCP in UDP. The logic for working with UDP will be different: what if you lose some of the messages? With UDP you have to think about that.

I also think that we can discuss this by voice, but the discussion's conclusions must be written down here.

Shimuuar commented 5 years ago

The thing is, nodes (if all goes well) won't be sitting in a single data center with a well-configured network. They will be scattered all over the globe and will communicate over whatever is between them, especially light nodes, if we ever get them.

> I also think that we can discuss this by voice, but the discussion's conclusions must be written down here.

Sure!


thesz commented 5 years ago


PS The maximum size of an Ethernet frame is 1522 octets: https://en.wikipedia.org/wiki/Ethernet_frame

The IPv4 header takes 20 bytes and the UDP header takes 8 bytes, so within the standard 1500-byte Ethernet payload the remaining data size is 1472 bytes.

thesz commented 5 years ago

It looks like an AF_INET6-bound socket is unable to receive AF_INET-produced datagrams, even when it is allowed to do so (by setting the appropriate socket option).
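For reference, this is roughly the dual-stack setup being attempted, sketched with the network package; openDualStackUdp is a hypothetical helper, and whether clearing IPV6_V6ONLY actually admits IPv4 datagrams depends on the OS:

```haskell
import Network.Socket

-- Bind a UDP socket to the IPv6 wildcard address with IPV6_V6ONLY
-- cleared, so IPv4 senders show up as IPv4-mapped addresses
-- (::ffff:a.b.c.d). Some systems default IPV6_V6ONLY to 1, and some
-- do not allow clearing it at all.
openDualStackUdp :: PortNumber -> IO Socket
openDualStackUdp port = do
  sock <- socket AF_INET6 Datagram defaultProtocol
  setSocketOption sock IPv6Only 0
  bind sock (SockAddrInet6 port 0 (0, 0, 0, 0) 0)
  pure sock
```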

The branch is here: https://github.com/hexresearch/thundermint/tree/225-adding-UDP-transport

Also take a look here: https://docs.google.com/document/d/1QXXoQCxVOE6wSKPzbDeZu1aL0PIJGdrEz3OtjkX0lrw/edit?usp=sharing (in Russian): a proposal on how to use BitTorrent-like functionality for message exchange.

thesz commented 5 years ago

Added log dumping; now I see that the pingPong test case manages to send a message from client to server, and the server sends a message back to the client, but the client cannot receive the answer.

The network-unavailability error when sending the answer back from the server to an IPv4 client is gone, but now IPv6 cannot send the answer back.

Looking into what is wrong there.

thesz commented 5 years ago

Some effects of non-determinism.

When the chunk size that packets are split into is low (namely 400, instead of the 1400 I'd like to use as the default) and there is only one job (cabal new-run --enable-tests -- thundermint-tests -j1 -t5s), I see no timeouts at all. For the same chunk size and 8 jobs (the default for my laptop), I see timeouts for packet sizes from 256K upward.

A chunk size of 1400 produces non-deterministic timeouts around packet sizes of 0.5M..1M, regardless of the number of jobs.

A chunk size of 6400 produces A LOT of deserialization errors. Investigating that.

thesz commented 5 years ago

The deserialization errors were caused by a mismatch between the split chunk size and the receive (recvFrom) buffer size. When I aligned them (recvFrom now gets twice as many bytes to receive as the chunk size), things improved vastly.
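The fix, sketched with Network.Socket.ByteString (receiveDatagram is a hypothetical helper; chunkSize is the split size discussed above): recvFrom silently truncates any datagram larger than the requested buffer, and the truncated remainder then fails to deserialize.

```haskell
import qualified Data.ByteString           as BS
import           Network.Socket            (SockAddr, Socket)
import           Network.Socket.ByteString (recvFrom)

-- Ask for more bytes than any datagram we expect to send, so a datagram
-- at the configured chunk size is never truncated on receive.
receiveDatagram :: Int -> Socket -> IO (BS.ByteString, SockAddr)
receiveDatagram chunkSize sock = recvFrom sock (2 * chunkSize)
```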

thesz commented 5 years ago

The thundermint-coin-node test shell script shows progress (growing height) when UDP is enabled with the --udp switch.

But the logs differ: UDP produces more logs, showing that P2P cannot obtain all nodes for some reason. CPU utilization is also high: every process uses a whole core, in contrast with ~16% CPU utilization per process with TCP.

Also, I broke the tests for the mock network.

These three issues are the work for the next several days (till next Monday, I think).

thesz commented 5 years ago

"UDP produces more logs" - not excessively more. Logs for test runs for TCP weight 590K, for UDP 610K, it is 3% increase. The string "too few ... known conns" also shows in TCP, about same number of times.

Investigating the CPU usage and the mock network tests.

thesz commented 5 years ago

The excessive CPU usage may be due to higher memory use for UDP: the TCP version uses ~44K, while the UDP version manages to shoot up to 200K..220K. That's a direction to look into!

thesz commented 5 years ago

The continuous exchange of peer addresses was caused by a too-high limit on the minimal number of peers; I haven't found where it is set so far.

thesz commented 5 years ago

Fixed the limit (in the test program thundermint-coin-node). Now looking at the excessive CPU and memory consumption.

thesz commented 5 years ago

Changing an import from Data.Map to Data.Map.Strict reduced memory consumption considerably.

PS CPU consumption has also gone down.
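A minimal illustration of why the strict import matters here; the counter is hypothetical, not the project's actual map usage:

```haskell
import qualified Data.Map        as LazyMap
import qualified Data.Map.Strict as StrictMap

-- With the lazy API, each insertWith stores an unevaluated (+) thunk;
-- a frequently bumped counter then retains a growing chain of thunks
-- until something forces the value. The strict API evaluates the new
-- value at insertion time.
bumpLazy :: Ord k => k -> LazyMap.Map k Int -> LazyMap.Map k Int
bumpLazy k = LazyMap.insertWith (+) k 1

bumpStrict :: Ord k => k -> StrictMap.Map k Int -> StrictMap.Map k Int
bumpStrict k = StrictMap.insertWith (+) k 1
```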

thesz commented 5 years ago

The amount of data allocated for UDP is 25 times that of TCP. That's interesting.

thesz commented 5 years ago

https://github.com/hexresearch/thundermint/commit/d4642b4ae33a65ef8c74bb26fb4b329e056b279f - this commit fixes the connect-acknowledge sequence of events, and now UDP works as well as TCP does, if not better (for UDP I observe slightly lower CPU utilization, for example).

thesz commented 5 years ago

I have to fix the tests for the mock network now.

thesz commented 5 years ago

Everything was fixed.

thesz commented 5 years ago

TCP time stats for a local run (height 5):

real 8m45,578s  user 1m22,479s  sys 0m4,869s
real 8m44,588s  user 1m23,853s  sys 0m4,914s
real 8m46,537s  user 1m26,206s  sys 0m5,179s
real 8m45,667s  user 1m24,588s  sys 0m4,789s

UDP time stats:

real 8m20,709s  user 1m26,856s  sys 0m5,033s
real 8m21,597s  user 1m28,269s  sys 0m5,025s
real 8m24,025s  user 1m27,100s  sys 0m5,201s
real 8m23,577s  user 1m26,817s  sys 0m5,241s

No real difference: UDP's wall-clock time is about 4% lower, while user and sys times are practically the same.

thesz commented 5 years ago

Heap (+RTS -sFILE) stats.

TCP:

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 0 --listen-port 50001 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers ["127.0.0.1:50002","127.0.0.1:50003","127.0.0.1:50004"] +RTS -N -T -c -sheap

  13,073,957,560 bytes allocated in the heap
   2,260,676,528 bytes copied during GC
      12,924,840 bytes maximum residency (254 sample(s))
         317,528 bytes maximum slop
              35 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10033 colls,     0 par    4.034s   3.936s     0.0004s    0.0021s
  Gen  1       254 colls,     0 par    2.189s   2.179s     0.0086s    0.0197s

  TASKS: 35 (1 bound, 34 peak workers (34 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.004s  (  0.002s elapsed)
  MUT     time   66.587s  (456.871s elapsed)
  GC      time    6.222s  (  6.115s elapsed)
  EXIT    time    0.002s  (  0.003s elapsed)
  Total   time   72.816s  (462.991s elapsed)

  Alloc rate    196,342,780 bytes per MUT second

  Productivity  91.4% of total user, 98.7% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 1 --listen-port 50002 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers ["127.0.0.1:50003","127.0.0.1:50004"] +RTS -N -T -c -sheap

  12,772,602,016 bytes allocated in the heap
   1,908,457,152 bytes copied during GC
      12,424,704 bytes maximum residency (245 sample(s))
         274,728 bytes maximum slop
              35 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9774 colls,     0 par    3.857s   3.737s     0.0004s    0.0021s
  Gen  1       245 colls,     0 par    2.016s   2.007s     0.0082s    0.0191s

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.001s elapsed)
  MUT     time   66.633s  (457.752s elapsed)
  GC      time    5.873s  (  5.745s elapsed)
  EXIT    time    0.003s  (  0.003s elapsed)
  Total   time   72.512s  (463.501s elapsed)

  Alloc rate    191,686,004 bytes per MUT second

  Productivity  91.9% of total user, 98.8% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 2 --listen-port 50003 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers ["127.0.0.1:50004"] +RTS -N -T -c -sheap

  12,197,610,800 bytes allocated in the heap
   1,887,603,752 bytes copied during GC
      12,323,248 bytes maximum residency (236 sample(s))
         273,048 bytes maximum slop
              34 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9263 colls,     0 par    3.644s   3.538s     0.0004s    0.0021s
  Gen  1       236 colls,     0 par    2.109s   2.098s     0.0089s    0.0236s

  TASKS: 37 (1 bound, 36 peak workers (36 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.005s  (  0.003s elapsed)
  MUT     time   64.355s  (457.084s elapsed)
  GC      time    5.752s  (  5.636s elapsed)
  EXIT    time    0.002s  (  0.008s elapsed)
  Total   time   70.115s  (462.732s elapsed)

  Alloc rate    189,536,028 bytes per MUT second

  Productivity  91.8% of total user, 98.8% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

The fourth node didn't finish.

UDP:

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 0 --listen-port 50001 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers ["127.0.0.1:50002","127.0.0.1:50003","127.0.0.1:50004"] --udp +RTS -N -T -c -sheap

  12,352,234,072 bytes allocated in the heap
   1,682,199,616 bytes copied during GC
      13,999,568 bytes maximum residency (223 sample(s))
         781,000 bytes maximum slop
              38 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9389 colls,     0 par    3.892s   3.774s     0.0004s    0.0021s
  Gen  1       223 colls,     0 par    2.027s   2.013s     0.0090s    0.0208s

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.006s  (  0.003s elapsed)
  MUT     time   64.659s  (454.234s elapsed)
  GC      time    5.919s  (  5.787s elapsed)
  EXIT    time    0.002s  (  0.008s elapsed)
  Total   time   70.585s  (460.032s elapsed)

  Alloc rate    191,037,113 bytes per MUT second

  Productivity  91.6% of total user, 98.7% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 1 --listen-port 50002 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers ["127.0.0.1:50003","127.0.0.1:50004"] --udp +RTS -N -T -c -sheap

  12,280,059,704 bytes allocated in the heap
   2,157,058,504 bytes copied during GC
      14,167,256 bytes maximum residency (223 sample(s))
         819,688 bytes maximum slop
              39 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9291 colls,     0 par    3.808s   3.718s     0.0004s    0.0022s
  Gen  1       223 colls,     0 par    2.079s   2.071s     0.0093s    0.0229s

  TASKS: 37 (1 bound, 36 peak workers (36 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.005s  (  0.003s elapsed)
  MUT     time   64.472s  (454.109s elapsed)
  GC      time    5.886s  (  5.788s elapsed)
  EXIT    time    0.002s  (  0.001s elapsed)
  Total   time   70.366s  (459.901s elapsed)

  Alloc rate    190,470,369 bytes per MUT second

  Productivity  91.6% of total user, 98.7% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 2 --listen-port 50003 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers ["127.0.0.1:50004"] --udp +RTS -N -T -c -sheap

  13,122,368,448 bytes allocated in the heap
   2,430,176,304 bytes copied during GC
      14,931,848 bytes maximum residency (239 sample(s))
       1,765,800 bytes maximum slop
              40 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10069 colls,     0 par    4.008s   3.908s     0.0004s    0.0022s
  Gen  1       239 colls,     0 par    2.107s   2.099s     0.0088s    0.0234s

  TASKS: 35 (1 bound, 34 peak workers (34 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.002s elapsed)
  MUT     time   62.781s  (457.596s elapsed)
  GC      time    6.116s  (  6.007s elapsed)
  EXIT    time    0.003s  (  0.007s elapsed)
  Total   time   68.903s  (463.611s elapsed)

  Alloc rate    209,016,833 bytes per MUT second

  Productivity  91.1% of total user, 98.7% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

/home/sz/work/thunders/thundermint/dist-newstyle/build/x86_64-linux/ghc-8.4.4/thundermint-0.1/x/thundermint-coin-node/build/thundermint-coin-node/thundermint-coin-node --node-n 3 --listen-port 50004 --total-nodes 4 --max-h 5 --delay 50 --check-consensus --deposit 1000 --keys 2000 --peers [] --udp +RTS -N -T -c -sheap

  12,169,316,648 bytes allocated in the heap
   1,923,754,728 bytes copied during GC
      14,394,064 bytes maximum residency (223 sample(s))
         845,216 bytes maximum slop
              39 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9184 colls,     0 par    3.688s   3.585s     0.0004s    0.0022s
  Gen  1       223 colls,     0 par    1.925s   1.915s     0.0086s    0.0227s

  TASKS: 35 (1 bound, 34 peak workers (34 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.001s elapsed)
  MUT     time   62.935s  (452.501s elapsed)
  GC      time    5.614s  (  5.500s elapsed)
  EXIT    time    0.002s  (  0.009s elapsed)
  Total   time   68.553s  (458.011s elapsed)

  Alloc rate    193,363,367 bytes per MUT second

  Productivity  91.8% of total user, 98.8% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync: 0
  gen[1].sync: 0

The difference is negligible again.

Conclusion: UDP does not incur any significant cost.

thesz commented 5 years ago

https://docs.google.com/spreadsheets/d/1Rf_qcryFbvBQxFOKNdsbSftK7mymgmG_g4_eg3NN5dw/edit?usp=sharing

Height change for different transports and different delays.

It looks like UDP has some advantage when the delay is bigger.