Investigate performance at high TX rates

Shimuuar commented 6 years ago

If transaction rate is too high to be processed thundermint starts to slow into craw with blocks taking longer and longer to commit:

commit_time

Following plot demonstrates state transitions for 4 nodes.

download

Heights are odd lines, and steps are even ones. It could be ween that prevotes take successively more times at first 10 height. Exact causes are not clear but it could be due to network congestion because node is not able to process incoming transactions

We need better (and cheap from performance standpoint) profiling tools
If that's so we need to change implementation of mock network since it has unbounded buffers and send always succeds
Probably we need to prioritize gossip of votes/blocks over gossip of transactions so that latter wouldn't slow down gossip of former

Shimuuar commented 5 years ago

Here is analysis of duration of different steps. Network run with significant load (avg. block size=130 tx). Yet it was stable for ~1400 blocks.

Entering new height

This one is supposed to be of fixed duration and its purpose is to collect straggler votes for previous block before creating new one Screenshot_20190712_193149

It's supposed to be δ-function at 500ms yet it shows exponential decay with 20ms constant. Questions:

Is delay caused by consensus engine being busy with something
Alternatively is threadDelay that jittery

Prevote

Prevote is quite interesting Screenshot_20190712_193748 Distribution peaks at around of ~200ms so propagation of messages (over loopback!) takes around that much time.

Precommit

It's even more interesting: Screenshot_20190712_194055

Notice peak at 0. That means that node enters commit immediately after entering precommit! (All blocks were committed at round 0)

Commiting block

Most damning picture is time spend committing block: Screenshot_20190712_194236

It's not clear whether we spend that much time clearing mempool (entirely possible) or writing to in-memory DB (doubtful)

Shimuuar commented 5 years ago

Задержка во времени коммита блока связана с фильтрацией мемпула. При увеличении потока транзакций в 3 раза задержка вырастает со 130 до 400.

Тогда как измерение действительно времени таймаута показывает, что threadDelay занимает не более чем 501 мс, а остальные они проводят внутри обработчиков. Второй источник тормозов найден

Shimuuar commented 5 years ago

Больше интересных. Теперь распределения времени обработки сообщений в движке госсипа. Достигнутая высота 717, что даёт ~20 сообщений на блок. Полные коммуникации дают 1+1+4+4 сообщений, т.е. оверхед от госсипа составляет около 2

Время ожидания сообщения

Что итересно мы видим два пика на 69 и 87 мкс. Доля сообщений, которых ждали меньше 10мкс мала (~3.5%) и, вероятно, соответсвует тому, что в канале уже лежит сообщение Screenshot_20190716_104529

Запись в WAL

Работа с WAL отжирает не очень много времени, в среднем (0.3мс) но сообщений много. Около половины из них дублируются и если фильтровать такие сообщения раньше, это может позволить увеличить производительность. Однако есть бОльшие проблемы с производительностью Screenshot_20190716_104547

Собственно обработка сообщений

Здесь творится полный ад. Время обработки сообщения может превышать полсекунды! Screenshot_20190716_104600

Shimuuar commented 5 years ago

Дальнейшее исследование медленных сообщений показало:

Сообщения, которые обрабатываются >10ms

    150 "RxTimeout"
     86 "RxPreVote"
     25 "RxPreCommit"
     14 "RxProposal"

Больше 100мс

     92 "RxTimeout" (!)
      2 "RxPreVote"

Кроме того самые медленные таймауты (более полный набор данных)

     24 "StepNewHeight"   -- → enterPropose
    120 "StepProposal"    -- → enterPrevote

В enterPropose мы генерируем новый блок, в enterPrevote валидируем его. Всё опять указывает на тормоза в обработке блока

Shimuuar commented 5 years ago

A lot of weird delays from plots above came from the fact that all nodes shared same core! Setting CPU affinity per process made things much more sane

I think that we have now general understanding what is delays. At high TX rate main sourse of delays are:

Block generations
Block verification
Mempool filtering

Everything else have minimal performance effect. Thus closing issue

hexresearch / hschain