tigerbeetle/tigerbeetle

The financial transactions database designed for mission critical safety and performance.
https://tigerbeetle.com
Apache License 2.0

LSM performance poor when batches are small #2469

Open gladkikhartem opened 2 weeks ago

gladkikhartem commented 2 weeks ago
  1. Set up a TigerBeetle cluster inside Docker following https://docs.tigerbeetle.com/operating/docker/ with a clean, fresh data store.
  2. Create 10k accounts using the Golang client (v0.16.4), with the History:true flag.
  3. Run a for loop making 1 transaction to TigerBeetle at a time (i.e. batch size = 1, concurrency = 1).

After the cluster is initially created, the latency distribution is as follows (ranges in milliseconds, sample size = 1000):

1-6.7      84.9%  ██████████▏  849
6.7-12.4   7.2%   ▉            72
12.4-18.1  3.6%   ▌            36
18.1-23.8  0.3%   ▏            3
23.8-29.5  0%     ▏            
29.5-35.2  1%     ▏            10
35.2-40.9  1%     ▏            10
40.9-46.6  1.3%   ▏            13
46.6-52.3  0.5%   ▏            5
52.3-58    0.2%   ▏            2
  4. Create 1M transactions in TB via 1000 batches of 1000 transactions.
  5. Repeat the measurement (batch size = 1, concurrency = 1).

Now the latency distribution looks as follows (ranges in milliseconds, sample size = 1000):
    1-87.1       53.1%  ██████████▏  531
    87.1-173.2   41.1%  ███████▊     411
    173.2-259.3  4.1%   ▊            41
    259.3-345.4  0.8%   ▏            8
    345.4-431.5  0.2%   ▏            2
    431.5-517.6  0.2%   ▏            2
    517.6-603.7  0.1%   ▏            1
    603.7-689.8  0%     ▏            
    689.8-775.9  0.1%   ▏            1
    775.9-862    0.3%   ▏            3

Docker stats:

CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O         PIDS
fd603a23591d   wallet_tigerbeetle_1_1   16.67%    2.396GiB / 13.34GiB   17.96%    0B / 0B       3.4GB / 27.2GB    10
8f0ddc1dc408   wallet_tigerbeetle_2_1   16.59%    2.405GiB / 13.34GiB   18.02%    0B / 0B       3.41GB / 27.2GB   10
ebf04163809a   wallet_tigerbeetle_0_1   17.12%    2.388GiB / 13.34GiB   17.89%    0B / 0B       3.39GB / 27.2GB   11

As you can see, TigerBeetle has written 27 GB of data for 1 million transactions inserted (1000 batches of 1000). It also consumes ~15% CPU on a 6-core machine while serving roughly 10 RPS.

If I then make just 1000 transactions (1 transaction per batch), TigerBeetle writes an additional 17 GB of data to disk (instead of ~10 KB, the size of 1000 transactions).

CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O         PIDS
fd603a23591d   wallet_tigerbeetle_1_1   16.63%    2.355GiB / 13.34GiB   17.65%    0B / 0B       3.46GB / 44.5GB   11
8f0ddc1dc408   wallet_tigerbeetle_2_1   16.28%    2.362GiB / 13.34GiB   17.70%    0B / 0B       3.48GB / 44.5GB   10
ebf04163809a   wallet_tigerbeetle_0_1   17.41%    2.348GiB / 13.34GiB   17.60%    0B / 0B       3.46GB / 44.6GB   10

The database sizes are as follows:

du -sh data/*
8,1G    data/0_0.tigerbeetle
8,1G    data/0_1.tigerbeetle
8,1G    data/0_2.tigerbeetle
gladkikhartem commented 2 weeks ago

go version go1.21.0 linux/amd64

lsb_release -a
No LSB modules are available.
Distributor ID: Linuxmint
Description:    Linux Mint 21.1
Release:        21.1
Codename:       vera

cat /etc/os-release
NAME="Linux Mint"
VERSION="21.1 (Vera)"
ID=linuxmint
ID_LIKE="ubuntu debian"
PRETTY_NAME="Linux Mint 21.1"
VERSION_ID="21.1"
HOME_URL="https://www.linuxmint.com/"
SUPPORT_URL="https://forums.linuxmint.com/"
BUG_REPORT_URL="http://linuxmint-troubleshooting-guide.readthedocs.io/en/latest/"
PRIVACY_POLICY_URL="https://www.linuxmint.com/"
VERSION_CODENAME=vera
UBUNTU_CODENAME=jammy

matklad commented 1 week ago

If I then make just 1000 transactions (1 transaction per batch), TigerBeetle writes an additional 17 GB of data to disk (instead of ~10 KB, the size of 1000 transactions).

We do have known space amplification issues with partially full batches! Applications should try to submit full batches, if possible.

That being said, we do have some mitigations in place for this already, so this seems unexpected! Although I am wondering whether 0.16.4 predates those mitigations?

It also consumes ~15% CPU on a 6-core machine while serving roughly 10 RPS.

Right now, TigerBeetle is expected to more or less pin one CPU core regardless of load.

Now the latency distribution looks as follows (ranges in milliseconds, sample size = 1000):

This looks roughly expected --- we are at hundreds of ms per batch worst case at the moment.

gladkikhartem commented 1 week ago

@matklad Thanks for clarifying the expectations!

gladkikhartem commented 1 week ago

@matklad Tested with the latest v0.16.13: a bit more stable, but generally the same behavior.

3M-transaction DB, inserting sequential batches of 1000 transactions per batch, concurrency = 1:

6-96.6       50.2%  ██████████▏  502
96.6-187.2   41%    ████████▏    410
187.2-277.8  5%     █            50
277.8-368.4  1%     ▏            10
368.4-459    0.8%   ▏            8
459-549.6    1.2%   ▎            12
549.6-640.2  0.2%   ▏            2
640.2-730.8  0.2%   ▏            2
730.8-821.4  0.3%   ▏            3
821.4-912    0.1%   ▏            1

3M-transaction DB, inserting sequential batches of 1 transaction per batch, concurrency = 1:

1-48.3       50.1%  ██████████▏  501
48.3-95.6    15.3%  ███▏         153
95.6-142.9   29.3%  █████▉       293
142.9-190.2  2.7%   ▋            27
190.2-237.5  0.5%   ▏            5
237.5-284.8  0.1%   ▏            1
284.8-332.1  0.7%   ▏            7
332.1-379.4  0.6%   ▏            6
379.4-426.7  0.5%   ▏            5
426.7-474    0.2%   ▏            2
gladkikhartem commented 1 week ago

I really don't know why you decided to reinvent the wheel with storage. There are plenty of embedded DBs like RocksDB or even SQLite. With Zig you can just import a C library and use it natively.

Just made a quick test: writing 10K transactions per batch using https://github.com/cockroachdb/pebble gives me

29-33.76     5.7%   █▊           57
33.76-38.52  28.8%  █████████▏   288
38.52-43.28  31.9%  ██████████▏  319
43.28-48.04  13%    ████▏        130
48.04-52.8   7.1%   ██▎          71
52.8-57.56   5.7%   █▊           57
57.56-62.32  1.5%   ▌            15
62.32-67.08  0.1%   ▏            1
67.08-71.84  0.1%   ▏            1

Basically 10X more load with 10X less latency.

Writing 1 transaction at a time gives

3-3.11       94.5%  ██████████▏  

i.e. 50X less latency.

I'm sure RocksDB would be even faster. Not to mention that it has no limits on data structure, so you can store metadata and extend your storage layer without any issues. Backup tools are also available and well tested for RocksDB. It also deals with all the little quirks of how each filesystem handles flush guarantees, to ensure that writes are correctly persisted on disk.

You can also use SQLite without any problems, because you don't delete data, so there's no fragmentation or related issues. It also has decent performance if used correctly.

matklad commented 1 week ago

I really don't know why you decided to reinvent your own wheel with storage.

That's an excellent question! The main explanation is that any database is a result of:

We do think that TigerBeetle's design is qualitatively better for tracking financial transactions! We are at the start of our building-the-database decade, but we are confident in the end result.

These two things about design explain why we are not just slapping Raft on top of SQLite:

So these are the reasons why we do everything from zero, rather than assembling existing building blocks: we need different bricks to get to the level of safety and performance we want.

That being said, it takes a lot of work to build a storage engine. Today, of course, PostgreSQL, RocksDB, and SQLite are more optimized, more featureful, and have more bells and whistles.

It is totally fine if TigerBeetle doesn't quite fulfil your requirements at the moment; you don't have to use it. If a PostgreSQL ledger works for you today, go for it!

gladkikhartem commented 1 week ago

@matklad I looked at the paper and watched the video. The ideas are good and make sense, but there's a big gap between what's advertised and what is actually implemented in the code. I wouldn't waste my time bothering you with all these issues on GitHub if the product performed the way it's advertised on the website. I hope this gap will close over time, and that the issues with TigerBeetle are limited to performance, not to reliability or durability.

I still think you can close this gap simply by using RocksDB as a storage layer. Yes, ideally you want to recover just the sector that was corrupted using your own storage layer, and in theory that will be faster and better. But in reality, it's much easier to have a simple system and throw hardware at the problem: maintain a copy of the data on the same host, replicate the whole DB from another node, or run 5 Raft nodes instead of 3. Hardware is cheap, developers aren't. Complex reliability solutions will actually produce the opposite result: bugs, glitches, lack of people to maintain them, and so on. "The road to hell is paved with good intentions."

Right now TB performs so badly that I have to choose another solution, even with a really strong desire to try TigerBeetle in prod.

jorangreef commented 1 week ago

I wouldn't waste my time bothering you with all these issues on GitHub if the product performed the way it's advertised on the website.

We appreciate the 4 issues you opened. To be fair, 3 of the 4 were opened in the last 2 weeks and are already closed:

In other words, these are not discrepancies from what is advertised, but rather the result of intentional misconfiguration or not following the docs. In addition, your CTO emailed us about this list of issues, and we provided a similar explanation.

Right now TB performs so bad that I have to choose other solution, even with a really huge desire to try out TigerBeetle in prod.

The only remaining issue, then, is this specific issue of write amplification for a cluster under low load, where batch sizes are artificially and intentionally limited to 1 transfer per batch (i.e. batch size = 1, concurrency = 1).

While we're interested in this from a write amplification perspective, and again, while we're looking into it and will likely resolve it soon, from a performance perspective this is not representative of a cluster under load, where batch sizes will naturally approach 8K transfers per batch.

If you believe that another system is able to provide better debit/credit performance than TigerBeetle when under load, for example, able to process 10 billion debit/credit transfers (and not simple key/value inserts—comparing apples to apples) with better performance, then you should use it!

I still think you can close this gap simply by using RocksDB as a storage layer. Yes, ideally you want to recover just the sector that was corrupted using your own storage layer, and in theory that will be faster and better. But in reality, it's much easier to have a simple system and throw hardware at the problem

In fact, it's not only a matter of cost, but also correctness. The literature has shown that combining Raft and RocksDB in the way you describe is not safe. For example, a single sector error on one machine can propagate through the consensus protocol and lead to global cluster data loss. This is why TigerBeetle goes to such lengths, to co-design the global consensus protocol and local storage engine. We would rather get the foundational design right, and we're happy to invest in it for the long term.

For this reason, we're more likely to resolve this specific issue soon with further write amplification optimizations, than to replace TigerBeetle's entire local storage engine and global consensus protocol with non-deterministic, non-storage fault-tolerant, dynamic implementations, which were explicitly not designed to provide the same guarantees that TigerBeetle provides.

In the interest of mutual respect, I'm going to lock this issue until the issue of write amplification under low load is resolved.