tigerbeetle/tigerbeetle

The financial transactions database designed for mission critical safety and performance.
https://tigerbeetle.com
Apache License 2.0

LSM performance poor when batches are small #2469

Open gladkikhartem opened 2 weeks ago

gladkikhartem commented 2 weeks ago
  1. Set up a TigerBeetle cluster inside Docker following https://docs.tigerbeetle.com/operating/docker/ with a clean, fresh data store.
  2. Create 10k accounts using the Golang client (v0.16.4), with the History:true flag.
  3. Run a for loop making 1 transaction to TigerBeetle at a time (i.e. batch size = 1, concurrency = 1).

After the cluster is initially created, the latency distribution is as follows (ranges in milliseconds, sample size = 1000):

1-6.7      84.9%  ██████████▏  849
6.7-12.4   7.2%   ▉            72
12.4-18.1  3.6%   ▌            36
18.1-23.8  0.3%   ▏            3
23.8-29.5  0%     ▏            
29.5-35.2  1%     ▏            10
35.2-40.9  1%     ▏            10
40.9-46.6  1.3%   ▏            13
46.6-52.3  0.5%   ▏            5
52.3-58    0.2%   ▏            2
  4. Create 1M transactions in TB via 1000 batches of 1000 transactions.
  5. Repeat the measurement (batch size = 1, concurrency = 1).

Now the latency distribution looks as follows (ranges in milliseconds, sample size = 1000):
    1-87.1       53.1%  ██████████▏  531
    87.1-173.2   41.1%  ███████▊     411
    173.2-259.3  4.1%   ▊            41
    259.3-345.4  0.8%   ▏            8
    345.4-431.5  0.2%   ▏            2
    431.5-517.6  0.2%   ▏            2
    517.6-603.7  0.1%   ▏            1
    603.7-689.8  0%     ▏            
    689.8-775.9  0.1%   ▏            1
    775.9-862    0.3%   ▏            3

Docker stats:

CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O         PIDS
fd603a23591d   wallet_tigerbeetle_1_1   16.67%    2.396GiB / 13.34GiB   17.96%    0B / 0B       3.4GB / 27.2GB    10
8f0ddc1dc408   wallet_tigerbeetle_2_1   16.59%    2.405GiB / 13.34GiB   18.02%    0B / 0B       3.41GB / 27.2GB   10
ebf04163809a   wallet_tigerbeetle_0_1   17.12%    2.388GiB / 13.34GiB   17.89%    0B / 0B       3.39GB / 27.2GB   11

As you can see, TigerBeetle has written 27 GB of data for 1 million transactions inserted (1000 batches of 1000). It also consumes ~15% CPU on a 6-core machine while serving roughly 10 RPS.

If I then make just 1000 transactions (1 transaction per batch), TigerBeetle writes an additional 17 GB of data to disk (instead of ~10 KB, the size of 1000 transactions).

CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O       BLOCK I/O         PIDS
fd603a23591d   wallet_tigerbeetle_1_1   16.63%    2.355GiB / 13.34GiB   17.65%    0B / 0B       3.46GB / 44.5GB   11
8f0ddc1dc408   wallet_tigerbeetle_2_1   16.28%    2.362GiB / 13.34GiB   17.70%    0B / 0B       3.48GB / 44.5GB   10
ebf04163809a   wallet_tigerbeetle_0_1   17.41%    2.348GiB / 13.34GiB   17.60%    0B / 0B       3.46GB / 44.6GB   10

The database sizes are as follows:

du -sh data/*
8,1G    data/0_0.tigerbeetle
8,1G    data/0_1.tigerbeetle
8,1G    data/0_2.tigerbeetle
gladkikhartem commented 2 weeks ago

go version go1.21.0 linux/amd64

lsb_release -a
No LSB modules are available.
Distributor ID: Linuxmint
Description:    Linux Mint 21.1
Release:        21.1
Codename:       vera

cat /etc/os-release
NAME="Linux Mint"
VERSION="21.1 (Vera)"
ID=linuxmint
ID_LIKE="ubuntu debian"
PRETTY_NAME="Linux Mint 21.1"
VERSION_ID="21.1"
HOME_URL="https://www.linuxmint.com/"
SUPPORT_URL="https://forums.linuxmint.com/"
BUG_REPORT_URL="http://linuxmint-troubleshooting-guide.readthedocs.io/en/latest/"
PRIVACY_POLICY_URL="https://www.linuxmint.com/"
VERSION_CODENAME=vera
UBUNTU_CODENAME=jammy

matklad commented 1 week ago

If I then make just 1000 transactions (1 transaction per batch), TigerBeetle writes an additional 17 GB of data to disk (instead of ~10 KB, the size of 1000 transactions).

We do have known space amplification issues with partially full batches! Applications should try to submit full batches, if possible.

That being said, we do have some mitigations in place for this already, so this seems unexpected! Although I am wondering whether 0.16.4 predates those mitigations?

It also consumes ~15% CPU on a 6-core machine while serving roughly 10 RPS.

Right now, TigerBeetle is expected to more or less pin one CPU core regardless of load.

Now the latency distribution looks as follows (ranges in milliseconds, sample size = 1000):

This looks roughly expected --- we are at hundreds of ms per batch worst case at the moment.

gladkikhartem commented 1 week ago

@matklad Thanks for clarifying the expectations!

gladkikhartem commented 1 week ago

@matklad Tested with the latest v0.16.13: a bit more stable, but generally the same behavior.

3M-transaction DB, inserting sequential batches of 1000 transactions per batch, concurrency = 1:

6-96.6       50.2%  ██████████▏  502
96.6-187.2   41%    ████████▏    410
187.2-277.8  5%     █            50
277.8-368.4  1%     ▏            10
368.4-459    0.8%   ▏            8
459-549.6    1.2%   ▎            12
549.6-640.2  0.2%   ▏            2
640.2-730.8  0.2%   ▏            2
730.8-821.4  0.3%   ▏            3
821.4-912    0.1%   ▏            1

3M-transaction DB, inserting sequential batches of 1 transaction per batch, concurrency = 1:

1-48.3       50.1%  ██████████▏  501
48.3-95.6    15.3%  ███▏         153
95.6-142.9   29.3%  █████▉       293
142.9-190.2  2.7%   ▋            27
190.2-237.5  0.5%   ▏            5
237.5-284.8  0.1%   ▏            1
284.8-332.1  0.7%   ▏            7
332.1-379.4  0.6%   ▏            6
379.4-426.7  0.5%   ▏            5
426.7-474    0.2%   ▏            2
gladkikhartem commented 1 week ago

I really don't know why you decided to reinvent the wheel with storage. There are plenty of embedded DBs like RocksDB or even SQLite. With Zig you can just import a C library and use it natively.

Just made a quick test: writing 10K transactions per batch using https://github.com/cockroachdb/pebble gives me

29-33.76     5.7%   █▊           57
33.76-38.52  28.8%  █████████▏   288
38.52-43.28  31.9%  ██████████▏  319
43.28-48.04  13%    ████▏        130
48.04-52.8   7.1%   ██▎          71
52.8-57.56   5.7%   █▊           57
57.56-62.32  1.5%   ▌            15
62.32-67.08  0.1%   ▏            1
67.08-71.84  0.1%   ▏            1

Basically 10X more load with 10X less latency.

Writing 1 transaction at a time gives

3-3.11       94.5%  ██████████▏  

i.e. 50X less latency.

I'm sure RocksDB would be even faster. Not to mention that it has no limits on data structure, so you can store metadata and extend your storage layer without any issues. Backup tools are also available and well tested for RocksDB. It also deals with all the little quirks of how each filesystem handles flush guarantees, to ensure that writes are correctly persisted on disk.

You can also use SQLite without any problems, because you don't delete data, so there's no fragmentation or related issues. It also has decent performance if used correctly.

matklad commented 1 week ago

I really don't know why you decided to reinvent your own wheel with storage.

That's an excellent question! The main explanation is that any database is a result of:

We do think that TigerBeetle's design is qualitatively better for tracking financial transactions! We are at the start of our building-the-database decade, but we are confident in the end result.

These two things about design explain why we are not just slapping Raft on top of SQLite:

So these are the reasons why we do everything from zero, rather than assembling existing building blocks: we need different bricks to get to the level of safety and performance we want.

That being said, it takes a lot of work to build a storage engine. Today, of course, PostgreSQL, RocksDB, and SQLite are more optimized, more featureful, and have more bells and whistles.

It is totally fine if TigerBeetle doesn't quite fulfil your requirements at the moment; you don't have to use it. If a PostgreSQL ledger works for you today, go for it!

gladkikhartem commented 1 week ago

@matklad I looked at the paper and watched the video. The ideas are good and make sense, but there's a big gap between what's advertised and what is actually implemented in the code. I wouldn't waste my time bothering you with all these issues on GitHub if the product performed the way it's advertised on the website. I hope this gap will close over time, and that the issues with TigerBeetle are limited to performance, not to reliability or durability.

I still think you can close this gap simply by using RocksDB as a storage layer. Yes, ideally you want to recover just the sector that was corrupted using your own storage layer, and in theory that will be faster and better. But in reality, it's much easier to have a simple system and throw hardware at the problem: maintain a copy of the data on the same host, replicate the whole DB from another node, or run 5 Raft nodes instead of 3. Hardware is cheap, developers aren't. Complex reliability solutions will actually produce the opposite result: bugs, glitches, lack of people to maintain them, and so on. "The road to hell is paved with good intentions."

Right now TB performs so badly that I have to choose another solution, even with a really strong desire to try TigerBeetle in prod.

jorangreef commented 1 week ago

I wouldn't waste my time bothering you with all these issues on GitHub if the product performed the way it's advertised on the website.

We appreciate the 4 issues you opened. To be fair, 3 of the 4 were opened in the last 2 weeks and are already closed:

In other words, these are not discrepancies from what is advertised, but rather the result of intentional misconfiguration or not following the docs. In addition, your CTO emailed us about this list of issues, and we provided a similar explanation.

Right now TB performs so bad that I have to choose other solution, even with a really huge desire to try out TigerBeetle in prod.

The only remaining issue, then, is this specific issue of write amplification for a cluster under low load, where batch sizes are artificially and intentionally limited to 1 transfer per batch (i.e. batch size = 1, concurrency = 1).

While we're interested in this from a write amplification perspective, and again, while we're looking into it and will likely resolve it soon, from a performance perspective this is not representative of a cluster under load, where batch sizes will naturally approach 8K transfers per batch.

If you believe that another system is able to provide better debit/credit performance than TigerBeetle when under load, for example, able to process 10 billion debit/credit transfers (and not simple key/value inserts—comparing apples to apples) with better performance, then you should use it!

I still think you can close this gap simply by using RocksDB as a storage layer. Yes, ideally you want to recover just the sector that was corrupted using your own storage layer, and in theory that will be faster and better. But in reality, it's much easier to have a simple system and throw hardware at the problem

In fact, it's not only a matter of cost, but also correctness. The literature has shown that combining Raft and RocksDB in the way you describe is not safe. For example, a single sector error on one machine can propagate through the consensus protocol and lead to global cluster data loss. This is why TigerBeetle goes to such lengths, to co-design the global consensus protocol and local storage engine. We would rather get the foundational design right, and we're happy to invest in it for the long term.

For this reason, we're more likely to resolve this specific issue soon with further write amplification optimizations, than to replace TigerBeetle's entire local storage engine and global consensus protocol with non-deterministic, non-storage fault-tolerant, dynamic implementations, which were explicitly not designed to provide the same guarantees that TigerBeetle provides.

In the interest of mutual respect, I'm going to lock this issue until the issue of write amplification under low load is resolved.