Open gladkikhartem opened 2 weeks ago
go version go1.21.0 linux/amd64
lsb_release -a
No LSB modules are available.
Distributor ID: Linuxmint
Description:    Linux Mint 21.1
Release:        21.1
Codename:       vera

cat /etc/os-release
NAME="Linux Mint"
VERSION="21.1 (Vera)"
ID=linuxmint
ID_LIKE="ubuntu debian"
PRETTY_NAME="Linux Mint 21.1"
VERSION_ID="21.1"
HOME_URL="https://www.linuxmint.com/"
SUPPORT_URL="https://forums.linuxmint.com/"
BUG_REPORT_URL="http://linuxmint-troubleshooting-guide.readthedocs.io/en/latest/"
PRIVACY_POLICY_URL="https://www.linuxmint.com/"
VERSION_CODENAME=vera
UBUNTU_CODENAME=jammy
If I make just 1000 transactions (1 transaction per batch), TigerBeetle writes an additional 17 (!) GB of data to disk, instead of the ~10 KB those 1000 transactions take up.
We do have known space amplification issues with partially full batches! Applications should try to submit full batches, if possible.
That being said, we do have some mitigations in place for this already, and this seems unexpected! Although I am wondering if 0.16.4 predates that?
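To make the batching suggestion concrete, here is a minimal sketch of how an application could coalesce transfers into full batches instead of submitting one per request. The Transfer type, the limits, and the Submit callback are placeholders, not the client library's API; in practice Submit would wrap the client's create-transfers call.

```go
package batcher

import "time"

// Transfer is a placeholder for the client library's transfer type.
type Transfer struct {
	ID              uint64
	DebitAccountID  uint64
	CreditAccountID uint64
	Amount          uint64
}

// Batcher coalesces individual transfers into batches of up to MaxBatch,
// flushing when a batch fills up or when MaxDelay elapses, so that a full
// batch amortizes the per-request cost while low traffic still gets
// bounded latency.
type Batcher struct {
	In       chan Transfer
	MaxBatch int
	MaxDelay time.Duration
	Submit   func([]Transfer) error // e.g. a wrapper around the client's create-transfers call
}

func (b *Batcher) Run() {
	batch := make([]Transfer, 0, b.MaxBatch)
	timer := time.NewTimer(b.MaxDelay)
	defer timer.Stop()
	for {
		select {
		case t := <-b.In:
			batch = append(batch, t)
			if len(batch) == b.MaxBatch {
				batch = b.flush(batch)
				timer.Reset(b.MaxDelay)
			}
		case <-timer.C:
			batch = b.flush(batch)
			timer.Reset(b.MaxDelay)
		}
	}
}

func (b *Batcher) flush(batch []Transfer) []Transfer {
	if len(batch) == 0 {
		return batch
	}
	_ = b.Submit(batch) // error handling elided for brevity
	return batch[:0]
}
```

The extra delay bound is an application choice; the point is simply that a full batch spreads the per-batch write and consensus cost across many transfers.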
It also consumes 15% CPU on a 6-core machine while doing roughly 10 RPS.
Right now, TigerBeetle is expected to more or less pin one CPU core regardless of load.
Now the latency distribution looks like the following (ranges in milliseconds, measurement sample size = 1000):
This looks roughly expected --- we are at hundreds of ms per batch worst case at the moment.
@matklad Thanks for clarifying the expectations!
@matklad Tested with the latest v0.16.13 version - a bit more stable, but generally the same behavior.

DB with 3M transactions, inserting sequential batches of 1000 transactions per batch, concurrency = 1:
6-96.6 50.2% ██████████▏ 502
96.6-187.2 41% ████████▏ 410
187.2-277.8 5% █ 50
277.8-368.4 1% ▏ 10
368.4-459 0.8% ▏ 8
459-549.6 1.2% ▎ 12
549.6-640.2 0.2% ▏ 2
640.2-730.8 0.2% ▏ 2
730.8-821.4 0.3% ▏ 3
821.4-912 0.1% ▏ 1
DB with 3M transactions, inserting sequential batches of 1 transaction per batch, concurrency = 1:
1-48.3 50.1% ██████████▏ 501
48.3-95.6 15.3% ███▏ 153
95.6-142.9 29.3% █████▉ 293
142.9-190.2 2.7% ▋ 27
190.2-237.5 0.5% ▏ 5
237.5-284.8 0.1% ▏ 1
284.8-332.1 0.7% ▏ 7
332.1-379.4 0.6% ▏ 6
379.4-426.7 0.5% ▏ 5
426.7-474 0.2% ▏ 2
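For reference, here is a minimal sketch of how per-batch latencies like the ones above can be collected and bucketed into a histogram. The bucket count, output format, and the submit callback are assumptions, not the reporter's actual harness.

```go
package bench

import (
	"fmt"
	"time"
)

// MeasureBatches calls submit n times, records each call's latency, and
// prints an equal-width histogram over the observed range, similar in
// spirit to the distributions shown in this thread.
func MeasureBatches(n, buckets int, submit func() error) {
	samples := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := submit(); err != nil {
			fmt.Println("submit failed:", err)
			continue
		}
		samples = append(samples, time.Since(start))
	}
	if len(samples) == 0 {
		return
	}

	// Find the observed range and split it into equal-width buckets.
	lo, hi := samples[0], samples[0]
	for _, s := range samples {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	width := (hi - lo) / time.Duration(buckets)
	if width == 0 {
		width = 1
	}
	counts := make([]int, buckets)
	for _, s := range samples {
		b := int((s - lo) / width)
		if b >= buckets {
			b = buckets - 1
		}
		counts[b]++
	}
	for b, c := range counts {
		from := lo + time.Duration(b)*width
		to := from + width
		fmt.Printf("%.1f-%.1f ms  %.1f%%  %d\n",
			float64(from)/1e6, float64(to)/1e6,
			100*float64(c)/float64(len(samples)), c)
	}
}
```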
I really don't know why you decided to reinvent your own wheel with storage. There are plenty of embedded DBs like RocksDB or even SQLite. With Zig you can just import a C library and use it natively.
Just made a quick test: writing 10K transactions per batch using https://github.com/cockroachdb/pebble gives me:
29-33.76 5.7% █▊ 57
33.76-38.52 28.8% █████████▏ 288
38.52-43.28 31.9% ██████████▏ 319
43.28-48.04 13% ████▏ 130
48.04-52.8 7.1% ██▎ 71
52.8-57.56 5.7% █▊ 57
57.56-62.32 1.5% ▌ 15
62.32-67.08 0.1% ▏ 1
67.08-71.84 0.1% ▏ 1
Basically 10X more load with 10X less latency.
Writing 1 transaction at a time gives
3-3.11 94.5% ██████████▏
i.e. 50X less latency.
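For context, here is a rough sketch of the kind of Pebble batch-write test described above. The key/value layout, value size, batch counts, and the use of synchronous commits are assumptions about the test, not a reproduction of it.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"time"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("bench-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const batches, perBatch = 100, 10_000
	value := make([]byte, 128) // assumed size of one transfer record

	var seq uint64
	for i := 0; i < batches; i++ {
		start := time.Now()
		b := db.NewBatch()
		for j := 0; j < perBatch; j++ {
			key := make([]byte, 8)
			binary.BigEndian.PutUint64(key, seq)
			seq++
			if err := b.Set(key, value, nil); err != nil {
				log.Fatal(err)
			}
		}
		// Commit with fsync so the numbers include durable writes.
		if err := b.Commit(pebble.Sync); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("batch %d: %v\n", i, time.Since(start))
	}
}
```

Note that this measures plain key/value inserts, without the balance checks or replication that a debit/credit transfer involves, which is the apples-to-apples caveat raised later in this thread; the latency also depends heavily on whether the commit uses pebble.Sync or pebble.NoSync.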
I'm sure RocksDB will be even faster. Not to mention that it has no limits on data structure, so you can store metadata and extend your storage layer without any issues. Backup tools are also available and tested for RocksDB. It also deals with all the little quirks of how each FS handles flush guarantees and the like, to ensure that writes are correctly persisted on disk.
You can also use SQLite without any problems, because you don't delete data, so there's no fragmentation or other related issues. It also has decent performance if used correctly.
I really don't know why you decided to reinvent your own wheel with storage.
That's an excellent question! The main explanation is that any database is a result of two things: its design, and the implementation of that design.
We do think that TigerBeetle's design is qualitatively better for tracking financial transactions! We are at the start of our building-the-database decade, but we are confident in the end result.
These two things about design explain why we are not just slapping Raft on top of SQLite:
We want TigerBeetle to be resilient to disk corruption. There were some discussions recently about how SQLite doesn't use checksums. This makes total sense for SQLite, which is a single-node embedded database; there's just not much you can do if your single disk goes bad. But if you are building a replicated database, it is just wasteful not to take advantage of cluster redundancy to repair disk faults.
But fixing that requires that the storage layer and consensus layer are aware of each other; you can't just compose this out of SQLite and a Raft library. See this paper and this talk for details.
In terms of code, the fact that this callback doesn't have an error case is what motivates quite a bit of the design here!
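As a purely conceptual illustration (this is not TigerBeetle's code; the types, the checksum choice, and the peer-fetch mechanism are simplifications), the shape of a read path whose callback has no error case, because local faults are repaired from cluster redundancy, might look like this:

```go
package storage

import "hash/crc32"

// Block is a checksummed unit of storage.
type Block struct {
	Checksum uint32
	Data     []byte
}

// Replica holds a (possibly faulty) local store and a way to fetch
// blocks from the rest of the cluster.
type Replica struct {
	local     map[uint64]Block
	fromPeers func(addr uint64) Block // stands in for a network request
}

func checksumOK(b Block) bool {
	return crc32.ChecksumIEEE(b.Data) == b.Checksum
}

// ReadBlock always delivers a valid block to the callback. There is no
// error case: if the local copy is missing or fails its checksum, the
// block is fetched from a peer and the local copy is repaired, instead
// of surfacing a storage error to the caller.
func (r *Replica) ReadBlock(addr uint64, callback func(Block)) {
	if b, ok := r.local[addr]; ok && checksumOK(b) {
		callback(b)
		return
	}
	b := r.fromPeers(addr) // repair path: use cluster redundancy
	r.local[addr] = b
	callback(b)
}
```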
We want TigerBeetle to do accounting really fast, and for that we need to move business logic over to the database, as SQL-mediated row locks across the network are the bottleneck for highly contended workloads. This part we could've done by building on top of some other storage engine, but we also want our storage resilience!
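To make the contention point concrete, here is a rough sketch of the SQL-ledger pattern being contrasted; the schema and statements are my own illustration, not from this thread. Each statement is a separate network round trip, and the row locks taken by SELECT ... FOR UPDATE are held until commit, so transfers touching a hot account serialize behind one another:

```go
package ledger

import (
	"context"
	"database/sql"
	"errors"
)

var ErrInsufficientFunds = errors.New("insufficient funds")

// Transfer moves amount between two accounts using row locks. The locks on
// both accounts are held across several network round trips until Commit,
// which is what limits throughput on highly contended accounts.
func Transfer(ctx context.Context, db *sql.DB, from, to, amount int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction was committed

	var balance int64
	if err := tx.QueryRowContext(ctx,
		`SELECT balance FROM accounts WHERE id = $1 FOR UPDATE`, from,
	).Scan(&balance); err != nil {
		return err
	}
	if balance < amount {
		return ErrInsufficientFunds
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE accounts SET balance = balance - $1 WHERE id = $2`, amount, from); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE accounts SET balance = balance + $1 WHERE id = $2`, amount, to); err != nil {
		return err
	}
	return tx.Commit()
}
```

Executing the debit/credit logic inside the database avoids holding locks across the network for the duration of each transfer.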
So these are the reasons why we build everything from scratch, rather than assembling existing building blocks: we need different bricks to get to the level of safety and performance we want.
That being said, it takes a lot of work to build a storage engine. Today, of course, PostgreSQL, RocksDB, and SQLite are more optimized, more featureful, and have more bells and whistles.
It is totally fine if TigerBeetle doesn't quite fulfil your requirements at the moment; you don't have to use it. If the PostgreSQL ledger works for you today, go for it!
@matklad I looked at the paper and watched the video. The ideas are good and make sense, but there's a big gap between what's advertised and what is actually implemented in the code. I wouldn't waste my time bothering you with all those issues on GitHub if the product performed the way it's advertised on the website. I hope this gap will close over time and that the issues with TigerBeetle are limited to performance, not to reliability or durability.
I still think you can close this gap simply by using RocksDB as a storage layer. Yes, ideally you want to recover just the sector that was corrupted by using your own storage layer, and in theory it will be faster and better. But in reality, it's much easier to have a simple system and just throw hardware at the problem: maintain a copy of the data on the same host, replicate the whole DB from another node, or run 5 Raft nodes instead of 3. Hardware is cheap, developers aren't. Complex reliability solutions will actually produce the opposite result: bugs, glitches, a lack of people to maintain them, and so on. "The road to hell is paved with good intentions."
Right now TB performs so badly that I have to choose another solution, even with a really strong desire to try out TigerBeetle in prod.
I wouldn't waste my time bothering you with all those issues on GitHub if the product performed the way it's advertised on the website.
We appreciate the 4 issues you opened. To be fair, 3 of the 4 were opened in the last 2 weeks and are already closed:
In other words, these are not discrepancies from what is advertised, but rather the result of intentional misconfiguration or not following the docs. In addition, your CTO emailed us about this list of issues, and we provided a similar explanation.
Right now TB performs so badly that I have to choose another solution, even with a really strong desire to try out TigerBeetle in prod.
The only remaining issue here then is this specific issue of write amplification for a cluster under low load, where batch sizes are being artificially and intentionally limited to 1 transfer per batch (i.e. batch size = 1, concurrency = 1).
While we're interested in this from a write amplification perspective, and again, while we're looking into it and will likely resolve it soon, from a performance perspective this is not representative of a cluster under load, where batch sizes will naturally approach 8K transfers per batch.
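For a rough sense of scale, taking the figures reported earlier in this thread at face value (a back-of-envelope reading, not an official measurement): 17 GB of additional writes across 1000 single-transfer batches works out to roughly 17 MB written per batch, against a 128-byte transfer payload per batch, i.e. write amplification on the order of 10^5 for this degenerate batch-size-1 workload.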
If you believe that another system is able to provide better debit/credit performance than TigerBeetle when under load, for example, able to process 10 billion debit/credit transfers (and not simple key/value inserts—comparing apples to apples) with better performance, then you should use it!
I still think you can close this gap simply by using RocksDB as a storage layer. Yes, ideally you want to recover just the sector that was corrupted by using your own storage layer, and in theory it will be faster and better. But in reality, it's much easier to have a simple system and just throw hardware at the problem
In fact, it's not only a matter of cost, but also correctness. The literature has shown that combining Raft and RocksDB in the way you describe is not safe. For example, a single sector error on one machine can propagate through the consensus protocol and lead to global cluster data loss. This is why TigerBeetle goes to such lengths, to co-design the global consensus protocol and local storage engine. We would rather get the foundational design right, and we're happy to invest in it for the long term.
For this reason, we're more likely to resolve this specific issue soon with further write amplification optimizations, than to replace TigerBeetle's entire local storage engine and global consensus protocol with non-deterministic, non-storage fault-tolerant, dynamic implementations, which were explicitly not designed to provide the same guarantees that TigerBeetle provides.
In the interest of mutual respect, I'm going to lock this issue until the issue of write amplification under low load is resolved.
From the original issue description: per docker stats, TigerBeetle has written 27 GB (!) of data for 1 million transactions inserted (1000 batches of 1000), while consuming ~15% CPU on a 6-core machine at roughly 10 RPS. Writing just 1000 more transactions (1 transaction per batch) adds another 17 GB (!) of writes, versus the ~10 KB those 1000 transactions take up.