Open gembin opened 2 weeks ago
Thanks!
You should absolutely never access the DB file directly. The problem with this is that it would bring all nodes out of sync. All actions other than reading data must go through the Raft for consistency across all nodes.
I would highly suggest embedding it into your application like in the examples, yes. In this scenario, you would benefit the most from running SQLite in comparison to a traditional database like Postgres. When you start the node, you will get a `hiqlite::Client` back. This client handles everything automatically, like making sure executes always run on the leader, switching leaders if necessary, and handling network issues and re-connects, and it opens a connection pool behind the scenes as well. You will also get many functions to work with the database with the least amount of overhead in comparison to existing solutions like simple async wrappers around `rusqlite`, for instance.
The other way of connecting would be using `hiqlite::Client::remote()`, like shown in the bench example. This client can connect to a cluster that is running standalone or embedded into another application. You should only use this for good reason though, because a remote client loses the ability to do super fast local reads, which is one of the biggest advantages of this system.
Thanks a lot for your quick response and detailed answer!
Embedding it into an application makes sense to me. What about these use cases? Should app2 use `hiqlite::Client::remote()`, or should app2 be part of the cluster if it's on a different host?

You could use `hiqlite::start_node()`. This will also create a new network / server for you, which you could route over different physical wires, as Hiqlite can saturate a 1 GBit/s connection if your machine is fast enough.

Thanks for your answers. `hiqlite::start_node()` will create its own network per server, which doesn't seem very scalable, especially if there are thousands of databases.

`hiqlite::start_node()` will create its own network per server, which doesn't seem very scalable
Actually, the opposite is true. Hiqlite can saturate a 1 GBit/s network connection with database traffic alone. Those networks are cheap; 10G and up are not. So you could just add multiple smaller networks without losing throughput.
(..) especially if there are thousands of databases.
If you want thousands of databases on the same host, SQLite is not the one for you imho, at least not in this configuration. Turso does something like this, or lets you do it, but you would not have nearly the same level of consistency, and the reason you typically create multiple DBs with SQLite is to overcome the single-writer issue. This is not such an issue with Hiqlite because of its internal design.
In tests it can easily reach over 30k write TPS. Let's assume you are inserting rather small rows with just a few columns, 56 bytes per row (the TPS don't really go down even if you write 10 times this much data with each query); then you would end up writing ~5.7 GB per hour into your database. If you do things like that, you should absolutely consider running a big, dedicated Postgres cluster or something like that.
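As a quick sanity check on these numbers (my own back-of-the-envelope sketch, not anything from the Hiqlite codebase): 30k writes per second at 56 bytes per row works out to about 6.0 GB per hour, or ~5.6 GiB, which matches the ~5.7 GB figure above when counted in binary units.

```rust
// Back-of-the-envelope write volume for a sustained insert workload.
// Inputs are the assumptions from the discussion (30k write TPS,
// 56 bytes per row), not measured values.

fn bytes_per_hour(tps: u64, row_bytes: u64) -> u64 {
    tps * row_bytes * 3600
}

fn main() {
    let total = bytes_per_hour(30_000, 56); // 6_048_000_000 bytes
    let gib = total as f64 / (1024.0 * 1024.0 * 1024.0);
    println!("{total} bytes/hour = {gib:.1} GiB/hour");
}
```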
If you would write time series data for instance (which you should use a proper DB for), and you can write data in batches, Hiqlite lets you easily write more than 50GB per hour.
If you want to "scale" beyond that, and a small footprint and saving resources are not your primary goals, run a proper, dedicated Postgres cluster imho. You would then run a DB pool with hundreds of clients, while each connection uses ~10 MB of memory on the DB. If you have a pool of 500 connections, it would need 5 GB just for handling the connections, but at the same time you are writing very high amounts of data.
Hiqlite makes it easy to get rid of a central, big database server all applications rely on. If you run multiple services, each of them could just embed its own Hiqlite database. A `SELECT` using Hiqlite is ~10 times faster than it is on a Postgres in a fast network (or any other network database), even on the same host, because you won't have the network in between. That's its strength, not scaling to thousands of databases on the same host.
But even if you would do that, thousands of networks on the same host are not bad; it would just mean ~2000 connected ports, which is a rather small number for a server. It is so efficient that even if you would spin up 1000 x `hiqlite::start_node()`, it would use less resources than a Postgres DB with 1000 client connections. The other thing you must not forget is that you could start each of these instances using another SSD under the hood.
I don't see a reason to avoid using ORM
I think ORMs are a super bad idea.
Mapping query results into a `struct` is simple with an ORM (after you learned the DSL), but it is simple with just SQL as well. You end up with `struct`s which map to rows or tables, but you often don't need all values each time. Since you are using an ORM, you would go the easy way and fetch the whole row instead of just `SELECT`ing the parts you really need, which leads to wasted resources and bandwidth again.

Edit:
Regarding the "thousands of networks is not scalable" issue: if you have a Postgres cluster with 1000 client connections, you would have the same amount of "networks", btw. The thing is, it does not really make much sense to go higher than 1000 with Postgres, because it usually starts to scale worse beyond ~500. With Hiqlite, you could spawn 2000 databases, which means 4000 network connections, while each of them would have the full throughput without any degradation. And you would still be left with ~60k ports you could use on that machine.
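The port budget above can be checked with a trivial sketch (numbers taken from the paragraph; nothing Hiqlite-specific, just the 16-bit port space):

```rust
// Rough port-budget math: 2000 databases with 2 network connections each
// out of the 65_535 possible TCP ports on one address.
const MAX_PORTS: u32 = 65_535;

fn ports_left(databases: u32, conns_per_db: u32) -> u32 {
    MAX_PORTS - databases * conns_per_db
}

fn main() {
    let used = 2_000 * 2;
    // 4000 connections used, ~61.5k ports remaining
    println!("connections: {used}, ports left: {}", ports_left(2_000, 2));
}
```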
Wow, interesting discussion, thanks for your insights!
It seems that you're more focused on performance and throughput. One of the key advantages of SQLite is its simplicity—it's just a file, which makes maintenance and deployment straightforward and cost-effective. So, I think maintainability and ease of use are also important factors.
With one database per tenant, we can fully isolate each tenant's data, making it easier to scale as needed (e.g., if only some tenants have large data volumes). This approach also simplifies tasks like exporting, migrating, and deleting data without affecting other tenants. However, if we need to manage thousands of configurations per application, it will add unnecessary complexity. For example: imagine trying to test a 3-node cluster on a laptop with thousands of databases; port conflicts could easily arise. Opening an SSL connection for each network seems like unnecessary overhead, having so many open ports could be a security risk, and troubleshooting could become more difficult if something goes wrong in the network/Raft layer, or with some wrong configuration. I'm aware that the Turso database has a namespace feature, which allows multiple databases on the same server while reusing the network. So my question is: why can't Hiqlite reuse the same network, and what are the concerns/trade-offs?
Regarding ORMs, I've never encountered significant performance issues, and they are unlikely to be the bottleneck in most cases. Using an ORM doesn't mean we need to learn new things if we don't want to try new features. For example, we have the JPA standard in Java, which is backward compatible (frameworks like Spring Data make it even simpler); it eliminates most of the boilerplate code and should suffice for most use cases. It also makes queries portable across different databases (e.g., if we want to switch to another DB in some cases), and it allows selective field retrieval, which avoids fetching entire records. Raw SQL requires different dialects for different databases (with variations in syntax, data types, etc.), involves more boilerplate code, and is more error-prone. With regard to SQLite, I think the performance impact caused by the ORM abstraction layer is negligible, especially when working with local read/write operations. While I'm not looking to debate ORMs further, it's clear that both approaches come with their own trade-offs.
One issue with dashboard.

Both succeeded:

```sql
create table test (id int)
insert into test values(1)
```

Caused panic:

```sql
select * from test;
```

```
2024-09-04T02:14:52.685322Z INFO post_query: hiqlite::dashboard::query: dashboard query:
select * from test;
thread 'tokio-runtime-worker' panicked at hiqlite/src/query/rows.rs:128:23:
internal error: entered unreachable code: unreachable column type: INT
[1] 2963 abort target/release/hiqlite serve
```

Querying directly from the db file, the record is there:

```
sqlite> .open hiqlite.db
sqlite> select * from test;
1
sqlite>
```
It seems that you're more focused on performance and throughput.
Yes, always. After I migrated a big Java project at my old job to Rust and was able to reduce the operational cost by ~95%, I am always focused on being as efficient as possible. I also often write code for IoT projects, which have very limited resources, so I even have to.
One of the key advantages of SQLite is its simplicity—it's just a file, which makes maintenance and deployment straightforward and cost-effective.
That's true, but unfortunately not anymore with Hiqlite or when using a Raft. I mean, it has other advantages like high availability and even more consistency, but you need more than 1 file now. Compared to other solutions you will have a way lower footprint though which keeps it very cost-effective.
So my question will be: why can't hiqlite reuse the same network, and what are the concerns/trade-offs?
You are right, and I get your point. It would also be technically pretty easy to do. For instance, I did this for the cache layer; you could take a look at the cache example. It is using an `enum` to differentiate between different caches. This could be done with a `String` of course, but an enum is, again, a lot more efficient. It makes it possible to use a simple `Vec<_>` for the caches under the hood and select by index, instead of hashing a String to get a value out of a HashMap with each single request.
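The enum-as-index idea can be sketched in plain Rust. The names below (`Cache`, `Caches`) are hypothetical and not Hiqlite's actual types; the point is only that the enum discriminant becomes a `Vec` index, so selecting a cache is a direct array access instead of hashing a `String`:

```rust
// Sketch: select a cache store by enum discriminant instead of a String key.
// Illustrative types only, not Hiqlite's real API.

#[derive(Clone, Copy)]
enum Cache {
    Sessions = 0,
    AuthCodes = 1,
}

struct Caches {
    // one store per enum variant, indexed by discriminant
    stores: Vec<Vec<(String, String)>>,
}

impl Caches {
    fn new() -> Self {
        Self { stores: vec![Vec::new(), Vec::new()] }
    }

    fn put(&mut self, cache: Cache, key: &str, value: &str) {
        // O(1) index instead of hashing a cache-name String
        self.stores[cache as usize].push((key.to_string(), value.to_string()));
    }

    fn get(&self, cache: Cache, key: &str) -> Option<&str> {
        self.stores[cache as usize]
            .iter()
            .find(|(k, _)| k.as_str() == key)
            .map(|(_, v)| v.as_str())
    }
}

fn main() {
    let mut caches = Caches::new();
    caches.put(Cache::Sessions, "user1", "token");
    assert_eq!(caches.get(Cache::Sessions, "user1"), Some("token"));
    assert_eq!(caches.get(Cache::AuthCodes, "user1"), None);
    println!("ok");
}
```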
I could do the same for databases, yes. The tradeoff would be a less ergonomic `hiqlite::Client`, because you would always need to specify the target database in each single function. In probably almost all situations where you would use Hiqlite, a single database is more than enough. This means that in all these situations, you would need to write more code and have less throughput and higher latency.

In the end, everything is possible, if you are okay with the tradeoffs.
It would also be possible to implement the State Machine logic twice, so you could do both - use it with a single DB or allow a dynamic creation and decision at runtime. This is quite a bit of work to implement properly though.
However, if we need to manage thousands of configurations per application (...)
True. But it would also make the whole deployment more secure by having real separation of database traffic. For such an approach, I would design the whole system in a different way. I would never run all of these from a single application or backend, but instead start a separate container for each one, when you want separation anyway. This would make resource usage and handling a lot easier. You could have limits on each instance and make sure that one database (customer) is not using all your memory, and so on.
For example: imagine trying to test a 3-node cluster on a laptop with thousands of databases—port conflicts could easily arise
Actually, I don't see it. Sure, conflicts could arise, but you have ~64,000 ports to choose from. When you just run it inside Docker, you can also fully encapsulate it from your host.
having so many open ports could be a security risk
Not at all. It would be the exact same risk as if you had one port open, because the application using it is the same anyway. It does not increase your attack surface in this case. It is even a benefit (for debugging as well), because you can take action very specifically if something bad is happening.
One issue with dashboard.
Oh that's bad. It seems that SQLite is not normalizing the column types. I will open an issue about it later and check for other types.
If you just specify `INTEGER` instead of `INT`, I guess it should be fine. Unfortunately, the column type under the hood cannot be extracted type-safely and must be parsed from a `String`. I will add additional type checks there.
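SQLite's documented type-affinity rules explain why `INT` and `INTEGER` should land in the same bucket: affinity is derived from substrings of the declared type. A parser along these lines (my own sketch, not the actual code in `hiqlite/src/query/rows.rs`) would avoid the unreachable branch:

```rust
// Sketch of SQLite's declared-type -> affinity rules
// (see "Datatypes In SQLite", section on type affinity).
// Illustrative only; not the Hiqlite implementation.

#[derive(Debug, PartialEq)]
enum Affinity {
    Integer,
    Text,
    Blob,
    Real,
    Numeric,
}

fn affinity(declared: &str) -> Affinity {
    let t = declared.to_ascii_uppercase();
    if t.contains("INT") {
        Affinity::Integer
    } else if t.contains("CHAR") || t.contains("CLOB") || t.contains("TEXT") {
        Affinity::Text
    } else if t.contains("BLOB") || t.is_empty() {
        Affinity::Blob
    } else if t.contains("REAL") || t.contains("FLOA") || t.contains("DOUB") {
        Affinity::Real
    } else {
        Affinity::Numeric
    }
}

fn main() {
    // `INT` and `INTEGER` normalize to the same affinity,
    // so the dashboard panic case would be handled.
    assert_eq!(affinity("INT"), Affinity::Integer);
    assert_eq!(affinity("INTEGER"), Affinity::Integer);
    assert_eq!(affinity("VARCHAR(255)"), Affinity::Text);
    println!("ok");
}
```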
reduce the operational cost by ~95%
Congrats 🎉, that's a big deal! I'm doing something similar by porting some Java projects to Rust.
If done like for the caches, it would be static, and the decision of how many databases to start would be made at compile time, not at runtime, for better speed and efficiency.
This can be useful. For example, we may want to have one DB to store data for internal components and another for customer data. It's certainly more performant than the dynamic approach.
If it would allow a dynamic creation, it would mean less throughput, higher latency, and a bit more complex code.
I guess it shouldn't have too much impact on performance. But this could be a useful feature. Is it possible to define some features like multi-static-db, multi-dynamic-db? So users can enable the features they need without affecting single database usage. But, of course, this would add complexity to the hiqlite implementation and require more effort.
I would never run all of these from a single application or backend, but instead start a separate container for each one, when you want separation anyway. This would make resource usage and handling a lot easier. You could have limits on each instance and make sure that one database (customer) is not using all your memory, and so on.
One container per customer could work, but it seems like overkill to me. It consumes unnecessary system resources, increases costs, complicates traffic routing, and is more complex to maintain. Separate storage is more ideal for my needs, since all the nodes in the cluster are identical, making them easier to manage.
Two follow-up questions:

Is there a way, provided out of the box, to know whether the automatic backup to S3 succeeded or not? It's critical to make sure we have the latest backup for disaster recovery. Maybe we need some metrics?

I did a test of auto-backup; is this the correct way?
Step 1. Config:

```
HQL_BACKUP_CRON="0/5 * * * * *"
HQL_S3_URL=https://s3.us-east-1.amazonaws.com
HQL_S3_BUCKET=test-bucket
HQL_S3_REGION=us-east-1
HQL_S3_KEY=xxxx
HQL_S3_SECRET=xxxx
```
Step 2. Start the server, and the backup is successfully pushed to S3 (auto-backup every 5s).

Step 3. Stop the server, delete the `~/.hiqlite/data` dir, and start the server again. I'm expecting it will automatically download the latest backup and restore it. But nothing is restored from S3, and new backups are pushed to S3.
Currently, there don’t seem to be any metrics emitted. How can we know if the cluster/server is working properly or is this supposed to be done at application level?
I guess it shouldn't have too much impact on performance. But this could be a useful feature. Is it possible to define some features like multi-static-db, multi-dynamic-db? So users can enable the features they need without affecting single database usage. But, of course, this would add complexity to the hiqlite implementation and require more effort.
That would be one solution, yes. The only other thing that came to my mind, which you must consider, is that when you run multiple DBs using the same Raft group, you would not be able to have independent backups. This is super critical. All DBs, even though split into multiple files and folders, must be treated as one big database in that case. All of them must be included in Raft snapshots, because they all share the same Raft logs. This also means that it would be impossible to apply a backup to only a single one of the DBs while leaving the others unaffected. You can then either apply a backup to all DBs at once or to none at all.
That's just how Raft works by design. To make it super short and leave out a lot of important stuff to keep it simple: the Raft Logs are stored in `rocksdb` currently for Hiqlite. The problem is that you can't write Logs until the end of time; you need to purge and clean up at some point. To achieve this, Raft will create snapshots of the data together with the information about what the latest applied Raft log ID is. When all nodes (!) have created this snapshot, the lowest common latest applied log ID will be the one up to which all members can clean up their Logs storage.
But if multiple State Machines share the same Raft Logs storage (and therefore the same network, which is created at the same time), they must all be included in these snapshots. This means they would not be treated as individual databases, but as one big one.
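The purge rule described above can be modeled in a few lines: each member reports the last log ID covered by its snapshot, and logs are only safe to purge up to the minimum of those. This is a deliberate simplification of openraft's actual bookkeeping, just to make the "all members must snapshot" point concrete:

```rust
// Simplified model of Raft log purging: logs can only be purged up to
// the lowest last-snapshotted log ID across all cluster members.
// Conceptual sketch, not openraft's implementation.

fn purge_upto(snapshot_log_ids: &[u64]) -> Option<u64> {
    snapshot_log_ids.iter().copied().min()
}

fn main() {
    // Three nodes have snapshotted up to logs 10_000, 9_800 and 10_000.
    let nodes = [10_000, 9_800, 10_000];
    // Only logs <= 9_800 may be purged; the slowest member dictates it.
    assert_eq!(purge_upto(&nodes), Some(9_800));
    println!("ok");
}
```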
I considered something like this just in case you end up with a performance bottleneck and are limited by disk I/O, but then quickly forgot about it, because it makes every part more complex and error-prone. Additionally, while the Raft is creating a snapshot, you will still be able to read from the State Machine, but writes will be paused until it has finished (which is usually very quick).
I did not have the time to write more docs about tuning and so on, but you can control the number of snapshots being created via `NodeConfig.raft_config.max_in_snapshot_log_to_keep`. There is a shortcut, `NodeConfig::default_raft_config(10_000)`, which accepts this number as its only argument. The default value is `10_000` right now, meaning that every 10,000 logs, Hiqlite will trigger a new snapshot creation and purge logs up until this point.
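To illustrate what that threshold does, here is a toy model of the trigger (my own sketch; openraft's real policy has more conditions than a bare log count): after every `threshold` applied logs, a snapshot is taken and the covered logs are purged.

```rust
// Toy model of the snapshot trigger described above: after every
// `threshold` applied logs, a snapshot is created and older logs purged.
// Purely illustrative, not openraft's actual logic.

struct LogStore {
    threshold: u64,
    last_applied: u64,
    purged_upto: u64,
    snapshots: u64,
}

impl LogStore {
    fn new(threshold: u64) -> Self {
        Self { threshold, last_applied: 0, purged_upto: 0, snapshots: 0 }
    }

    fn apply(&mut self, n: u64) {
        for _ in 0..n {
            self.last_applied += 1;
            if self.last_applied - self.purged_upto >= self.threshold {
                // snapshot the state machine, then purge the covered logs
                self.snapshots += 1;
                self.purged_upto = self.last_applied;
            }
        }
    }
}

fn main() {
    let mut store = LogStore::new(10_000);
    store.apply(25_000);
    // snapshots triggered at logs 10_000 and 20_000; 5_000 logs still pending
    assert_eq!(store.snapshots, 2);
    assert_eq!(store.purged_upto, 20_000);
    println!("ok");
}
```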
So I guess this makes the whole idea of sharing one Raft group across all "customer" databases a really bad idea anyway. It's just how Raft works.
Is there a way, provided out of box to know if the automatic backup in S3 succeeded or not (..)
There is no alerting or metrics yet, no. The upload process will run in a separate async task on the side. It will retry a few times but give up if it fails multiple times. You would see it in the logs, but not in any metrics so far. This needs to be added in the future.
I'm expecting it will automatically download the latest backup and restore it. But nothing is restored from S3, and new backups are pushed to S3.
That's not how it works. It would not even be able to detect this automatically, because, how should it know that you deleted something and it should pull a backup? If you have for instance a 3 node cluster, kill one of them, delete the whole volume and restart it, then it will fetch all missing data from the other running nodes. That's exactly what the above mentioned snapshots are for.
S3 backups are really for disaster recovery only. If just a volume is corrupted, it will not even bother reaching out to S3 and sync from the current leader node.
Currently, there don’t seem to be any metrics emitted. How can we know if the cluster/server is working properly or is this supposed to be done at application level?
You can use `hiqlite::Client::is_healthy_db()` for programmatic information. You can also use `hiqlite::Client::metrics_db()` to get full Raft metrics. The same functions exist with `*_cache` for the 2nd cache Raft network, if this feature is enabled.

Or you can use the `/health` endpoints on both servers that Hiqlite starts to make sure everything is healthy, just like shown in the Kubernetes `StatefulSet` example definition for the `livenessProbe`.
Thanks for the detailed explanation! Yeah, sharing the same Raft group would be a problem. It would need multi-Raft-group support, which is not available in openraft I guess, but that would make the system even more complex.
If you have for instance a 3 node cluster, kill one of them, delete the whole volume and restart it, then it will fetch all missing data from the other running nodes.
This makes sense. Could you please expand the backup section to include more details on how to recover if the cluster loses quorum or if the entire cluster is lost?
I was just playing with it, and I can execute queries from the dashboard. Awesome work!
Question: How can a client connect to the server and run queries? Can I just access the db file directly at `~/.hiqlite/data/state_machine/db/hiqlite.db`, or do I need to embed it in my own application like in this sqlite-only example?