rqlite / gorqlite

A Go client for rqlite, the distributed database built on SQLite
MIT License

The correct & most efficient way to read/write binary data in rqlite #40

Closed axllent closed 3 months ago

axllent commented 3 months ago

This is a question rather than an issue with the library and/or server. I am the author of Mailpit and am experimenting with using rqlite as an optional storage backend for message storage.

Long story short, Mailpit keeps a compressed (zstd) copy of every email in a table. Using SQLite I simply insert a string representation of the compressed data (and do the reverse for reading). However, this does not work with rqlite, and I suspect it has to do with the JSON encoding used for the HTTP requests.

Encoding the data with base64 (base64.StdEncoding.EncodeToString(<bytes>)) works fine in rqlite, but it is inefficient: the whole point of the zstd compression is to save space, and base64 inflates the stored data again.

Do you have any suggestions for the best approach to minimise both the bandwidth and the table storage size of the binary data while still working with rqlite (ie: what encoding should I be using here if a plain string won't work)?

Thanks!

otoolep commented 3 months ago

however this does not work with rqlite and I suspect it has to do with the JSON encoding for the HTTP requests.

What do you mean it "doesn't work"? Can you be more specific, via example curl requests to the API?

but it is inefficient as the whole point of the zstd compression is to save space but base64 undoes that.

What do you mean "base64 undoes that"? Obviously base64 doesn't decompress the data. Do you mean encoding as base64 increases the size of the presentation of the data again, due to base64 encoding?

Have you reviewed this section of the API docs?

https://rqlite.io/docs/api/api/#blob-data

axllent commented 3 months ago

Sorry @otoolep , I should have been more specific. I am struggling to store & return binary data using the SQL driver (github.com/rqlite/gorqlite/stdlib), not curl.

If I store the data in the usual manner as per the link I referred to, I get an "invalid input: magic number mismatch" error from zstd when decoding the returned data (due to byte differences between what was sent and what was received). fmt.Println() of the data before saving and after returning looks identical, but the byte representations of the two actually differ, which is what the zstd decoder fails on.

Thanks for the API blob data reference; however, the problem here is really that I can't see how to apply this syntax when using the sql driver itself. In the following example, stringOfBinary is a string representation of the binary data generated by zstd.Encoder (without the X'' syntax). How should I go about adding the X'' syntax to this example?

db.Exec("INSERT INTO blobtest(id, data) values(?,?)", 1, stringOfBinary)

To answer your last question: yes, I can get it to work if I send a plain base64-encoded string to rqlite, but then all I am effectively doing is storing a plain-text string in the database, which loses the benefit of the compression because base64 hugely inflates the data size. I'm not overly concerned if the JSON data from rqlite is returned as base64, but I want to store the binary zstd data in the database's blob field and not a base64-encoded version of it.

Does that help explain my problem better?

axllent commented 3 months ago

I have finally found the answer, or at least a working solution which I assume is the correct approach:

hexData := hex.EncodeToString(<byte-slice-of-binary-data>)
db.Exec("INSERT INTO blobtest(id, data) values(?, x'" + hexData + "')", 1)

This just feels wrong, though, because it effectively bypasses the placeholder substitution and injects the data directly into the query string. However, no matter what I tried, I couldn't assign the x'<hex>' value via a ? placeholder, so this must be the "correct" way. I am aware that hexadecimal encoding is SQL-safe (ie: it can't contain values that would break the query).

I assume this approach is the correct Go-approach to saving binary data?

To save others some time and/or confusion, I would suggest a bit more info in the BLOB section of the docs to explain that the deadbeef is actually supposed to be the hexadecimal encoding of the binary data, wrapped in x'...'. I incorrectly assumed "deadbeef" was just a word you were trying to save in the BLOB field which threw me off. It also isn't very clear in the official SQLite docs but that's not your fault or problem.

otoolep commented 3 months ago

So I don't know this client library in depth, as I didn't write it, though I've made an occasional change to it. Personally I don't see anything wrong with what you're doing. I'm not sure if it's relevant, but rqlite doesn't support the concept of "prepared statement" (the HTTP API doesn't support it) so what you're seeing may just be that fact reflected in this client library implementation.

However I don't see any issue with what you're doing now. It should generate the correct HTTP request.

I'll update the docs as you suggest -- thanks.

axllent commented 3 months ago

Sweet, thanks for the feedback, and updating the documentation (hopefully helps anyone else in the same situation).

Sorry, "prepared statement" was the wrong terminology; I actually meant the ? placeholder substitution (eg: db.Exec("INSERT INTO blaah (val1, val2) VALUES (?, ?)", 1, str)).

The good news is I'm making good progress with my implementation trial, and whilst there is obviously a performance decrease (compared to direct local SQLite access) due to the HTTP layer, it's still very fast.

I'll close this as you've answered my question. Thanks again!

otoolep commented 3 months ago

@axllent -- thinking more about this, I wonder if I should improve the rqlite API.

Basically, if data in the following Go code is a byte slice, then just do the right thing.

db.Exec("INSERT INTO blobtest(id, data) values(?,?)", 1, data)

This may require both changes to the rqlite HTTP API and client library, but I think this would be useful to folks.

axllent commented 3 months ago

@otoolep, well that's the thing: I don't believe a byte slice works natively anyway. Previously in my code data was a string (string(data)) to get it to work with the sql driver. This worked flawlessly with a local SQLite database: "binary string" in, "binary string" out... I just needed to convert the string back to a byte slice and I got a 1:1 match of the original data.

The issue with your implementation (I am fairly sure, anyway) is that some of the string encoding is getting lost in the JSON encoding/decoding. I don't even think rqlite recognised the incoming data as binary, because the returned data definitely wasn't base64-encoded (from just trying it). On screen (fmt.Println) the "binary string" looked the same (including weird characters etc), but when one compared the byte slices there definitely were differences between the original string and what was returned, which is why zstd was failing.

The alternative approach (x'') works for both implementations (native vs rqlite). It also generates data which is 100% identical to my original string(byte slice) implementation (native), so I don't have any issues with existing native data. The only difference now between the two implementations is that the binary data returned by rqlite is base64-encoded, which is acceptable, as binary and JSON don't play nicely.

So long story short, I think that what you have is the best you can get given the json/http transport.

I still have one unrelated question, and one strange issue I'm facing, I'm just not sure where they should go (ie: which repo's issues to use), so I'll just ask here and you can hopefully redirect me if you feel they need their own separate issues (and where):

  1. Is HTTP compression (gzip) used for the HTTP transport, and if not, then why? I didn't see any related headers when using curl, but I admit I haven't looked much into it.
  2. I can't seem to execute a plain VACUUM command via the rqlite SQL driver (I get a "statement error" every time). Via curl and the rqlite CLI client it works fine, but for reasons I cannot work out, the SQL driver just returns an error. Any ideas?
otoolep commented 3 months ago

Is http compression (gzip) used for the http transport, and if not then why? I didn't see any relating headers when using curl, but I admit I haven't looked much into it.

You mean in this client library? I do not know; I would have to check. I also do not know whether the Go HTTP server running inside rqlite supports it. Maybe it just does it automatically.

I can't seem to execute a plain VACUUM command via the rqlite SQL driver (I get a "statement error" every time). Via curl and the rqlite cli client it works fine, but for reasons I cannot work out, the SQL driver just returned an error. Any ideas?

I don't; I would need you to show me some sample code, such that I could run it for myself. That said, I suggest you get familiar with the HTTP API itself via curl commands. If you hit an issue using the client library, try to have a VACUUM work via a curl command first, to ensure it works (VACUUM does work -- I've tested it myself).

https://www.philipotoole.com/rqlite-8170-8180-vacuum-jsonb/

otoolep commented 3 months ago

Also see https://rqlite.io/docs/guides/performance/#vacuum

axllent commented 3 months ago

Is HTTP compression (gzip) used for the HTTP transport, and if not, then why? I didn't see any related headers when using curl, but I admit I haven't looked much into it.

You mean in this client library? I do not know; I would have to check. I also do not know whether the Go HTTP server running inside rqlite supports it. Maybe it just does it automatically.

Yes, I mean the client (Go in my case). Assuming the server supports gzip compression (I'm not entirely sure it does), I believe compression is not being used by the client (see https://github.com/rqlite/gorqlite/blob/master/api.go). Gzip compression is usually requested by the client via a header, and depending on that, the server then conditionally compresses the response.

For Mailpit I had to write a specific HTTP middleware to conditionally add gzip compression to all HTTP requests (if requested by the client).

Anyway, it's worth asking, because if there isn't compression then adding it could greatly improve network speeds (and reduce traffic), especially with larger payloads.

I can't seem to execute a plain VACUUM command via the rqlite SQL driver (I get a "statement error" every time). Via curl and the rqlite cli client it works fine, but for reasons I cannot work out, the SQL driver just returned an error. Any ideas?

I don't; I would need you to show me some sample code, such that I could run it for myself. That said, I suggest you get familiar with the HTTP API itself via curl commands. If you hit an issue using the client library, try to have a VACUUM work via a curl command first, to ensure it works (VACUUM does work -- I've tested it myself).

https://www.philipotoole.com/rqlite-8170-8180-vacuum-jsonb/

Thanks. I'm aware of auto-scheduling VACUUM, but what I am referring to here is manually invoking a VACUUM by the client. The rqlite client works as expected, the curl request works as expected, but the Go sql client returns an error:

Curl:

curl -XPOST 'localhost:4001/db/execute?pretty' -H "Content-Type: application/json" -d '["VACUUM"]'
{
    "results": [
        {}
    ]
}

rqlite cli:

./rqlite
Welcome to the rqlite CLI.
Enter ".help" for usage hints.
Connected to http://127.0.0.1:4001 running version v8.23.0
127.0.0.1:4001> VACUUM;
0 row affected
127.0.0.1:4001>

Go sql client:

package main

import (
    "database/sql"

    _ "github.com/rqlite/gorqlite/stdlib"
)

func main() {
    db, err := sql.Open("rqlite", "http://localhost:4001")
    if err != nil {
        panic(err)
    }

    _, err = db.Exec("VACUUM")
    if err != nil {
        panic(err)
    }
}
go run .
panic: there were 1 statement errors

goroutine 1 [running]:
main.main()
        /home/ralph/rqlite-vacuum/main.go:17 +0x85
exit status 2
otoolep commented 3 months ago

I dug into the client library, and this is what is happening (the logged response should make it clearer):

sections map[results:[map[error:cannot VACUUM from within a transaction]] time:0.015425357]

otoolep commented 3 months ago

Clearly it's something in the client library, which is probably wrapping the statement in a transaction when it doesn't need to.

otoolep commented 3 months ago

Try doing this on your connection object (I don't like the choices made by the client library, but it is what it is for now) -- unset the Transaction flag.

func (conn *Connection) SetExecutionWithTransaction(state bool) error {
        if conn.hasBeenClosed {
                return ErrClosed
        }
        conn.wantsTransactions = state
        return nil
}
axllent commented 3 months ago

@otoolep I do appreciate you looking into this, but I think this is just getting a bit too "hacky" for my liking, and it is something I can simply ignore and require a user running rqlite to set up auto-vacuuming if they want it.

Would this even work properly in rqlite, given that VACUUM can take minutes to complete? Wouldn't the rqlite sql client potentially just time out during this? Anyway, thanks again for your help, and apologies that this all unfolded within the original thread. I'll leave the HTTP compression thought with you for now, as I really suspect the gorqlite client doesn't request compression from the rqlite server.

Edit: actually, I'm pretty sure there is no HTTP compression, at least not when accessing the /status API via a browser: the response is missing the Content-Encoding: gzip header. (Screenshot attached.)