jackc / pgx

PostgreSQL driver and toolkit for Go
MIT License

storing a string Go value into bytea PostgreSQL column works, but reading it back fails #1977

Open mitar opened 7 months ago

mitar commented 7 months ago

Is your feature request related to a problem? Please describe.

With recent Go improvements that allow converting between string and []byte without copying, I would like storing strings into a bytea column to work transparently, with as little copying as possible.

I am writing a key-value store on top of PostgreSQL and I would like to simplify storing built-in strings so that users of my library do not have to convert them to []byte.

Describe the solution you'd like

It looks like storing a string Go value works already, but reading back fails with:

cannot scan bytea (OID 17) in binary format into *string

I do not see much reason why scanning bytea into a string would not just work for standard types with pgx. Go allows conversion between string and []byte with a simple cast, so I would hope that storing a string into the database and reading it back would work the same way.

Describe alternatives you've considered

My understanding is that I could register my own type codec, but I could not find much information about how to extend an existing codec. I would like the default bytea handling to stay the same, only that strings are additionally scanned without copying.

jackc commented 6 months ago

I suspect there are unresolvable ambiguities regarding string <> bytea.

Writing the string to a PostgreSQL bytea at least appears to work because pgx sends any string to PostgreSQL in the text format without any conversion. pgx does the same thing when reading. Any text formatted value can be scanned directly into a string without any conversion.

But what happens when you read the value? In the text format, PostgreSQL will return the value hex encoded, but in binary you will get back the raw bytes.

For example: if you write the 4 bytes jack to a bytea you will get back the ten bytes \x6a61636b in the text format but the original 4 bytes in binary format.

Even working directly in psql it can be confusing dealing with strings in bytea. I recommend that you instead use text or varchar if possible.

My understanding is that I could register my own type codec, but I could not find much information about how to extend an existing codec. I would like the default bytea handling to stay the same, only that strings are additionally scanned without copying.

If you really need/want to do this, you probably don't have to create a new Codec. I think you would only need to create your own type that implements pgtype.BytesScanner and pgtype.BytesValuer.
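A minimal sketch of such a type, assuming pgx/v5's pgtype.BytesScanner (`ScanBytes(v []byte) error`) and pgtype.BytesValuer (`BytesValue() ([]byte, error)`) interfaces; the name ByteaString is hypothetical:

```go
package main

import "fmt"

// ByteaString is a hypothetical wrapper that structurally satisfies
// pgx/v5's pgtype.BytesScanner and pgtype.BytesValuer interfaces,
// so it can be scanned from and encoded to a bytea column.
type ByteaString string

// ScanBytes copies v, because pgx reuses the read buffer between rows.
func (s *ByteaString) ScanBytes(v []byte) error {
	*s = ByteaString(v)
	return nil
}

// BytesValue returns the raw bytes to send to PostgreSQL.
func (s ByteaString) BytesValue() ([]byte, error) {
	return []byte(s), nil
}

func main() {
	var s ByteaString
	_ = s.ScanBytes([]byte("jack"))
	fmt.Println(s) // jack
}
```

Since ByteaString's underlying type is string, call sites can stay close to plain strings with a cast like `row.Scan((*ByteaString)(&myString))`.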

mitar commented 6 months ago

Thanks for the response.

If you really need/want to do this, you probably don't have to create a new Codec. I think you would only need to create your own type that implements pgtype.BytesScanner and pgtype.BytesValuer.

But that would require one to scan into this custom type, not string itself.

For string itself, it seems like I would have to unregister ByteaCodec and register a custom one? Am I mistaken? Or is there an easier way? Maybe just using a TryWrapScanPlanFuncs?

mitar commented 6 months ago

I am just thinking that since Go 1.20, there is a cheap way to convert between string and []byte:

// Both helpers rely on the "unsafe" package (Go 1.20+).

func String2ByteSlice(str string) []byte {
    return unsafe.Slice(unsafe.StringData(str), len(str))
}

func ByteSlice2String(bs []byte) string {
    if len(bs) == 0 {
        return ""
    }
    return unsafe.String(unsafe.SliceData(bs), len(bs))
}

So maybe pgx could just simply convert between them as necessary.

jackc commented 6 months ago

For string itself, it seems like I would have to unregister ByteaCodec and register a custom one? Am I mistaken? Or is there an easier way? Maybe just using a TryWrapScanPlanFuncs?

Yeah, either of these is a possibility, though you will still have the ambiguity of text vs. binary protocol.

I am just thinking that since Go 1.20, there is cheap way to convert between string and []byte:

I don't really see where these could be safely used. The read buffer is reused for every row. The unsafely constructed string would be corrupted by future rows.

mitar commented 6 months ago

I don't really see where these could be safely used. The read buffer is reused for every row. The unsafely constructed string would be corrupted by future rows.

Oh, you are right, for scanning it does not work.

vphatfla commented 1 month ago

Little late to the party, but this works for me @mitar

var pw []byte
err := DBPool.QueryRow(context.Background(), fmt.Sprintf(query, user_table_quoted), email).Scan(&pw)
return string(pw), err

mitar commented 1 month ago

@vphatfla Yes, I would like that it works for Go string values, not Go []byte values (which already works).

vphatfla commented 1 month ago

@mitar, oh I see. I'm a noob here; might I ask how expensive it is to repeatedly convert from []byte to string and vice versa?

mitar commented 1 month ago

With recent changes in Go it is very cheap. But the issue here is about ergonomics: if pgx does that internally, then it can do it a) efficiently and b) without the developer having to do it manually every time.

vphatfla commented 1 month ago

But doesn't it make more sense to force developers to convert the string to []byte before saving to Postgres instead? My thinking is that the data type in Postgres is BYTEA, so logically it should expect []byte to be saved and retrieved rather than string. Doing it this way might help new developers have a clearer view of the system and thus make fewer bugs in the future.

mitar commented 1 month ago

Sure. This is a design question for the library: is storing a string into bytea an error or a feature? This is why the issue exists, so that the community and maintainers can discuss the design here.

My view is that I would prefer that this be done by the library automatically. Generally, developers should have a good understanding of what the column type is and what data type they are storing, and the library should try to accommodate that and make it easy (especially now that it is possible to do efficiently) to keep boilerplate and repeated code down.

jackc commented 1 month ago

My view is that I would prefer that this be done by the library automatically. Generally, developers should have a good understanding of what the column type is and what data type they are storing, and the library should try to accommodate that and make it easy (especially now that it is possible to do efficiently) to keep boilerplate and repeated code down.

I agree, if it can be done safely. But as I mentioned above (https://github.com/jackc/pgx/issues/1977#issuecomment-2053856959), I don't know how it can be done without the possibility of silently scanning the data incorrectly.