[]byte encoded in a Text field does not decode

kisielk commented 12 years ago

If you insert a []byte in a Text field in a database, querying the same field later returns the encoded version. I guess this is because decode() doesn't deal with t_text at all. I noticed that when encoding values of type string this is not a problem, but I would prefer not to have to convert all my []byte to string in all cases.

I'm not sure what the best way to solve this problem would be or if this is just working as designed.

bmizerany commented 12 years ago

Can you submit a failing test on a topic branch?

On Fri, Nov 2, 2012 at 2:19 PM, Kamil Kisiel notifications@github.comwrote:

If you insert a []byte in a Text field in a database, querying the same field later returns the encoded version. I guess this is because decode() doesn't deal with t_text at all. I noticed that when encoding values of type string this is not a problem, but I would prefer not to have to convert all my []byte to string in all cases.

I'm not sure what the best way to solve this problem would be or if this is just working as designed.

— Reply to this email directly or view it on GitHubhttps://github.com/bmizerany/pq/issues/53.

kisielk commented 12 years ago

Can do. However, I'm having this initial problem when running the tests:

--- FAIL: TestEncodeDecode (0.00 seconds) conn_test.go:172: pq: R:"report_invalid_encoding" M:"invalid byte sequence for encoding \"UTF8\": 0x00" L:"1780" F:"wchar.c" H:"This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by \"client_encoding\"." S:"ERROR" C:"22021" --- FAIL: TestErrorOnQuery (0.00 seconds) conn_test.go:314: pq: S:"ERROR" L:"907" P:"1" R:"base_yyerror" C:"42601" F:"scan.l" M:"syntax error at or near \"DO\"" FAIL exit status 1 FAIL github.com/bmizerany/pq 0.130s

bmizerany commented 12 years ago

@fdr, can you help here?

On Fri, Nov 2, 2012 at 2:46 PM, Kamil Kisiel notifications@github.comwrote:

Can do. However, I'm having this initial problem when running the tests:

--- FAIL: TestEncodeDecode (0.00 seconds) conn_test.go:172: pq: R:"report_invalid_encoding" M:"invalid byte sequence for encoding \"UTF8\": 0x00" L:"1780" F:"wchar.c" H:"This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by \"client_encoding\"." S:"ERROR" C:"22021" --- FAIL: TestErrorOnQuery (0.00 seconds) conn_test.go:314: pq: S:"ERROR" L:"907" P:"1" R:"base_yyerror" C:"42601" F:"scan.l" M:"syntax error at or near \"DO\"" FAIL exit status 1 FAIL github.com/bmizerany/pq 0.130s

— Reply to this email directly or view it on GitHubhttps://github.com/bmizerany/pq/issues/53#issuecomment-10031333.

fdr commented 11 years ago

@kisielk Hello. A bit late, but better than never. I believe these tests rely on 9.1 or 9.2+, when the DO statement was introduced. Knowing that, would you be so kind as to try to write your reproduction again?

kisielk commented 11 years ago

Thanks @fdr , it does indeed work with Postgres 9.1

@bmizerany I've added a failing test case at https://github.com/kisielk/pq/tree/bytes

Please have a look.

fdr commented 11 years ago

Okay, I understand now....

So the issue is that pq notes only the Go dynamic data type when deciding how to do escaping for the purposes of ingress to the server: it doesn't actually make sure the target parameter is the relevant data type that can receive that escaped representation, so if the determined encoding just-so-happens to be compatible with what the actual data type in the server is, it will happily insert the representation intended for some other data type. However, when doing the decoding, the opposite is true: it only cares about the Postgres reported data type, since that's the only information it has to work with.

I think the surprise factor here is that data intended to be formatted for bytea has instead wound up in a text column. It is not correct to do that: I think the right response is to signal an error immediately if one tries to cram a Go []byte into a Postgres text. What do you think of that?

Doing that will require some complexity that I will investigate the feasibility of. What I have in mind is making the encoding function more complicated to be able to verify against the inferred parameter data types returned from the server after the response to the Parse message.

fdr commented 11 years ago

Also, thank you very much for the nicely bundled test case. It made figuring this out so much easier.

kisielk commented 11 years ago

I'm not entirely convinced an error is the best solution.

In my case what I was trying to do is read text from a file and insert it in to a text column in the database. So I was using io.ReadFull to get the contents of the file in to a []byte and then insert the slice. Right now I have to perform an extra conversion to a string, which would be nice to avoid as it requires creating another copy of the data just to be able to pass it to the Exec function which will then just copy it again when it converts it to []byte in encode().

Would it not be possible to just pass the []byte without the additional base-16 string encoding if the target is a text field?

fdr commented 11 years ago

I'm not entirely convinced an error is the best solution.

Would it not be possible to just pass the []byte without the additional base-16 string encoding if the target is a text field?

Upon more meditation, I think you are right. But some of the details need investigating, but it shouldn't take too long (not that I'm committing to do the work myself Real Soon) to hash out some test cases to do so...

In particular, I'm interested in what Postgres thinks of bound parameters of strings that contain NUL bytes. I have did some research and I'm more charitable to not raising an error in this case, since Go's string has nothing to do with being human-readable text (i.e. actually encoded in any way in particular) in any felicitous way. I also think that given that Postgres does support many encodings and even switching the default interpretation in a session (but not the storage format, which is bound at initdb time) that the protocol has to robust to unexpected NUL bytes appearing (e.g. UCS-2 coming in when UTF-8 was expected).

A quick exposition via psql showing that Postgres disallows NULs when inserted via escaping:

fdr=> SELECT E'\u0040'::text;
 text 
------
 @
(1 row)

fdr=> SELECT E'\u0000'::text;
ERROR:  invalid Unicode escape value at or near "E'\u0000"
LINE 1: SELECT E'\u0000'::text;
               ^

Examples of playing with NUL-byte infected bytestrings in Go:

http://play.golang.org/p/CbPXJCDKRZ

package main

import "fmt"

func main() {
    var foo [3]byte
    foo[0] = 'b'
    foo[1] = 0
    foo[2] = 'e'
    fmt.Printf("%v\n", foo[:])
    fmt.Printf("%v\n", string(foo[:]))
    fmt.Printf("%v\n", []byte(string(foo[:])))

    for _, r := range(string(foo[:])) {
        fmt.Printf("%T %q ", r, r)
    }
    fmt.Println()
}

The mechanical thorn of how to fix this remains somewhat similar, which is noticing the target type in addition to the run-time Go type.

fdr commented 11 years ago

So I think without some more enhancements (or a counter-intuitive example) this looks to present a problem:

func (st *stmt) exec(v []driver.Value) {
    w := newWriteBuf('B')
    w.string("")
    w.string(st.name)
    w.int16(0) // <---- This
    w.int16(len(v))
    for _, x := range v {
        if x == nil {
            w.int32(-1)
        } else {
            b := encode(x)
            w.int32(len(b))
            w.bytes(b)
        }
    }
    w.int16(0)
    st.cn.send(w)
        ....

As far as I can tell, it looks like pq only uses text representation, and the input function for text uses a cstring as the wire representation, i.e. one must escape NUL bytes. Otherwise, SQL injection attacks via NUL byte injection could arise. Interestingly, this also seems to apply to textrecv, which would be used with the binary representation.

Thus, I think we're stuck traversing the []byte and string for NUL bytes. It's unclear to me right now if this could be used for NUL byte injection attacks. The following is a very contrived example:

  SELECT * FROM my_secret_data WHERE thing LIKE $1;

Where $1 is bound to: user-controlled-prefix%controlled-by-me-suffix%

With a nul byte injection, one could write:

Where $1 is bound to: %\x00%controlled-by-me-suffix% which is effectively like saying:

  SELECT * FROM my_secret_data WHERE thing LIKE 'whateverIwant%';

fdr commented 11 years ago

I think this may be prevented by some of the server's validation code, but it's not obvious that it complains about cstrings that have surplus length (and are smaller than that length) in their parameter header (looking at pqformat). Will have to try it...

fdr commented 11 years ago

Nope. Looks like things are okay. 48e71d69012454ddfee72fb374ee9eee187675fc

So...moving back to enhancements rather than security exposure...

fdr commented 11 years ago

A cut of that committed. Please review it. https://github.com/bmizerany/pq/commit/6235e1b99dd9fc489a92596e9a524ea6775c08a9

kisielk commented 11 years ago

Great, thanks for the help in resolving this :)

lib / pq

[]byte encoded in a Text field does not decode #53