Unpacked string bytes are null-padded, which affects string comparisons

jblackman commented 8 years ago

This is an interesting side-effect. If an encoded character array is shorter than the field length, it will be padded out with byte(0), as you would expect. When you unpack into a string, these extra bytes make it into the string variable's underlying byte array. As a result, a comparison of this string with a string literal will return false.

For example:

package main

import (
    "bytes"
    "fmt"
    "github.com/lunixbochs/struc"
)

type Test struct {
    Str string `struc:"[10]byte"`
}

func main() {
    payload := "Foo"
    payloadSlice := make([]byte, 10)
    copy(payloadSlice, []byte(payload))
    buf := bytes.NewBuffer(payloadSlice)
    t := &Test{}
    if err := struc.Unpack(buf, t); err != nil {
        fmt.Println("Unpack failed:", err)
        return
    }
    if t.Str == "Foo" {
        fmt.Println("Comparison is OK")
    } else {
        fmt.Println(t.Str, "does not equal Foo")
        fmt.Println("t.Str = ", []byte(t.Str))
    }
}

If you run this, you will get:

Foo does not equal Foo
t.Str =  [70 111 111 0 0 0 0 0 0 0]

So, is it worth fixing this? IMO, yes. Hunting into field.go, we could alter this line:

val.SetString(string(buf))

perhaps to:

val.SetString( bytes.Trim(buf, "\x00") )

or, if that will add too much overhead, loop to the first null and use a truncated slice:

idx := bytes.IndexByte(buf, 0)
if idx > length {
    idx = length
}
val.SetString(buf[:idx])

wdyt?

lunixbochs commented 8 years ago

duplicate of #31

you could want to unpack a value into a string which legitimately contains null characters.

jblackman commented 8 years ago

A fairly edge case, though - wouldn't you generally be unpacking that into a byte array?

lunixbochs commented 8 years ago

Strings can be slightly easier to work with. The main difference in Go is mutability. We shouldn't force C's shortcomings on every application - someone might want to, say, unpack then decode utf16 data, which wouldn't work if I stopped at the first null byte.

jblackman commented 8 years ago

A very good point.

jchv commented 8 years ago

Wait, that argument doesn't really hold. UTF-16 would very logically be []uint16 - and if you were to ignore that fact, you'd be in for some pain when trying to use I.E. unicode/utf16. Further, while it is explicitly true that Go strings are not defined to be UTF-8 data, using them to hold arbitrary data very seldom makes sense; as you probably know, performing a range over a string actually pulls one codepoint at a time, making the assumption that the string is in fact UTF-8.

But I don't think that's the crux of the problem here. I think we need to think about this statement more:

We shouldn't force C's shortcomings on every application

On its own, true: that wouldn't make any sense to do so. However, it's worth noting that this is not that. Supporting C-style strings using the string type would be at most an idiosyncrasy rather than a limitation, given that otherwise []byte would work fine. On the other hand, I think there's a compelling argument for why it should, and it's in the title of the library and first line of the README:

Struc exists to pack and unpack C-style structures from bytes

And I'd argue C is the primary reason why we're jamming null-terminated strings into fixed-size arrays, making it compelling to support the case better.

At the same time, I also would shy away from making this sort of magic happen automatically, especially since people already use the library. Instead, I am in favor of supporting a flag that would turn on this behavior.

Disclaimer and shameless plug: I developed a similar library that behaves the same in this case. It does not support a flag for this either, as of writing.

jblackman commented 8 years ago

In general, the application that originated the structure will be quite prescriptive about the encoding of the string/byte array. It also happens (has happened to me) that the rule may change (as they choose to adopt a more recent encoding standard rather than chucking the whole mess away and using JSON). I would be more inclined to introduce an "encoding" tag of some sort to manage the transition, and give the developer some sugar. For example:

 Str string `struc:"[100]byte,utf16"`

which would say that you're receiving 100 bytes, to be interpreted as UTF-16 (and null padding stripped accordingly). The default could then be 8-bit ASCII and null padding auto-stripped.

lunixbochs / struc

Unpacked string bytes are null-padded, which affects string comparisons #42