phiryll / lexy

Lexicographical Byte Order Encodings
MIT License
0 stars 0 forks source link

Explicitly define Codec.Read error behavior details, ensure implementions comply #31

Closed phiryll closed 2 months ago

phiryll commented 3 months ago

Update 2

Added prefixes for nilsFirst/Last, which simplifies a lot of this, because an encoded value can never be 0 bytes.

The gist (documented on Codec.Read) is:

Read will read from r until either it has all the data it needs, or r stops returning data. r.Read is permitted to return only immediately available data instead of waiting for more. This may cause an error, or it may silently return incomplete data, depending on this Codec's implementation. Implementations of Read should never knowingly return incomplete data.

If the returned error is non-nil, including io.EOF, the returned value should be discarded. Read will only return io.EOF if r returned io.EOF and no bytes were read. Read will return io.ErrUnexpectedEOF if r returned io.EOF and a complete value was not successfully read.

Update

Codec.Read semantics: If error is non-nil, including io.EOF, the value is meaningless. A non-zero value may be returned for some aggregate types, but it should not generally be trusted. Read should return io.EOF only if the underlying Reader returned io.EOF and no bytes were read. Therefore, Codec implementations should ignore io.EOF if a value was successfully read. If an io.EOF happens after reading an incomplete value, Read should return io.ErrUnexpectedEOF.

Added byte prefixes for empty (0x03) and non-empty (0x04) values. 0x02 is reserved for nil, but encoding nil as zero bytes is working so far. These prefix values ensure that nil < empty < non-empty for encoded values. There should be no encoded bytes for an empty value following the empty prefix.

This does mean the underling io.Reader will return io.EOF if reading a valid encoded nil. For types which can be nil, Codec.Read should return no error in this case. Therefore, the encoding for a variable number of possibly-nil values must be delimited in some way.

The below table summarizes Codec.Read behavior for types using empty/non-empty prefixes.

nil     | empty   | prefix read | io.Read || 
allowed | allowed | by io.Read  | error   || Codec.Read returns
--------|---------|-------------|---------||--------------------
yes     | -       | 0 bytes     | nil     || nil, nil
yes     | -       | 0 bytes     | io.EOF  || nil, nil
yes     | -       | 0 bytes     | other   || nil, err
--------|---------|-------------|---------||--------------------
no      | -       | 0 bytes     | nil     || zero, io.ErrUnexpectedEOF
no      | -       | 0 bytes     | io.EOF  || zero, io.ErrUnexpectedEOF
no      | -       | 0 bytes     | other   || zero, err
--------|---------|-------------|---------||--------------------
-       | yes     | empty       | nil     || empty, nil
-       | yes     | empty       | io.EOF  || empty, nil
-       | yes     | empty       | other   || zero, err
--------|---------|-------------|---------||--------------------
-       | no      | empty       | -       || zero, error TBD
--------|---------|-------------|---------||--------------------
-       | -       | non-empty   | nil     || Read continues...
-       | -       | non-empty   | io.EOF  || zero, io.ErrUnexpectedEOF
-       | -       | non-empty   | other   || zero, err

Original

There is currently no way to disambiguate two possibilities when Read returns a zero value and io.EOF:

  1. a zero value was read just before EOF
  2. EOF was reached immediately, no data was read

This is a problem for types whose zero value is encoded as zero bytes. Take these for example:

[]string{}   encode=> []
[]string{""} encode=> []

[][]string{{}, {}}     encode=> [0x00]
[][]string{{}, {""}}   encode=> [0x00]
[][]string{{""}, {}}   encode=> [0x00]
[][]string{{""}, {""}} encode=> [0x00]

This problem results from the combination of string and slice encodings, not exactly either individually.

phiryll commented 3 months ago

Error semantics must be:

Returning zero and EOF means no value was read and you're at EOF.

If a zero value was read and also EOF returned within a Codec's implementation, return the value but with no error. Subsequent reads should return a zero value and EOF.

Or more simply: If error is non-nil, including io.EOF, the value is meaningless.

If an io.EOF happens after reading some but not all the bytes, Read returns io.ErrUnexpectedEOF.

phiryll commented 3 months ago

To disambiguate a zero value from no value, they must have different encodings. The reasonable solution is that no value (nil) is encoded as zero bytes, and a zero value needs to have some non-empty encoding.

The rest of this is only relevant if a zero value normally would have an empty encoding. For example, uint8 has a non-empty encoding for zero, since that encoding is one byte for all values. This seems to only be relevant for some variable length encodings (but not all, big.Int, e.g.).

The zero value must sort before all non-zero values. In general, the first byte could be 0 (delim) for some variable length encoding. For the zero value to sort before that, it must be just [0]. To disambiguate this from the encoding for "\x00", the latter must either be more than one byte, or it's first byte must be greater than 0.

One alternative is to always escape the contents. This will encode "\x00" as [1, 0]. It might be a good idea to always 0-terminate (unescaped) as well. This would have the side benefit of being able to fail if a string is only partially read.

Another is to prefix the encoding with 0 (zero value) or 1 (non-zero value). This will result in smaller encodings than the escaping solution. So I think that's the winner.

phiryll commented 3 months ago

To disambiguate a zero value from no value, they must have different encodings. The reasonable solution is that no value (nil) is encoded as zero bytes, and a zero value needs to have some non-empty encoding.

Or no value is encoded as [0], zero as [1], and non-zero as [2, ...].

But lets try encoding no value as empty first, see if we run into any problems. And just in case, zero as [1] and non-zero as [2, ...].

phiryll commented 3 months ago

Slight adjustment so prefixes don't require escaping: no value as empty ([2] is reserved) zero as [3] non-zero as [4]

phiryll commented 3 months ago

Document this!

A little more clarity, encode:

This has been done for the string (nil not possible) and slice codecs.

Read behavior

phiryll commented 2 months ago