smee / binary

Clojure API for binary format I/O using java's stream apis
74 stars 10 forks source link

Can't omit final separator when encoding #5

Closed zsau closed 10 years ago

zsau commented 10 years ago

(Related to Issue #3)

I'm having trouble encoding a fixed-length string sequence while omitting the final separator:

(defn fixed-string-seq [size]
    (padding (repeated (string "UTF-8" :separator 0)) size))

(let [out (java.io.ByteArrayOutputStream.)]
    (encode (fixed-string-seq 11) out ["abc" "def" "ghi"]))
; IllegalArgumentException Data should be max. 11 bytes, but attempting to write 0 bytes more!  org.clojars.smee.binary.core/padding/reify--1466 (core.clj:302)

The content should be exactly 11 bytes without the trailing null, but it seems the encoder doesn't like this.

smee commented 10 years ago

That's the problem with quick hacks: Commit b8ae7b2bda59de341153d18e976b8b54568a19bf introduced a more lenient parser for separated tokens in the codec repeated: The last separator can be omitted. Sadly, this breaks the assumption, that the binary after the roundtrip "read->write" stays the same. The codec always writes a separator, even if it is the last element. That means given your example above the codec tries to write a 0 after writing "ghi". That results in 12 bytes overall. Since this is bigger than the padding, an error occurs.

The error message looks wrong, though. It should say "...attempting to write 1 bytes more...". I should also update the documentation of repeated with :separator to state, that the last separator is only optional when reading, not when writing.

smee commented 10 years ago

The problem is: When reading for example zero ended strings, how do we know, that the last string has been read completely? Right now we just take the EOF as a signal to stop reading. When writing we could omit the last separator only if we are sure, that there is no other codec writing data afterwards. Because if there would be another one, the next read of this data will fail (this repeated codec would continue reading past the written string boundary since there is no separator token).

zsau commented 10 years ago

In my case, I'm working with a field that has a fixed length and contains some number of null-separated strings (but no final null). I can parse it perfectly thanks to b8ae7b2bda59de341153d18e976b8b54568a19bf, but I can't write the same data with the same codec right now (and this field is buried several levels deep in the overall protocol, so it's not trivial to just swap in a different codec for writing the field).

Ideally, repeated would omit the final separator when it runs out of space to write. Unfortunately, OutputStream doesn't have any concept of capacity, so there's no way for padding to communicate that limit to repeated given the current API, right?

What if repeated had an option to omit the final separator when writing? Alternatively, what if padding had an option to truncate its contents when writing? Either one would allow round-tripping with this type of protocol.

zsau commented 10 years ago

Another option would be for repeated to have two similar options, say :separator and :terminator, where the former means the byte comes between each pair of objects and the latter means it comes after each object.

smee commented 10 years ago

Optionally truncating the bytes written by padding seems to be the cleanest way, imo. Does this work for your case?

zsau commented 10 years ago

That's perfect. Thanks!