smee / binary

Clojure API for binary format I/O using Java's stream APIs

Terminated strings #2

Closed zsau closed 10 years ago

zsau commented 10 years ago

Some codecs use null-terminated strings whose length isn't known in advance, which is very awkward to parse at the moment. An optional :suffix or :terminator argument to string and/or repeated would be very useful.

smee commented 10 years ago

Sure, sounds useful. Do you know of a second case where this kind of feature would help? I can't think of anything other than null-terminated strings.

Otherwise I would lean toward adding a specific codec for this case, like:

(defn c-string
  "Zero-terminated string (like in C): a sequence of bytes terminated by a 0 byte."
  [^String encoding]
  (reify BinaryIO
    (read-data [_ big-in _]
      ;; accumulate bytes until the terminating 0 byte, then decode them
      (loop [bytes (transient [])]
        (let [b (.readByte ^DataInput big-in)]
          (if (zero? b)
            (String. (byte-array (persistent! bytes)) encoding)
            (recur (conj! bytes b))))))
    (write-data [_ big-out _ s]
      ;; encode with the same charset used for reading, then append the terminator
      (.write ^DataOutput big-out (.getBytes ^String s encoding))
      (.writeByte ^DataOutput big-out 0))))

zsau commented 10 years ago

Some codecs use null characters as terminators rather than null bytes (ID3v2, for example). Unfortunately, the null character isn't always one byte; in UTF-16, for example, it is two bytes.
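
To make the point concrete, here is a minimal sketch in plain Java interop, independent of this library. The names terminator-bytes and read-terminated are hypothetical, not part of the binary API. It assumes a fixed-width code unit (UTF-8, UTF-16LE/BE, UTF-32LE/BE) and a BOM-free charset name, since plain "UTF-16" would put a byte order mark in front of the encoded terminator. Input is consumed one terminator-sized unit at a time, so that a pair of zero bytes straddling two UTF-16 code units is not mistaken for the terminator.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  DataInput DataInputStream)
        '(java.util Arrays))

(defn terminator-bytes
  "Bytes that encode the null character in `encoding` (hypothetical helper).
  Assumes a BOM-free charset name such as UTF-8 or UTF-16BE."
  ^bytes [^String encoding]
  (.getBytes "\u0000" encoding))

(defn read-terminated
  "Reads from `in` up to and including the encoded null character and
  returns the decoded string without the terminator (hypothetical helper).
  Reads one terminator-sized unit at a time to stay aligned on code units."
  [^DataInput in ^String encoding]
  (let [term        (terminator-bytes encoding)
        width       (alength term)
        ^bytes unit (byte-array width)
        acc         (ByteArrayOutputStream.)]
    (loop []
      (.readFully in unit)
      (if (Arrays/equals unit term)
        (String. (.toByteArray acc) encoding)
        (do (.write acc unit 0 width)
            (recur))))))
```

With this, (alength (terminator-bytes "UTF-8")) is 1 and (alength (terminator-bytes "UTF-16BE")) is 2, which is why a single-byte separator is not enough for formats like ID3v2.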

zsau commented 10 years ago

Thanks, this is very helpful. I notice that when decoding with something like (repeated (string "UTF-8" :separator 0)), any bytes that come after the last null in the stream are read and ignored by the parser. Is there any way to access those bytes?

smee commented 10 years ago

So you are saying your codec keeps reading bytes after it encounters the separator? Maybe you keep calling decode? I added a test that reads bytes separated by a null byte and verifies after each read that the remaining bytes in the stream are untouched. If you can give an example of the buggy behaviour, please file a new bug report.
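
For anyone who wants to reproduce that kind of check outside the library, the following is a rough sketch of the idea using only java.io. It is not the actual test from the repository, and read-c-string and remaining-after-each-read are hypothetical names. It reads null-terminated UTF-8 strings one at a time and records how many bytes are still unread after each read.

```clojure
(import '(java.io ByteArrayInputStream DataInputStream))

(defn read-c-string
  "Reads bytes up to and including the first 0 byte and returns the
  decoded string without the terminator (hypothetical helper)."
  [^DataInputStream in ^String encoding]
  (loop [acc []]
    (let [b (.readByte in)]
      (if (zero? b)
        (String. (byte-array acc) encoding)
        (recur (conj acc b))))))

(defn remaining-after-each-read
  "Performs `n` consecutive reads on `data` and returns a vector of
  [string bytes-still-unread] pairs (hypothetical helper)."
  [^bytes data n]
  (let [in (DataInputStream. (ByteArrayInputStream. data))]
    (vec (repeatedly n #(vector (read-c-string in "UTF-8")
                                (.available in))))))
```

For the ten input bytes of "ab", 0, "cd", 0, "tail" in UTF-8, two reads should report 7 and then 4 unread bytes, so the trailing "tail" stays in the stream for whoever reads next.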