kriszyp / cbor-x

Ultra-fast CBOR encoder/decoder with extensions for records and structural cloning
MIT License
272 stars 33 forks source link

Streams cannot handle root level null types #3

Open Bluebie opened 3 years ago

Bluebie commented 3 years ago

cbor-x Decoder stream will happily decode and push a null value in to the decoder stream, causing the stream to terminate. This can create nasty edge cases where users might control the contents of cbor sequences and are able to create large files which don't properly decode for cbor-x clients.

It's not currently possible to work around this, because cbor-x doesn't export decoder.js in package.json/exports and index.js/node.js files don't export getPosition or clearSource from Decoder so Streams cannot be more robustly implemented externally without forking/vendoring the package.

It would be great for cbor-x to provide a public interface to do partial reads, perhaps by exporting getPosition and clearSource from index/node or by adding decode.js to the package.json exports field.

DecoderStream should adopt at least one of three possible strategies to resolve this: 1) some similar libraries offer a { wrap: true } option with their transform stream constructors, which wraps every value in the object stream with an object { value: <any> }. 2) the stream could throw an error via this.destroy(), to make it explicitly incompatible with streams that have root level null, avoiding such streams appearing to end prematurely and possible messy errors from attempting to push more values after ending the stream with push(null) 3) the stream could push Symbol.for('null') instead of a real null value whenever the next object is a literal null.

Option 1 seems most common, and is able to support older platforms that might not have Symbol.for available, but option 3 likely has better performance, by avoiding allocating an extra object for every push.

Bluebie commented 3 years ago

Another option, which also uses symbols internally, is providing something like a nullcheck method on the DecoderStream interface. This is used in node-cbor: http://hildjj.github.io/node-cbor/Decoder.html#.nullcheck. I'm not a fan of this. It feels messy. I don't know why they're also using symbols to proxy for root level undefined values. Maybe there's a different streams spec which treats undefined specially, but the node:streams don't seem to and the docs only call out null as a special value.

I think it's also worth considering if all this is really worth it. It is common for codecs like this to offer object streams as a solution for this sort of thing, but it might be better to offer a pair of async generators instead. Generators do not suffer from the issue around null having special meaning, and these days they're very interoperable with node streams. They can participate in node streams.pipeline() lists, and all readable streams can be consumed using for await (val of iter), so it's almost isomorphic with streams in a lot of modern code anyway.

A simple implementation of streaming using async generators looks something like this:

/**
 * Create a readable node buffer stream, from an iterable source
 * @param {Array|Iterable|IterableIterator|Generator|AsyncIterable|AsyncIterableIterator|AsyncGenerator}
 * @returns {Readable}
 */
export function iterableToStream (iterable) {
  async function * gen (iter) {
    for await (const entry of iter) {
      yield exports.encode(entry)
    }
  }
  return Readable.from(gen(iterable))
}

/**
 * Create an async iterable from a node stream of buffers or any other [async] iterable source of buffers
 * @param {Readable}
 * @yields {any}
 */
export async function * streamToIterable (readable) {
  const decoder = new Decoder()
  let incomplete

  for await (let chunk of readable) {
    // if there's incomplete data from previous chunk, concatinate and try again
    if (incomplete) {
      chunk = Buffer.concat([incomplete, chunk])
      incomplete = undefined
    }

    let position = 0
    const size = chunk.length
    try {
      yield decoder.decode(chunk, size, true)
      position = getPosition()
      while (position < size) {
        yield decoderModule.read()
        position = getPosition()
      }
    } catch (err) {
      if (err.incomplete) {
        incomplete = chunk.slice(position)
      } else {
        throw err
      }
    } finally {
      clearSource()
    }
  }
}
kriszyp commented 3 years ago

Thank you for the great overview of the options, I will take a look at these, and maybe do some benchmarking of the options.

kriszyp commented 3 years ago

I added your suggestion of using Symbol.for(null) for null for now. But adding async generators/iterators seems like it might be a nice addition. Do you want to put your async generator code into a PR for commit accreditation?

Bluebie commented 3 years ago

I'll have a play around on my fork and see if I can make something nice with the iterator stuff :)