hugomrdias / rabin-wasm

Rabin fingerprinting implemented in WASM

Question: Is there a reason to return multiple sizes? #206

Closed Gozala closed 2 years ago

Gozala commented 2 years ago

Hey @hugomrdias, I've been trying to define a new chunker interface, which looks something like this:

/**
 * Chunker API can be used to slice up the file content according
 * to specific logic. It is designed with the following goals in mind:
 *
 * 1. Stateless - All state management is handled by the consumer, meaning
 *    it is the consumer's responsibility to slice the buffer to get a new slice.
 *
 * 2. Effect free - Since the chunker does not read from the underlying source,
 *    the consumer is free to perform multiple calls while moving the buffer
 *    offset, or it could read more bytes and perform reads afterwards.
 *
 * 3. Doesn't manage resources - The chunker does not manage any resources, which
 *    guarantees that it cannot use more memory than the consumer desires.
 */
export interface Chunker<T extends Readonly<unknown>> {
  /**
   * Context used by the chunker. It usually represents chunker
   * configuration like max, min chunk size etc. Usually the chunker
   * implementation library will provide a utility function to initialize a context.
   */
  readonly context: T
  /**
   * Chunker takes a `context:T` object, a `buffer` containing bytes to be
   * chunked and an `ended` flag that tells it whether more bytes could be made
   * available in follow-up calls. The chunker is supposed to return a
   * non-negative integer: the number of bytes (from the start of the buffer)
   * that make up the next chunk. A return value of `0` signifies that the
   * buffer contains no valid chunks. Returning negative numbers is not allowed.
   *
   * **Note:** Chunker MAY return `0` even if `ended && buffer.byteLength > 0`.
   * In that case it is the consumer's responsibility to handle the remaining
   * bytes, even though they do not form a chunk.
   */
  cut(context:T, buffer:Uint8Array, ended:boolean):number
}

However, the implementation here seems to eagerly collect all chunks, as opposed to providing an API to do it step by step:

https://github.com/hugomrdias/rabin-wasm/blob/f0cf7ce248a268cc65c389ece6882df25f92fc02/assembly/index.ts#L153-L165

Also, as far as I can tell, the original implementation did not do that, which makes me wonder if there was a specific reason the API diverged here.
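
To make the step-by-step variant concrete, here is a minimal sketch of how a consumer might drive the `cut(context, buffer, ended)` interface above. The `FixedSizeContext`, `fixedSize` and `drain` names are hypothetical and exist only for illustration, and the chunker cuts at a fixed size rather than using Rabin fingerprinting; nothing here is rabin-wasm API.

```typescript
// Shape of the step-by-step interface proposed above.
interface Chunker<T> {
  readonly context: T
  cut(context: T, buffer: Uint8Array, ended: boolean): number
}

// Hypothetical context for illustration: a fixed chunk size, not Rabin.
interface FixedSizeContext {
  readonly size: number
}

// Hypothetical chunker: cuts every `size` bytes and only releases a
// shorter tail once the stream has ended.
const fixedSize = (size: number): Chunker<FixedSizeContext> => ({
  context: { size },
  cut: (context, buffer, ended) => {
    if (buffer.byteLength >= context.size) return context.size
    return ended ? buffer.byteLength : 0
  }
})

// Consumer loop: it owns all state (the offset) and slices the buffer itself.
// Returns the chunks found plus the bytes the chunker left unclaimed.
function drain<T>(
  chunker: Chunker<T>,
  buffer: Uint8Array,
  ended: boolean
): { chunks: Uint8Array[]; rest: Uint8Array } {
  const chunks: Uint8Array[] = []
  let offset = 0
  while (offset < buffer.byteLength) {
    // `subarray` creates a view, so no bytes are copied on the JS side.
    const size = chunker.cut(chunker.context, buffer.subarray(offset), ended)
    if (size === 0) break
    chunks.push(buffer.subarray(offset, offset + size))
    offset += size
  }
  return { chunks, rest: buffer.subarray(offset) }
}
```

With a 10-byte buffer and a chunk size of 4, `drain(fixedSize(4), bytes, false)` yields two chunks and leaves 2 bytes in `rest`, which the consumer would prepend to whatever it reads next. Note that every iteration is a separate `cut` call on a different window of the buffer.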

Gozala commented 2 years ago

After thinking about this a bit more myself, I realize that it makes a lot more sense to return the sizes of all chunks in one call, as that saves the WASM host from copying the remaining bytes back to the start position of shared memory on subsequent calls. That is why I'm revising the chunker API to reflect that:

/**
 * Chunker API can be used to slice up the file content according
 * to specific logic. It is designed with the following goals in mind:
 *
 * 1. Stateless - All state management is handled by the consumer, meaning
 *    it is the consumer's responsibility to slice the buffer to get a new slice.
 *
 * 2. Effect free - Since the chunker does not read from the underlying source,
 *    the consumer is free to perform multiple calls while moving the buffer
 *    offset, or it could read more bytes and perform reads afterwards.
 *
 * 3. Doesn't manage resources - The chunker does not manage any resources, which
 *    guarantees that it cannot use more memory than the consumer desires.
 */
export interface Chunker<T extends Readonly<unknown>> {
  /**
   * Context used by the chunker. It usually represents chunker
   * configuration like max, min chunk size etc. Usually the chunker
   * implementation library will provide a utility function to initialize a context.
   */
  readonly context: T
  /**
   * Chunker takes a `context:T` object and a `buffer` containing bytes to be
   * chunked. The chunker is expected to return an array of chunk byte lengths
   * (from the start of the buffer). An empty array signifies that the
   * buffer contains no valid chunks.
   *
   * **Note:** The consumer of the chunker is responsible for dealing with the
   * remaining bytes in the buffer when the end of the stream is reached.
   */
  cut(context:T, buffer:Uint8Array):number[]
}
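
A rough sketch of how a consumer might use the revised `cut(context, buffer)` signature follows. It assumes the returned lengths describe consecutive chunks starting at offset 0, which is my reading of "from the start of the buffer". The `FixedSizeContext`, `fixedSizeBatch` and `split` names are hypothetical, and the chunker is again a fixed-size stand-in rather than Rabin.

```typescript
// Shape of the revised interface: one call returns all chunk lengths.
interface Chunker<T> {
  readonly context: T
  cut(context: T, buffer: Uint8Array): number[]
}

// Hypothetical context for illustration: a fixed chunk size, not Rabin.
interface FixedSizeContext {
  readonly size: number
}

// Hypothetical chunker: reports every complete `size`-byte chunk in the
// buffer and ignores any shorter tail.
const fixedSizeBatch = (size: number): Chunker<FixedSizeContext> => ({
  context: { size },
  cut: (context, buffer) => {
    const sizes: number[] = []
    const count = Math.floor(buffer.byteLength / context.size)
    for (let i = 0; i < count; i++) sizes.push(context.size)
    return sizes
  }
})

// Consumer: a single `cut` call, then the lengths are turned into views.
// Assumes lengths are consecutive, starting at the beginning of the buffer.
function split<T>(
  chunker: Chunker<T>,
  buffer: Uint8Array
): { chunks: Uint8Array[]; rest: Uint8Array } {
  const chunks: Uint8Array[] = []
  let offset = 0
  for (const size of chunker.cut(chunker.context, buffer)) {
    chunks.push(buffer.subarray(offset, offset + size))
    offset += size
  }
  // Leftover bytes are the consumer's problem: carry them into the next
  // read, or emit them as a final chunk once the stream has ended.
  return { chunks, rest: buffer.subarray(offset) }
}
```

For a 10-byte buffer and a chunk size of 4, `split(fixedSizeBatch(4), bytes)` makes one `cut` call and returns two 4-byte chunks plus a 2-byte `rest`. The buffer crosses the chunker boundary once per read instead of once per chunk, which is the property the revised API is after.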