kriszyp / cbor-x

Ultra-fast CBOR encoder/decoder with extensions for records and structural cloning
MIT License

[Feature]: Generate blob parts to support embedding blobs (aka files) #57

Closed jimmywarting closed 1 year ago

jimmywarting commented 1 year ago

scroll down to my 3rd comment: https://github.com/kriszyp/cbor-x/issues/57#issuecomment-1335770916

Original post: if I have something that needs to be read asynchronously or with a stream, can I do that? I'm thinking of ways to best support very large Blob/File tags... Here is some wishful thinking:

import { addExtension, Encoder } from 'cbor-x'

let extEncoder = new Encoder()

addExtension({
  Class: Blob,
  tag: 43311, // register our own extension code (a tag code)
  encode (blob, encode) {
    const iterable = blob.stream() // returns an async iterator that yields Uint8Arrays
    encode(iterable) // pass along a generator that yields Uint8Arrays
  },
  async decode (readableByteStream) {
    const blob = await new Response(readableByteStream).blob()
    return blob
  }
})
kriszyp commented 1 year ago

cbor-x is generally synchronous because there are significant performance regressions involved in making things asynchronous. However, one idea is that you could possibly allocate buffer space in an encoding and then stream data into (and out of?) that allocated space (assuming you know the size a priori). Also, are you potentially looking for a way to not hold the entire encoding in memory at once (treat it as a stream)? That often goes hand in hand with the need for asynchronicity.
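A rough sketch of that first idea (illustrative only; nothing here is an existing cbor-x API, and the CBOR byte-string header is written by hand just for the example):

// Sketch: reserve space for a known-size Blob inside an encoding, fill it in later.
async function encodeBlobInline (blob) {
  const size = blob.size
  // CBOR byte-string header: major type 2 with a 4-byte length (0x5a)
  const header = new Uint8Array([
    0x5a,
    (size >>> 24) & 0xff, (size >>> 16) & 0xff, (size >>> 8) & 0xff, size & 0xff
  ])
  // allocate the header plus `size` bytes of placeholder space a priori
  const out = new Uint8Array(header.length + size)
  out.set(header, 0)
  // then stream the blob's content into the reserved region
  // (blob.stream() is async-iterable in NodeJS; browsers may need a reader loop)
  let offset = header.length
  for await (const chunk of blob.stream()) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}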

jimmywarting commented 1 year ago

are you potentially looking for a way to not hold the entire encoding in memory at once (treat it as a stream)?

yes, hopefully (if i can)


Maybe (if it works) I could do something like:

/** @type {WeakMap<FakeUint8Array, Blob>} */
const wm = new WeakMap()

class FakeUint8Array extends Uint8Array {
  constructor(blob) {
    super(0) // a zero-length view; the real data stays in the blob
    wm.set(this, blob)
    // length/byteLength are read-only accessors, so shadow them on the instance
    Object.defineProperties(this, {
      length: { value: blob.size },
      byteLength: { value: blob.size }
    })
  }

  stream() {
    return wm.get(this).stream()
  }
}

addExtension({
    Class: Blob,
    tag: 43311, // register our own extension code (a tag code)
    encode (blob, encode) {
        encode(new FakeUint8Array(blob))
    },
    // TODO: Still needs some work...
    async decode (readableByteStream) {

    }
})

And then my transform streamer (or iterator) could intercept all of the chunks that come through, and if it encounters a FakeUint8Array it could halt my own (async) iterator (or stream) and yield all the blob's chunks?

It seems like something worrisome... I don't know what might go on with TypedArray set, slicing, copying etc...

Preferably I would like to say: here is an ArrayBuffer token/tag, the length is 1024 bytes, and here is an async "read" method... then cbor would just write a normal byte array header (including the length of the buffer) and use my own reader.

cbor-x is generally synchronous because there are significant performance regressions involved in making things asynchronous.

Maybe there is a way for them to co-exist? cbor-x could still be synchronous, but I would have to make sure that my own iterator halts. Maybe I could call a next() and a write() function of some sort? It's possible to take a sync iterator and make it into a hybrid async + sync iterable by wrapping it.

So it would be something like:

// hypothetical API: cbor.decode_as_iterable doesn't exist (yet), this is wishful thinking
async function * hybridDecode (anything) {
  for (let chunk of cbor.decode_as_iterable(anything)) {
    // if it's a promise then you have to wait for it to finish
    if (chunk instanceof Promise) chunk = await chunk
    if (chunk instanceof Uint8Array) yield chunk
    else {
      // otherwise assume it's itself (async) iterable, e.g. a blob stream
      for await (const uint8 of chunk) yield uint8
    }
  }
}
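You could then consume that flattened iterable however you like, for instance with Node streams (still a sketch; hybridDecode and the input are the hypothetical names from above):

import { Readable } from 'node:stream'
import { createWriteStream } from 'node:fs'

// Readable.from accepts async iterables, so the hybrid generator plugs straight in
Readable.from(hybridDecode(someCborInput)).pipe(createWriteStream('./out.bin'))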
jimmywarting commented 1 year ago

I now know how i exactly want this to work instead!

A Blob object represents immutable, raw data; Blobs can represent data that isn't necessarily in a JavaScript-native format, and can point to data on the user's system. If we set aside the fact that Blobs/Files have a last-modified date, a filename and a MIME type attribute, then Blobs are very similar to ArrayBuffers, except that they are immutable and you can't read their content right away. So Blobs and ArrayBuffers are pretty much the same thing!


So when you get a file from e.g. <input type="file"> and do something like

const file = input.files[0]
console.log(file.size) // 2 GiB
const blob = new Blob([file, file])

// 4 GiB (but you still haven't allocated any memory)
// the blob just holds 2 references pointing to where it should read the content from
console.log(blob.size) // 4 GiB

So, I really don't want to read or allocate memory for the content of the blob if I don't need to.

When I want to embed a File/Blob into a CBOR payload, I would like it to simply be the same thing as if I did:

const file = await getFileFromSomewhere()
const arrayBuffer = await file.arrayBuffer()

cbor.encode({ content: arrayBuffer })
// would be the same thing as if i did 
cbor.encode({ content: file })

...So no special tag that says it's a Blob or a File; I just want Blobs to get the same token and length metadata as normal ArrayBuffers (there should be nothing special marking it as either a Blob or a File).

So when I would encode this:

uint8 = cbor.encode({
  content: new Uint8Array([97,98,99]).buffer
})
/*
Then I would get back:
b9000167636f6e74656e7443616263

B9 0001              # map(1)
   67                #   text(7)
      636F6E74656E74 #     "content"
   43                #   bytes(3)
      616263         #     "abc"
*/
// The same result could be produced using a Blob instead:
cbor.encode({
  content: new Blob([ new Uint8Array([97,98,99]).buffer ])
}).toString('hex') === 'b9000167636f6e74656e7443616263'

But instead of giving me back a single large buffer and having to read the content of the blob, I would like to get back blob parts, i.e. an array of Uint8Arrays and/or Blobs. So I would get back something like:

const uint8 = new Uint8Array([97,98,99]) // abc
const blob = new Blob([ uint8 ])
const chunks = cbor.encode({ content: blob })

console.assert(chunks.length === 2, 'number of chunks getting back from encoding is 2')
console.assert(chunks[0] instanceof Uint8Array, 'first piece is a uint8array')
console.assert(chunks[1] === blob, 'the 2nd piece is the same blob i encoded with')

The result of encoding cbor.encode({ content: blob }) would be:

[
  new Uint8Array([ 185, 0, 1, 103, 99, 111, 110, 116, 101, 110, 116, 67 ]),
  new Blob([ 'abc' ])
]

And nothing would ever have to be read into memory; you wouldn't even have to allocate any memory for those blobs. You would get an ultra-fast encoder for adding things such as blobs into the mix.

Now it would of course be up to you to send this payload to a server or a client and/or save it somewhere. How could you do it? Simple, just do something like:

const blob = input.files[0]
// console.log(await blob.text()) // abc
const blobParts = cbor.encode({ content: blob })
console.log(Array.isArray(blobParts)) // true
const finalCborData = new Blob(blobParts)
// const arrayBuffer = await finalCborData.arrayBuffer()
fetch('/upload', { method: 'POST', body: finalCborData })


jimmywarting commented 1 year ago

Sorry for writing sooo much, but I wanted to spell everything out so you can see what I want to happen. I want this logic sooo badly right now!

Blob support now exists in both Deno and NodeJS, and for NodeJS I have also built the fetch-blob library, which lets you get blobs from the file system without ever reading the content of the files until you really need to. So you can do things like blobFromSync(filePath).slice(1024, 1024 + 100) and nothing is ever read into memory. I plan on using this pkg and cbor-x together.

NodeJS plans on implementing something where you are able to get a blob from the file system...

jimmywarting commented 1 year ago

The result could even be something like

[ subArray1, blob, subArray2 ] // blob parts
subArray1.buffer === subArray2.buffer

where both subarrays use the same underlying ArrayBuffer
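For example, just to illustrate the sharing (not actual cbor-x output):

const backing = new Uint8Array(16)          // one allocation
const subArray1 = backing.subarray(0, 12)   // view of bytes 0..11
const subArray2 = backing.subarray(12, 16)  // view of bytes 12..15
subArray1.buffer === subArray2.buffer       // true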

kriszyp commented 1 year ago

This does seem like a good idea. I think one problem with this approach is that I don't necessarily like the type inconsistency of having encode usually return a Buffer except when there is a Blob somewhere in the object graph. Instead, I think it might make more sense to have an encodeAsBlob function:

const blobWithCbor = cbor.encodeAsBlob({ content: blob });

and then blobWithCbor would be the same as the resulting blob you described:

new Blob([
  new Uint8Array([ 185, 0, 1, 103, 99, 111, 110, 116, 101, 110, 116, 67 ]),
  new Blob([ 'abc' ])
]);
jimmywarting commented 1 year ago

Yea, i would like that.

I don't necessarily like the type inconsistency of having encode usually return a Buffer except when there is a Blob somewhere in the object graph. Instead, I think it might make more sense to have an encodeAsBlob function

Yea, I figured as much. I just wanted to describe what I wanted to be able to do, and afterwards we could discuss the method name and/or the options the Encoder constructor would accept.

I can give a few suggestions:

opt 1

I was thinking: what if instead of a blobParts array we returned a Blob object directly? There are some web APIs that return Blobs (and never a blobParts array):

offscreenCanvas.convertToBlob(), canvas.toBlob(), response.blob(), request.blob(), fileHandle.getFile()

...so maybe:

blob = cbor.encodeAsBlob({ content: blob })

opt 2

if we are going to return a

new Blob([
  new Uint8Array([ 185, 0, 1, 103, 99, 111, 110, 116, 101, 110, 116, 67 ]),
  new Blob([ 'abc' ])
])

then maybe a better name for it would be cbor.encodeAsBlobParts(), so as not to confuse it with returning a Blob

opt 3

If we are going to have a new name for encoding to this new format, why not go ahead with something that's more stream/RAM friendly right off the bat? So instead of returning either an array or a Blob, make something that yields the results from a generator function:

// Somewhere in core
cbor.encodeAsBlobIterator = function * (content) {
  while (encoding) {
    yield blobPart
  }
}

const iterable = cbor.encodeAsBlobIterator({ content: blob })

// stream.Readable.from(iterable).pipe(dest) // node solution
// globalThis.ReadableStream.from(iterable).pipeTo(dest) // web stream solution
// plain iterator

for (const blobPart of iterable) {
  // blobPart would be either an ArrayBuffer, Uint8Array or Blob
  // or even a subarray of a Uint8Array
}

This way you could also potentially reuse the same allocated buffer, resetting the offset to zero again.

const chunks = []
for (const blobPart of iterable) {
  if (blobPart instanceof Uint8Array) {
    chunks.push(blobPart)
  } else {
    // it's a blob
  }
}
chunks[0] === chunks[1] // true, if the encoder reuses and re-yields the same view

This ☝️ would maybe be confusing for most users, as they would have to manually slice the chunks themselves if they need to keep them... but it would be a good performance win. Also, with the new iterator-helpers proposal you wouldn't be able to do iterator.toArray(), for instance; instead you would have to use a .map() that slices each chunk.
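Something along these lines (a sketch; encodeAsBlobIterator is the hypothetical generator from opt 3, and .map()/.toArray() come from the iterator-helpers proposal):

// because the generator may reuse its backing buffer, copy each Uint8Array
// chunk before collecting it
const parts = cbor.encodeAsBlobIterator({ content: blob })
  .map(part => part instanceof Uint8Array ? part.slice() : part) // slice() copies the bytes
  .toArray()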

kriszyp commented 1 year ago

I am not sure I understand the difference between your opt 1 & 2. I was suggesting that I return a Blob instead of an array (that can be passed to the Blob constructor), and that looks like your opt 1 & 2 (opt 2 is showing the array passed to the Blob, but still returns a Blob).

And yes, opt 3 would be really good, and I have other uses for this as well (with embedded iterators). This is certainly more complicated though. I am doubtful that I can convert the encode code to generators without significant performance regression (for current encodings). I think it may be viable to have a mechanism that detects embedded blobs/iterators, and throws and reruns with specific knowledge of where a generator is needed (and caches the plan), but again, this will require some experimentation. Also, I don't think the iterator would need to return anything other than Uint8Arrays (that's what transitively streaming embedded Blobs would yield, at least by default, right?)

jimmywarting commented 1 year ago

Opt 1 would return a Blob directly, so you wouldn't have to pass it into a Blob constructor to get a Blob. Opt 2 is just another name suggestion: it would return an array with a mix of both Uint8Arrays and Blob segments.

I don't think the iterator would need to return anything other than Uint8Arrays (that's what transitively streaming embedded Blobs would yield, at least by default, right?)

I don't know what "transitively streaming" means.

Iterators don't return anything, they yield stuff. But yea, I guess they would yield Uint8Arrays. They could also yield Blobs so you don't need to read the content of the file (which is what I want).

Here is some pseudo code:

/**
 * @return {Iterable<Uint8Array | Blob>} returns an iterator that yields either a uint8array or a blob
 */
cbor.encodeAsBlobPartsIterator = function * (content) {
  let offset = x
  ...

  if (current_value instanceof Blob) {
    // Append bytes that says it's a byte array and what the size is
    current_serialized_uint8array.set(...)
    // yield what we already have in store
    yield current_serialized_uint8array.subarray(x, y)
    // yield a blob or a file (could also do `yield current_value.slice()`, but what's the point?)
    yield current_value
    // reset the offset to 0 and start filling in the buffer again to reuse an already allocated arrayBuffer
    offset = 0
  }

  ...
}

ofc you could do something like:

    // stream the blob
    yield * current_value.stream()

Then it would only yield Uint8Arrays, but that also means the generator function would have to be async. That is not really what I would want, and I don't think you want that either. I would rather have a generator that yields both Uint8Arrays and Blobs when it encounters such a value.


doubtful that I can convert the encode code to generators without significant performance regression (for current encodings)

Are generators really that slow? I haven't benchmarked it or anything... Maybe it could be done with callbacks instead? At least with generators you have the option to pause / resume / abort the logic if you encounter an error like an exceeded quota, or a WebSocket send buffer overflowing. Streams also have internal queues and desiredSize.

I think generators add functionality and extra features at the expense of some performance. They make it easy to create readable streams out of iterators when you need them.
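For example (a sketch; encodeAsBlobIterator is the hypothetical generator from the previous comment, and dest is whatever writable stream you already have):

import { Readable } from 'node:stream'

// Readable.from accepts both sync and async iterables,
// so the encode generator can be wrapped in a stream and gets backpressure for free
Readable.from(cbor.encodeAsBlobIterator({ content: blob })).pipe(dest)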

jimmywarting commented 1 year ago

@kriszyp do you have any plans to start working on this? Otherwise I might take it on and send you a PR... if you give me some pointers on how to go about it. (I haven't looked at your source code or contributed anything before.)

kriszyp commented 1 year ago

I do have plans to start working on this (actually I have already started a little bit), as we actually need similar functionality for new features at my job too (more for encoding from iterators, but it will be treated roughly the same as blobs). Of course you are welcome to take a stab at it. FWIW, #61 would also make a helpful PR if you think having a (separate) module to do re-sorting of objects would be useful.

jimmywarting commented 1 year ago

cool, then i will wait for a new feature release 👍 🙂 Ping me when you have a PR ready, i could review it if you like 😉

kriszyp commented 1 year ago

You can take a look at the associated commit for my first pass at an implementation (there is some cleanup to do, but I believe it basically works).

jimmywarting commented 1 year ago

I tested it out just now. I like what I saw so far.

It's able to handle Blobs just fine, as I wished for.

I just don't know if it would be expected for something like

const blob = new Blob(['abc'])
encodeAsIterator([blob])
encodeAsAsyncIterator([blob])

to produce uint8, blob, uint8 vs uint8, uint8, uint8, uint8, as one contains a Blob and the other does not... but I suppose that is fine...


I have a bit of a concern about this when I saw // start indefinite array https://github.com/kriszyp/cbor-x/blob/3f2e2a28afa5a3e42c9f5345c7884791c03aad02/encode.js#L762-L780 and having that ☝️ at the very top.

Many things have Symbol.iterator (even arrays, strings, Map and Set, typed arrays), and those things have a known length.

so the following provides two different results:

const it = encodeAsIterator([123])
for (let chunk of it) console.log(chunk)

console.log('----')

console.log(encode([123]))
# output:

Uint8Array(1) [ 159 ]
Uint8Array(2) [ 24, 123 ]
Uint8Array(1) [ 255 ]
----
Uint8Array(3) [ 129, 24, 123 ]

I do not see any reason why the iterator could not produce the same result as encode([123])... but as:

Uint8Array(1) [ 129 ]
Uint8Array(2) [ 24, 123 ]

...or even one single Uint8Array (as long as it has some room for it). Maybe it would be better to move the if (object[Symbol.iterator]) { check further down?


this one was a bit unexpected:

const blob = new Blob(['abc'])
// const ab = await blob.arrayBuffer()
const sync = encodeAsIterator(blob)
const asy = encodeAsAsyncIterator(blob)

for await (const c of sync) console.log(c)
console.log('---')
for await (const c of asy) console.log(c)

i got:

Blob { size: 3, type: '' }
---
Uint8Array(1) [ 67 ]
Uint8Array(3) [ 97, 98, 99 ]

but i expected to get

Uint8Array(1) [ 67 ]
Blob { size: 3, type: '' }
---
Uint8Array(1) [ 67 ]
Uint8Array(3) [ 97, 98, 99 ]

I think you forgot to do this in the sync iterator as well: https://github.com/kriszyp/cbor-x/blob/3f2e2a28afa5a3e42c9f5345c7884791c03aad02/encode.js#L829

jimmywarting commented 1 year ago

There is just one thing I'm wondering about... right now when I use encodeAsIterator it yields quite a lot of small uint8 subarrays (but with the same underlying ArrayBuffer, of course)... Is there any reason why it can't yield a single larger concatenated Uint8Array?

I bet some things could be sped up if cbor-x just gave me one single Uint8Array that could include many values at once. But I understand if it doesn't do that... then it would be more like the way the "replacer" works in JSON.stringify(value, replacer): you get a value as fast as it's available, so it should be more stream friendly and pause-able. I suppose there are pros/cons to both methods, yielding everything as soon as it is ready vs. only when it needs to make more room.
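In the meantime, a small wrapper on the consumer side could coalesce the small chunks into bigger ones (just a sketch; the 64 KiB threshold is arbitrary):

// coalesce many small Uint8Array chunks into fewer, larger ones before yielding
function * coalesce (iterable, minSize = 64 * 1024) {
  let pending = []
  let pendingSize = 0
  for (const chunk of iterable) {
    if (!(chunk instanceof Uint8Array)) { // e.g. a Blob: flush what we have, pass it through
      if (pendingSize) { yield concat(pending, pendingSize); pending = []; pendingSize = 0 }
      yield chunk
      continue
    }
    pending.push(chunk.slice()) // copy, since the encoder may reuse its buffer
    pendingSize += chunk.byteLength
    if (pendingSize >= minSize) { yield concat(pending, pendingSize); pending = []; pendingSize = 0 }
  }
  if (pendingSize) yield concat(pending, pendingSize)
}

function concat (chunks, size) {
  const out = new Uint8Array(size)
  let offset = 0
  for (const c of chunks) { out.set(c, offset); offset += c.byteLength }
  return out
}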

jimmywarting commented 1 year ago

Just taking a quick glance at the commit, it feels like quite a lot of logic is duplicated. I bet there could be some smart way of combining the two into one.

The async iterator could just be a "Map" transformer. Something as simple as:

async function * encodeAsAsyncIterator(value) {
  const syncIterator = encodeAsIterator(value)

  for (const chunk of syncIterator) {
    if (chunk instanceof Uint8Array) yield chunk
    else if (chunk instanceof Blob) {
      yield * readableStreamToIterator(chunk.stream())
      // or just `yield * chunk.stream()` if it were supported... (only works in NodeJS atm)
      // or alternatively add a polyfill for ReadableStream[Symbol.asyncIterator]
    }
    else {
      // or maybe it's a read function that returns an async iterator? where something could have `yield blob.stream`
      // const read = chunk
      // yield * read()

      // or maybe it's another async iterator
      // const reader = chunk
      // yield * reader
    }
  }
}

fyi, here is a nice polyfill: https://github.com/ThaUnknown/fast-readable-async-iterator/blob/main/index.js

kriszyp commented 1 year ago

I have a bit of a concern about this when I saw // start indefinite array

Yes, you are right, I had not intended to use indefinite length for arrays, that should be fixed.

I think you forgot to do this in the sync iterator as well:

Yes, that should be fixed.

There is just one thing I'm wondering about... right now when I use encodeAsIterator it yields quite a lot of small uint8 subarrays (but with the same underlying ArrayBuffer, of course)... Is there any reason why it can't yield a single larger concatenated Uint8Array?

Yes, you are right, that was very inefficient. The latest version should be much more efficient about collecting bytes and returning them in larger chunks.

Just taking a quick glance at the commit, it feels like quite a lot of logic is duplicated. fyi, here is a nice polyfill: https://github.com/ThaUnknown/fast-readable-async-iterator/blob/main/index.js

If the goal is to reduce unnecessary code, it doesn't seem like the 12 lines of code in the polyfill is an improvement on my 5 lines of code. Or is there something else you are wanting here?

jimmywarting commented 1 year ago

👍

it doesn't seem like the 12 lines of code in the polyfill is an improvement on my 5 lines of code. Or is there something else you are wanting here?

Hmm, disregard that, I just saw that you do kind-of-ish what I already suggested.

        async function* encodeObjectAsAsyncIterator(value, iterateProperties) {
            for (let encodedValue of encodeObjectAsIterator(...)) {
jimmywarting commented 1 year ago

I'm happy with where it's at. I guess the only thing left to do is to write some documentation for how to use it.

jimmywarting commented 1 year ago

Now it is very simple to encode large files in a CBOR format. It makes it very easy to write some kind of Tar/Zip-like archive format where you can just concat lots of files together.

new Blob([ ...encodeAsIterator(fileList) ]).stream().pipeTo(dest)

fetch('/upload', { method: 'post', body: new Blob([ ...encodeAsIterator(fileList) ]) })

It's a good alternative to the old FormData, which needs a boundary. Decoding multipart/FormData payloads requires scanning for the start/end boundary instead of simply letting x amount of bytes pass through. Decoding would be a lot easier/faster if every entry just had an extra content-length header, and you would also know how large each file is before processing each field. I know because I have worked on node-fetch and FormData for NodeJS, and also lent a helping hand to busboy and formidable.
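To illustrate the difference (the bytes here are purely illustrative):

// a CBOR byte string is length-prefixed, so a decoder knows up front how many
// bytes to pass through (or skip) -- no boundary scanning needed
// 0x5a = bytes with a 4-byte length; 0x00100000 = 1 MiB of raw data follows
const header = new Uint8Array([0x5a, 0x00, 0x10, 0x00, 0x00])
// multipart/form-data instead has to scan every byte for the "--boundary" marker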

Decoding FormData is pretty easy in NodeJS v18: now that you no longer need any library to decode FormData payloads, it's possible to just do:

const fd = await new Response(incomingReq, { headers: incomingReq.headers }).formData()
for (const entry of fd) { ... }
jimmywarting commented 1 year ago

Now, decoding CBOR with 4+ GiB large files, on the other hand, could that be improved somehow? I bet you could not just simply return an ArrayBuffer (as its size is capped).

I see two options for how decoding could work; I'm thinking of something like the reviver in JSON.parse(str, reviver):

const ab = new ArrayBuffer(1024)
const blob = new Blob([ab])
const cborPayload = new Blob([ encode({ content: blob }) ])

// design 1, solution of reading a blob
decode(cborPayload, (t) => {
  if (t.token === 'byteArray') {
    // return a slice of the original cborPayload
    // and skip reading `t.size` --- sets: t.offset += t.size 
    return cborPayload.slice(t.offset, t.offset + t.size)
  }
}).then(result => { ... })

// design 2, solution of reading a blob
decode(cborPayload, async (t) => {
  if (t.token === 'byteArray') {
    const iterator = t.createReadableIterator()
    const rs = ReadableStream.from(iterator)
    const root = await navigator.storage.getDirectory()
    const fileHandle = await root.getFileHandle(t.key, { create: true })
    const wr = await fileHandle.createWritable()
    await rs.pipeTo(wr)
    return fileHandle.getFile()
  }
}).then(result => { ... })

I do not know... maybe you want to read a stream or iterator instead...

decode(cborPayload.stream(), async (t) => {

Either way... some kind of way to fuzzy search or jump/skip reading x amount of bytes would be a cool lower-level solution for anyone who wishes to search inside of CBOR.

I'm just shooting out ideas. Maybe it should be dealt with using custom tags instead of something like a JSON-parser-style reviver.

But this is a topic for another issue/feature idea.

jimmywarting commented 1 year ago

I realize I should probably try to write my own tag extension to really learn how cbor-x works and what it's capable of. I haven't done that yet, and I probably should.

I think I want to try and write a tag extension for Blob/File representation now, kind of like:

addExtension({
    Class: File,
    tag: 43311, // register our own extension code (a tag code)
    encode (file, encode) {
        // define how your custom class should be encoded
        encode([file.name, file.lastModified, file.type, file.slice()]);
    },
    decode([ name, lastModified, type, arrayBuffer ]) {
        // define how your custom class should be decoded
        return new File([arrayBuffer], name, { type, lastModified } )
    }
});

I haven't tried this ☝️ yet, but I assume that's how you write extensions. I bet you would be able to override how byte arrays (ArrayBuffer) are decoded; maybe you would be able to return a Blob instead when decoding.

jimmywarting commented 1 year ago

Would it be a circular problem if I tried to write something like this, now that cbor-x supports encoding Blobs per this particular new feature you have implemented?

addExtension({
    Class: Blob,
    tag: 43311, // register our own extension code (a tag code)
    encode (blob, encode) {
        encode([ blob.type, blob.slice() ]);
    },
    decode([ type, arrayBuffer ]) {
        return new Blob([arrayBuffer], { type } )
    }
})
kriszyp commented 1 year ago

The encodeAsIterator (and encodeAsAsyncIterator) should be published in v1.5.0.

And yes, I think it would be nice to eventually support iterative decoding in the future as well, which could allow for decoding stream/iterators with >4GB of data (and I think another valuable use case could be progressively decoding remote content without having to wait for all data to download). And yes, it would make sense that if you were decoding a stream or blob, that any embedded binary data would likewise be returned as a stream or blob.

jimmywarting commented 1 year ago

decoding remote content without having to wait for all data to download

As in partial HTTP range requests 🙂

jimmywarting commented 1 year ago

Hi again. I've got one small request, if you wouldn't mind. I tried using my fetch-blob implementation, and the issue with it is that it's more arbitrary than the native blob implementations (in the sense that you can create Blobs that are backed by the filesystem, or any blob look-alike), so they are not really instances of NodeJS's own Blob class.

So this constructor === BlobConstructor check will always be false when using fetch-blob, or any other blob polyfill for that matter.

I have thought about extending the native NodeJS built-in Blob and overriding all its properties, but I can't really do that.

So I was wondering if you could maybe do duck-type checking to see if the object matches a blob signature?

import { Blob } from 'node:buffer'
import { blobFromSync } from 'fetch-blob'

/** @returns {object is Blob} */
const isBlob = object => /^(Blob|File)$/.test(object?.[Symbol.toStringTag])

const readme = blobFromSync('./package.json')

isBlob( readme ) // true
isBlob( new Blob() ) // true

readme instanceof Blob // false

I really do wish https://github.com/nodejs/node/issues/37340 got resolved, or that something like Blob.from(...) ever becomes a thing.

jimmywarting commented 1 year ago

FYI, I just want to share that NodeJS has shipped fs.openAsBlob(path[, options]).
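So something like this should now work without any polyfill (a sketch; the file path is made up, and encodeAsAsyncIterator is the export discussed above):

import { openAsBlob } from 'node:fs'
import { encodeAsAsyncIterator } from 'cbor-x'

// openAsBlob returns a Blob backed by the file system -- nothing is read yet
const blob = await openAsBlob('./big-video.mp4')

// the file's bytes are only streamed while the async iterator is consumed
for await (const chunk of encodeAsAsyncIterator({ content: blob })) {
  process.stdout.write(chunk) // or pipe the chunks to an HTTP request, a file, etc.
}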