ipld / js-car

Content Addressable aRchive format reader and writer for JavaScript
Other
46 stars 7 forks source link

feat: buffered writer #70

Closed Gozala closed 2 years ago

Gozala commented 2 years ago

Implement #69

rvagg commented 2 years ago

Yeah, this is neat, I like the API I think. I like that you could use the class directly to write into your own buffer too. It's just the issues around size estimation that are a concern. That offset shuffle is a bit nasty and it would be nice if we didn't have to have it since this code is going to so much effort to get it right. In fact I'd also be fine with making the API strictly require you estimate correctly ahead of time to avoid this invisible nastiness entirely.

The other issue I see with mis-estimation is that I you want to use the class directly on your own buffers, then if you overestimate your header size then you get back a subarray that doesn't start at the begining of your buffer - which may be unexpected. Not a big deal but maybe you want to write a CAR directly up against some other data already in your buffer and to avoid the padding that an overestimation will create you'll have to get your estimates spot-on - and it'd be tricky to tell if you got it wrong. Another case for just removing the possibility of getting it wrong - throw if you misestimate.

Gozala commented 2 years ago

Yeah, this is neat, I like the API I think. I like that you could use the class directly to write into your own buffer too. It's just the issues around size estimation that are a concern. That offset shuffle is a bit nasty and it would be nice if we didn't have to have it since this code is going to so much effort to get it right. In fact I'd also be fine with making the API strictly require you estimate correctly ahead of time to avoid this invisible nastiness entirely.

Yeah moving bytes isn't ideal indeed. Maybe there needs to be a separate API for one use case we have that is:

  1. We encode user input into CAR packets of 256MiB or less.
  2. We do not know how many roots there will be, as it depends on input

With the current API thinking is you could make a rough guess for amount of roots and start writing, if you have less roots you may be able to make space for more blocks, if you have more roots you may be able to make room for them by reducing number of blocks.

Unfortunately I can not think of any better way to do this, other than writing blocks from the end of the buffer, but that is not great either because order in which we encode is it's natural order. The way I think we may go about it is via single synthetic root linking to desired roots, that way synthetic root could be the last one and we could make pretty accurate estimates.

Anyway if you think it's best to avoid moving bytes in close, I can drop that and leave it up to use space to deal with that code. In that case it would be nice to have a helper function that can at least copy stuff into larger buffer without having to reencode everything on wrong estimate.

Gozala commented 2 years ago

The other issue I see with mis-estimation is that I you want to use the class directly on your own buffers, then if you overestimate your header size then you get back a subarray that doesn't start at the begining of your buffer - which may be unexpected.

I am not sure it is that bad, although I agree it can be unexpected. Alternatively we could slide bytes blocks towards the head, but I thought I'd leave it up the user to decide. What do you think ?

Not a big deal but maybe you want to write a CAR directly up against some other data already in your buffer and to avoid the padding that an overestimation will create you'll have to get your estimates spot-on - and it'd be tricky to tell if you got it wrong.

I'm not sure it would be tricky, you just have to compare the byteOffset, but I do agree that one may not expect it to be different. For what it's worth that was a reason you pass ArrayBuffer and get back Uint8Array, hope was that this would make things more clear.

Another case for just removing the possibility of getting it wrong - throw if you misestimate.

I don't like throwing idea here because you did mutate the buffer but then you threw and left buffer dirty.

I think here are several alternative options to consider:

  1. We pad with raw block just to align things.
  2. We move blocks instead to align.
  3. We move header to the front to align.

I felt like 3rd option was best as it has least overhead and user could still revert back to 1 or 2 if so desired.

rvagg commented 2 years ago

I don't like throwing idea here because you did mutate the buffer but then you threw and left buffer dirty.

yeah, good point, but that might be an argument for ensuring that you can properly get your header length estimate right, or at least never underestimate. If we can make the API solid in those terms, i.e. if you use it properly then you'll never underestimate, then it becomes a user problem - don't give bad estimation values up-front or you'll end up in a bad place (in which case I still don't mind throwing, you made a boo-boo, consider it programmer error and therefore reasonably fatal).

Gozala commented 2 years ago

yeah, good point, but that might be an argument for ensuring that you can properly get your header length estimate right, or at least never underestimate. If we can make the API solid in those terms, i.e. if you use it properly then you'll never underestimate, then it becomes a user problem - don't give bad estimation values up-front or you'll end up in a bad place (in which case I still don't mind throwing, you made a boo-boo, consider it programmer error and therefore reasonably fatal).

Ok so I have spend bit more time on this and I think I have reasonable solution

  1. I have addressed header estimate issues you've caught (thanks for pointing those out)
  2. I made close throw exception if estimates is incorrect.
  3. I have added an option to close close({ align: true }) which if passed will move bytes when possible and ensure that head starts at the beginning of the buffer and is directly followed by the blocks.

I think that is reasonable compromise, as it fails by default but provides a way to recover when possible.

rvagg commented 2 years ago

👌 nice

re my last comment about CID sizing - it seems like low hanging fruit now you have the logic partially done, but I wouldn't consider it essential; we should just identify this as something to fix later one when we get a proper CBOR sizer and add it in here (I wrote one today for cborg but ran into a nasty bug in the cbor encoder triggered in the test suite that got me side tracked).

Gozala commented 2 years ago

@rvagg I have merged your #71 and refactored this PR to remove prior calculations. I have also changed estimateHeaderSize to estimateHeaderLength(rootCount:number, rootByteLength?:number):number as you've suggested.

I think with that all the issues should be addressed now

rvagg commented 2 years ago

oh, the ts compiler doesn't like some of my @param additions for docs, those can be removed but it might be nice if you could figure out a compromise to get params documented a little bit nicely

github-actions[bot] commented 2 years ago

:tada: This PR is included in version 4.1.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: