sffc opened this issue 2 years ago
I would rather not make it dynamically configurable, since we lose out on the equality property: two VZVs in such a scheme may be logically equal even if not byte-equal. Static configurability with a default seems fine, iirc const generics don't support defaults yet but we can hardcode a constant and wait for that support to eventually exist to switch over.
And yeah, we should be more careful about the fallibility there.
Why can't we just have an invariant where the index width is always the smallest possible? If, when mutating, we pass a boundary either up or down, we just re-compute the index array.
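For concreteness, the "smallest possible" invariant could be computed with something like this sketch (a hypothetical helper, not part of the zerovec API):

```rust
/// Hypothetical helper: the smallest number of bytes needed to store
/// any offset into a data section of `data_len` bytes.
fn smallest_index_width(data_len: usize) -> usize {
    match data_len {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFFFF_FFFF => 4,
        _ => 8,
    }
}

fn main() {
    // Crossing the 256-byte boundary is what forces a wider index array.
    assert_eq!(smallest_index_width(255), 1);
    assert_eq!(smallest_index_width(256), 2);
    assert_eq!(smallest_index_width(70_000), 4);
}
```

Mutation would then compare the width before and after the edit and re-derive the index array when the two differ.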
Hmmm, tempting; though this does cause two problems:

- all operations on VZV get a little bit slower (this might be okay, but I'm very reluctant to introduce branches here)
- mutation can get really expensive, especially if you're dancing around a boundary. This gives us pathological cases for ZeroMap.
Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess.
I do think that the pathological case will be easier to hit for `u8` lengths since it's easy to get to 256 on the entire buffer. What do you think of starting the lengths at `u16`?
We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds)
I think to start we should make `N` a parameter on VZVBorrowed/VZVComponents and a constant, and then try to move forward from there.
> Hmmm, tempting; though this does cause two problems:
>
> - all operations on VZV get a little bit slower (this might be okay, but I'm very reluctant to introduce branches here)
I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.
> - mutation can get really expensive, especially if you're dancing around a boundary. This gives us pathological cases for ZeroMap.
As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when they occur, they should be small.
> Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess.
Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.
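With little-endian indexes, that observation can be sketched like this (hypothetical helper, not the actual zerovec code):

```rust
/// Widen a little-endian index array from `old_w` to `old_w + 1` bytes
/// per entry by appending a zero high byte to each index. The numeric
/// values stay exactly the same; only the layout changes.
fn widen_indices(indices: &[u8], old_w: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(indices.len() / old_w * (old_w + 1));
    for idx in indices.chunks(old_w) {
        out.extend_from_slice(idx);
        out.push(0); // new high byte; value is unchanged
    }
    out
}

fn main() {
    // Two 1-byte indices, 0x05 and 0x2A, widened to 2-byte little-endian.
    assert_eq!(widen_indices(&[0x05, 0x2A], 1), vec![0x05, 0x00, 0x2A, 0x00]);
}
```

Contraction is the mirror image: drop the (known-zero) high byte of each entry.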
> I do think that the pathological case will be easier to hit for `u8` lengths since it's easy to get to 256 on the entire buffer. What do you think of starting the lengths at `u16`?
I was quite excited about the prospect of a `u8` VZV. We have a whole lot of cases that will fit. For example, a `VarZeroVec<str>` containing 7 weekday names or 12 month names will in most cases fit in a `u8` VZV.
> We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds)
Perhaps. This really only works if we start at `u16`, because we need all the space in the first byte with `u8`.
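For illustration, here is one hypothetical way such packing could look, assuming a `u16` field whose top two bits select the index width (this layout is an assumption for the sketch, not a settled design):

```rust
/// Hypothetical packing: top 2 bits of a u16 select the index width,
/// the remaining 14 bits hold the length. This is what "lowers our
/// thresholds": max length drops from 2^16 - 1 to 2^14 - 1.
fn pack(len: u16, width_code: u16) -> u16 {
    debug_assert!(len < (1 << 14) && width_code < 4);
    (width_code << 14) | len
}

fn unpack(packed: u16) -> (u16, u16) {
    (packed & 0x3FFF, packed >> 14) // (length, width_code)
}

fn main() {
    let p = pack(1000, 2);
    assert_eq!(unpack(p), (1000, 2));
}
```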
> I think to start we should make `N` a parameter on VZVBorrowed/VZVComponents and a constant, and then try to move forward from there.
I agree that we can start here.
Note that I don't want to put in work and change APIs to be fallible if we eventually are going to do the auto-sizing approach. So we should keep this issue open until we make the decision on that, even if we implement a `const usize` type parameter in the meantime.
Okay, I'm convinced. Let's start with the baked const (const generic on `VZVComponents`, baked const elsewhere), and try the dynamic approach after that.
We can stop on the way at a const generic approach, but note that we can't currently do `struct VZV<const N: usize = 1>` since defaults on consts are unstable, so I'd prefer we go directly to dynamic (using the const generic as a stepping stone for the implementation, perhaps, but never released).
> I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.
Ah! I was imagining we'd stuff the width into the first byte of the length. This is probably unnecessary.
> As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when they occur, they should be small.
That's fair. I do think it's nice for mutation to be convenient but this is an okay position too.
> Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.
Yeah, I hadn't realized that it only affects the length buffer. This should not be hard to implement since we can "preexpand" or "precontract" the length buffer and then call the code Alyssa wrote. It's also possible but probably not worth it to do it as a one-step thing.
> I was quite excited about the prospect of a `u8` VZV. We have a whole lot of cases that will fit. For example, a `VarZeroVec<str>` containing 7 weekday names or 12 month names will in most cases fit in a `u8` VZV.
Hmmm fair point.
> We can stop on the way at a const generic approach, but note that we can't currently do `struct VZV<const N: usize = 1>` since defaults on consts are unstable, so I'd prefer we go directly to dynamic (using the const generic as a stepping stone for the implementation, perhaps, but never released).
The const generic solution can be similar to what we did with `FormattedString`:

```rust
pub struct VarZeroVecFlex<const N: usize> { ... }
pub type VarZeroVec = VarZeroVecFlex<4>;
```
yeah, i was thinking about that, and I was also thinking about how I'd just reduced the number of public VZV types :smile: (note that we'd need one of these for slice, owned, and cow). Could probably be done in an okayish way by only wrapping the cow one
An algorithm for expanding or contracting the index array can be written in O(N) time, with N = length of the byte buffer.

To contract the index array (say, from 4-byte to 3-byte indexes), walk a tortoise pointer and a hare pointer forward over it; on each step, memmove 3 bytes from the hare pointer to the tortoise pointer. After you've looped over the entire index array, memmove the entire data section to its new position. Every byte has been read from or written to exactly once.

To expand, first memmove the data section to make room for the wider indexes, and then have two cursors walking backwards, copying each index. One additional step is required, which is that the hare cursor must ensure that the high bit is zeroed out.

Another question is how exactly we define the boundary. I see two ways to define it:
We can squeeze more bytes out of definition 2, but it is harder to reason about.
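The O(N) contraction pass could be sketched in safe Rust roughly like this (hypothetical function, not zerovec's implementation; `copy_within` plays the role of the overlapping memmove):

```rust
/// Contract a little-endian index array in place from 4-byte to 3-byte
/// entries, tortoise/hare style: the hare reads each 4-byte index, the
/// tortoise writes its low 3 bytes. Assumes every index fits in 3 bytes.
fn contract_4_to_3(buf: &mut Vec<u8>, n_indices: usize) {
    let (mut tortoise, mut hare) = (0, 0);
    for _ in 0..n_indices {
        // Overlapping copy, like memmove: low 3 bytes keep the value.
        buf.copy_within(hare..hare + 3, tortoise);
        tortoise += 3;
        hare += 4;
    }
    // Move the data section down to sit right after the narrowed indexes.
    let data_len = buf.len() - hare;
    buf.copy_within(hare.., tortoise);
    buf.truncate(tortoise + data_len);
}

fn main() {
    // Two 4-byte indices (1 and 258) followed by data bytes [9, 9].
    let mut buf = vec![1, 0, 0, 0, 2, 1, 0, 0, 9, 9];
    contract_4_to_3(&mut buf, 2);
    assert_eq!(buf, vec![1, 0, 0, 2, 1, 0, 9, 9]);
}
```

Each byte of the old buffer is read once and each byte of the new layout is written once, matching the O(N) claim.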
Some edge cases:
Hmm. I wonder if we should have two width parameters: one for the width of the length field, and the other for the width of the index fields. There is no reason that these two need to be coupled. The length field can be a varint, since it is only one field.
In the VZV buffer:
I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length
> I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length
Let's answer this question first.
If we make the cutoff be on the length of the data array, we can run into cases where there are more elements than the size of the data array (a vector of empty strings). It seems cleaner to decouple the two widths. This isn't about saving a few bytes.
For example, if you have a vector of 100,000 empty strings, you could allocate a 3-byte index array, or you could allocate a 1-byte index array, saving 200k of memory.
I acknowledge that this is an edge case, but it's food for thought to challenge our assumption of coupling these two numbers.
Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs.
> Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs.
Yeah, this is what I had thought the proposal was, which is why I felt that it was just saving a few bytes.
I'm okay with making things fallible: this was what I was envisioning from the start when you were talking about variable-size lengths. I don't care that much about the perf characteristics or API pleasantness of mutation.
I think the fallibility question is the big crux. Vec is not fallible, because it grows to however big it needs, which is as big as `usize` supports. In order to make VarZeroVec infallible, we need to make both the length field and the index array variable-width. If we say that VarZeroVec is fallible, then everyone mutating a VarZeroVec needs to be careful.
In some sense, if VZV is fallible, then we should stick with compile-time sizing, and make developers carefully pick the correct width for their use case.
But if we want VZV to be infallible, then it should scale everything automatically.
Note that 64 bit VZV won't work on WASM anyway, which is why I've been comfy having it be panicky/fallible.
But also, the only stuff that becomes fallible that isn't already fallible is the mutation operations.
That seems ... fine to me? I guess it impacts ZeroMap too, which is unfortunate, but I'm okay with ZeroMap mutation panicking at 32 bit boundaries. Or even being fallible, though that sounds like it will grant us little benefit.
Basically, even if we make it stretchy, the panics will be replaced by OOMs on WASM.
If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap.
I don't know how concerned I am with OOMs. That's something that can happen on any platform, not just WASM (although it's more likely on WASM since it's 32-bit). A discussion about how we should handle OOMs is the topic of a different thread.
> If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap.
I'm fine with both being fallible.
> A discussion about how we should handle OOMs is the topic of a different thread
I don't quite agree: given that the primary use case for these structures is deserialization, being able to create VZVs that will fail to deserialize on some platforms seems like a footgun, and I think that that footgun also exists for ZeroVec. OOMs in general, yes, should be a separate topic, but this is about being able to reliably cause a deserialization failure.
Note that in Rust code this is not actually an OOM, a too-large sequence will typically cause the deserializer to return an error, but it's kind of moot anyway because if you have that much data then getting that data into memory will OOM.
While we can't prevent all such deserialization OOMs, I think it should be a data model error for a zerovec type that does not have a u32 byte length to be created. This would probably mean all mutation and construction operations become fallible, which is unfortunate, but neither are things we particularly care about the performance or ergonomicness of. To some extent I would even be okay with these panicking instead since ICU4X client code would be unlikely to ever be mutating these, but I understand there is some reluctance to that.
Ultimately there are two options I'm okay with:
1. VZV parameterized by the size of the index array, all relevant mutation operations are fallible, and ZM becomes fallible
2. VZV is fully flexible and infallible (OOM aside), and ZM remains infallible
I prefer (1) since it does not incur any additional cost on the reader. It does incur an ergonomic cost on the writer, which is unfortunate, and I'm not super happy about that so I wouldn't be too opposed to (2).
(It's worth checking to see what rkyv does)
I don't understand what you're suggesting about the OOM. If we're deserializing zero-copy, the memory has already been allocated, and we are just pointing to it. There is no memory allocation and therefore no opportunity for an OOM. OOM only happens if you're mutating.
Yeah I did somewhat intend to note that in the "we can't handle all these cases anyway" but I forgot. I think that's fair; I was thinking about streaming deserialization but that's not a thing here anyway, so I guess you're right about OOMs being a red herring.
I still feel like we should strongly avoid imposing costs (even minor ones) on reads when there is a design tradeoff that makes mutations a bit more painful, regardless of the OOM business.
The way to get zero-cost is to use an array or a struct. If you're using VZV, you are already incurring some cost. If it takes an additional 10% on a VZV read operation, and we get a big benefit as a result (infallible VZV that automatically sizes and scales itself), that's an acceptable tradeoff for me.
Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a `u8`-backed VZV. But, for some locales, perhaps those that use non-BMP code points, the month names may exceed the capacity of a `u8`-backed VZV. With a generic parameter, we would need to force all locales to adopt a `u16` VZV, costing data size for everyone.
> If you're using VZV, you are already incurring some cost
Sure, but people who are considering VZV will be making a tradeoff between:
The second option is almost always better, but to me it's still worth optimizing.
> Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a `u8`-backed VZV. But, for some locales, perhaps those that use non-BMP code points, the month names may exceed the capacity of a `u8`-backed VZV. With a generic parameter, we would need to force all locales to adopt a `u16` VZV, costing data size for everyone.
I don't understand how this is relevant: I'm quite happy with a variable-size VZV; what I'm pushing back against is making the length itself an independent variable size. I'd rather it be set as u32. I see it as vanishingly rare that someone will want a VZV with more than a u32's worth of elements.
> I don't understand how this is relevant: I'm quite happy with a variable-size VZV; what I'm pushing back against is making the length itself an independent variable size. I'd rather it be set as u32. I see it as vanishingly rare that someone will want a VZV with more than a u32's worth of elements.
Ok. I was taking the position that either VZV is fully fixed (both max length and index size) or fully flexible (both max length and index size). It sounds like you're also okay with an intermediate solution, where max length is fixed but index size is flexible.
The problem with the intermediate solution is that I think it still means we need fallible APIs. If we need fallible APIs, then the benefit of flexible indexes goes down substantially, since one of the main goals was to have infallible APIs.
We can make a read-efficient varint-like representation for the length, similar to UTF-8. I think Postcard already has something like this available; maybe we could leverage it.
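As a sketch of the general idea (a LEB128-style varint; not necessarily Postcard's exact wire format):

```rust
/// Minimal LEB128-style varint: 7 value bits per byte, high bit set
/// means "more bytes follow". Small lengths cost one byte; the field
/// can still grow to hold any u32.
fn encode_varint(mut n: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

/// Returns the decoded value and the number of bytes consumed.
fn decode_varint(buf: &[u8]) -> (u32, usize) {
    let (mut n, mut shift, mut i) = (0u32, 0, 0);
    loop {
        let byte = buf[i];
        n |= u32::from(byte & 0x7F) << shift;
        i += 1;
        if byte & 0x80 == 0 {
            return (n, i);
        }
        shift += 7;
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_varint(300, &mut buf);
    assert_eq!(buf, vec![0xAC, 0x02]); // 300 fits in two bytes
    assert_eq!(decode_varint(&buf), (300, 2));
}
```

The read cost is a short data-dependent loop, which is the performance concern raised below.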
Yes, Postcard has something like this already. I guess we could try that, I'm still worried about performance and the complexity inhibiting optimizations.
Are we okay as a group making 2^32 a hard cap on the number of elements contained in the data structure? And what do we do if we exceed it?
I am okay with (1) and (2) here, with (3) okay if it's a panic on debug. I personally am fine with panicking on mutation ops but I know we want to reduce panics so (2) is also okay for me. But we should definitely get feedback from the group.
It may be worth noting that if you somehow end up with 2^32 elements in your VZV, there may be an architectural problem you should be solving instead.
(3) with debug panic seems doable.
I have this issue marked discuss-priority.
Discussion:
- the `!` operator to make fallible APIs infallible
- the `!` thing will probably not help here, because the question is whether we will have the error or not

This is fixed by using FlexZeroVec (#1443)
FlexZeroVec is implemented, but we still need to try using it in VarZeroVec.
For 1.0, I want to save FZV-compatible bytes into the postcard files, but we don't need to use FZV yet. This lets us use it with a configurable feature in a compatible way.
At a high level, there are 3 paths:
Decision: pursue option 3. For 1.0, use 16-bit VZVs for most of ICU4X, and 32-bit VZVs for the parts that need it. Extend this to other types of VZVs down the road.
Once #2306 lands we should probably split this into followups about:
Polish-blockers done in #2306
Something else to fix in the context of this issue: When Index16 overflows it says "Attempted to build VarZeroVec out of elements that cumulatively are larger than a u32 in size" which is wrong
Error is being fixed in https://github.com/unicode-org/icu4x/issues/1410
I think you mean #5522?
Yep!
We are currently panicking in potentially unexpected ways when building a VarZeroVec. All code paths to construct a VZV from runtime objects pass through `VarZeroVecOwned::from_elements`, which has the following expectation:

In addition, 32 bits is often much more than we need.
I suggest we consider making the length field in VZV configurable, either statically or dynamically. If we make it dynamically configurable, then we can safely remove all the places where we panic. To do this, we can use the first byte of the VZV buffer to signal how many bytes the indices are (1, 2, 4, 8, or 16... or perhaps even allowing odd numbers like 3 and 5), and then use that number when computing all offsets.
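A rough sketch of what that dynamic scheme could look like when reading (the layout here is hypothetical, purely to illustrate the leading width byte; it is not zerovec's actual format):

```rust
/// Hypothetical layout: buf[0] is the index width in bytes (1, 2, 4, ...),
/// followed by the index array. Offsets are computed using that width.
fn get_offset(buf: &[u8], i: usize) -> usize {
    let width = buf[0] as usize;
    let start = 1 + i * width;
    let mut offset = 0usize;
    // Little-endian read of `width` bytes; the width is a runtime value
    // in a register rather than a compile-time constant.
    for (shift, &b) in buf[start..start + width].iter().enumerate() {
        offset |= (b as usize) << (8 * shift);
    }
    offset
}

fn main() {
    // width = 2, indices [5, 300] stored little-endian.
    let buf = [2u8, 5, 0, 0x2C, 0x01];
    assert_eq!(get_offset(&buf, 0), 5);
    assert_eq!(get_offset(&buf, 1), 300);
}
```

Note the read path has no extra branches, only the dynamic multiply and shift, which is the point debated earlier in the thread.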
If this is not possible, we should at least change all the `From` impls to `TryFrom` so that we aren't panicking inside of VZV in a way that clients cannot catch.

Needs input from: