unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.35k stars 174 forks source link

VarZeroVec capacity and error handling #1410

Open sffc opened 2 years ago

sffc commented 2 years ago

We are currently panicking in potentially unexpected ways when building a VarZeroVec. All code paths to construct a VZV from runtime objects pass through VarZeroVecOwned::from_elements, which has the following expectation:

.expect("Attempted to build VarZeroVec out of elements that \
    cumulatively are larger than a u32 in size")

In addition, 32 bits is often much more than we need.

I suggest we consider making the length field in VZV configurable, either statically or dynamically. If we make it dynamically configurable, then we can safely remove all the places where we panic. To do this, we can use the first byte of the VZV buffer to signal how many bytes the indices are (1, 2, 4, 8, or 16... or perhaps even allowing odd numbers like 3 and 5), and then use that number when computing all offsets.

If this is not possible, we should at least change all the From impls to TryFrom so that we aren't panicking inside of VZV in a way that clients cannot catch.

Needs input from:

Manishearth commented 2 years ago

I would rather not make it dynamically configurable, since we lose out on the equality property: two VZVs in such a scheme may be logically equal even if not byte-equal. Static configurability with a default seems fine, iirc const generics don't support defaults yet but we can hardcode a constant and wait for that support to eventually exist to switch over.

And yeah, we should be more careful about the fallibility there.

sffc commented 2 years ago

Why can't we just have an invariant where the index width is always the smallest possible? If, when mutating, we pass a boundary either up or down, we just re-compute the index array.

Manishearth commented 2 years ago

Hmmm, tempting; though this does cause two problems:

Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess.

I do think that the pathological case will be easier to hit for u8 lengths since it's easy to get to 256 on the entire buffer. What do you think of starting the lengths at u16?

We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds)

I think to start we should make N a parameter on VZVBorrowed/VZVComponents and a constant, and then try to move forward from there.

sffc commented 2 years ago

Hmmm, tempting; though this does cause two problems:

  • all operations on VZV get a little bit slower (this might be okay, but I'm very reluctant to introduce branches here)

I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.

  • mutation can get really expensive, especially if you're dancing around a boundary. This gives us pathological cases for ZeroMap.

As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when the occur, they should be small.

Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess.

Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.

I do think that the pathological case will be easier to hit for u8 lengths since it's easy to get to 256 on the entire buffer. What do you think of starting the lengths at u16?

I was quite excited about the prospect of a u8 VZV. We have a whole lot of cases that will fit. For example, a VarZeroVec<str> containing 7 weekday names or 12 month names will in most cases fit in a u8 VZV.

We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds)

Perhaps. This really only works if we start at u16 because we need all the space in the first byte with u8.

I think to start we should make N a parameter on VZVBorrowed/VZVComponents and a constant, and then try to move forward from there.

I agree that we can start here.

sffc commented 2 years ago

Note that I don't want to put in work and change APIs to be fallible if we eventually are going to do the auto-sizing approach. So we should keep this issue until we make the decision on that, even if we implement a const usize type parameter in the mean time.

Manishearth commented 2 years ago

Okay, I'm convinced. Let's start with the baked const (const generic on VZVComponents, baked const elsewhere), and try the dynamic approach after that.

We can stop on the way at a const generic approach but note that we can't currently do struct VZV<const N: usize = 1> since defaults on consts are unstable, so I'd prefer we go directly to dynamic (using the const generic as a stepping stone for the implementation, perhaps, but never released)

I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.

Ah! I was imagining we'd stuff the width into the first byte of the length. This is probably unnecessary.

As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when the occur, they should be small.

That's fair. I do think it's nice for mutation to be convenient but this is an okay position too.

Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.

Yeah, I hadn't realized that it only affects the length buffer. This should not be hard to implement since we can "preexpand" or "precontract" the length buffer and then call the code Alyssa wrote. It's also possible but probably not worth it to do it as a one-step thing.

I was quite excited about the prospect of a u8 VZV. We have a whole lot of cases that will fit. For example, a VarZeroVec<str> containing 7 weekday names or 12 month names will in most cases fit in a u8 VZV.

Hmmm fair point.

sffc commented 2 years ago

We can stop on the way at a const generic approach but note that we can't currently do struct VZV<const N: usize = 1> since defaults on consts are unstable, so I'd prefer we go directly to dynamic (using the const generic as a stepping stone for the implementation, perhaps, but never released)

The const generic solution can be similar to what we did with FormattedString

pub struct VarZeroVecFlex<const N: usize> { ... }

pub type VarZeroVec = VarZeroVecFlex<4>;
Manishearth commented 2 years ago

yeah, i was thinking about that, and I was also thinking about how I'd just reduced the number of public VZV types :smile: (note that we'd need one of these for slice, owned, and cow). Could probably be done in an okayish way by only wrapping the cow one

sffc commented 2 years ago

An algorithm for expanding or contracting the index array can be written in O(N) time, with N = length of the byte buffer.

  1. When contracting, say from 4 bytes to 3 bytes, have two cursors walking forward through the index array: the tortoise walking 3 bytes at a time, and the hare walking 4 bytes at a time. At every interval, memmove 3 bytes from the hare pointer to the tortoise pointer. After you've looped over the entire index array, memmove the entire data section to its new position. Every byte has been read from or written to exactly once.
  2. When expanding, do this backwards: memmove the data section to make room for the wider indexes, and then have two cursors walking backwards, copying each index. One additional step is required, which is that the hare cursor must ensure that the high bit is zeroed out.
sffc commented 2 years ago

Another question is how exactly we define the boundary. I see two ways to define it:

  1. When the length of the entire buffer exceeds the capacity of the integer type
  2. When the length of the data section exceeds the capacity of the integer type or when the number of elements in the array exceeds the capacity of the integer type

We can squeeze more bytes out of definition 2, but it is harder to reason about.

Some edge cases:

sffc commented 2 years ago

Hmm. I wonder if we should have two width parameters: one for the width of the length field, and the other for the width of the index fields. There is no reason that these two need to be coupled. The length field can be a varint, since it is only one field.

In the VZV buffer:

Manishearth commented 2 years ago

I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length

sffc commented 2 years ago

I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length

Let's answer this question first.

If we make the cutoff be on the length of the data array, we can run into cases where there are more elements than the size of the data array (a vector of empty strings). It seems cleaner to decouple the two widths. This isn't about saving a few bytes.

For example, if you have a vector of 100,000 empty strings, you could allocate a 3-byte index array, or you could allocate a 1-byte index array, saving 200k of memory.

I acknowledge that this is an edge case, but it's food for thought to challenge our assumption of coupling these two numbers.

sffc commented 2 years ago

Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs.

Manishearth commented 2 years ago

Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs.

Yeah this is what I had thought the proposal was which is why I felt that it was saving a few bytes.

I'm okay with making things fallible: this was what I was envisioning from the start when you were talking about variable size leng. I don't care that much about the perf characteristics or API pleasantness of mutation.

sffc commented 2 years ago

I think the fallibility question is the big crux. Vec is not fallible, because it grows to however big it needs, which is as big as usize supports. In order to make VarZeroVec infallible, we need to make both the length field and the index array variable-width. If we say that VarZeroVec is fallible, then everyone mutating a VarZeroVec needs to be careful.

In some sense, if VZV is fallible, then we should stick with compile-time sizing, and make developers carefully pick the correct width for their use case.

But if we want VZV to be infallible, then it should scale everything automatically.

Manishearth commented 2 years ago

Note that 64 bit VZV won't work on WASM anyway, which is why I've been comfy having it be panicky/fallible.

But also, the only stuff that becomes fallible that aren''t fallible already are the mutation operations.

That seems ... fine to me? I guess it impacts ZeroMap too, which is unfortunate, but I'm okay with ZeroMap mutation panicking at 32 bit boundaries. Or even being fallible, though that sounds like it will grant us little benefit.

Basically, even if we make it stretchy, the panics will be replaced by OOMs on WASM.

sffc commented 2 years ago

If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap.

I don't know how concerned I am with OOMs. That's something that can happen on any platform, not just WASM (although it's more likely on WASM since it's 32-bit). A discussion about how we should handle OOMs is the topic of a different thread.

Ultimately there are two options I'm okay with:

  1. VZV parameterized by the size of the index array, and all relevant mutation operations are fallible, and ZM becomes fallible
  2. VZV is fully flexible and infallible (OOM aside), and ZM remains infalible
Manishearth commented 2 years ago

If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap.

I'm fine with both being fallible.

A discussion about how we should handle OOMs is the topic of a different thread

I don't quite agree: given that the primary use case for these structures is deserialization, being able to create VZVs that will fail to deserialize on some platforms seems like a footgun, and I think that that footgun also exists for ZeroVec. OOMs in general, yes, should be a separate topic, but this is about being able to reliably cause a deserialization failure.

Note that in Rust code this is not actually an OOM, a too-large sequence will typically cause the deserializer to return an error, but it's kind of moot anyway because if you have that much data then getting that data into memory will OOM.

While we can't prevent all such deserialization OOMs, I think it should be a data model error for a zerovec type that does not have a u32 byte length to be created. This would probably mean all mutation and construction operations become fallible, which is unfortunate, but neither are things we particularly care about the performance or ergonomicness of. To some extent I would even be okay with these panicking instead since ICU4X client code would be unlikely to ever be mutating these, but I understand there is some reluctance to that.


Ultimately there are two options I'm okay with:

1. VZV parameterized by the size of the index array, and all relevant mutation operations are fallible, and ZM becomes fallible

2. VZV is fully flexible and infallible (OOM aside), and ZM remains infalible

I prefer (1) since it does not incur any additional cost on the reader. It does incur an ergonomic cost on the writer, which is unfortunate, and I'm not super happy about that so I wouldn't be too opposed to (2).

(It's worth checking to see what rkyv does)

sffc commented 2 years ago

I don't understand what you're suggesting about the OOM. If we're deserializing zero-copy, the memory has already been allocated, and we are just pointing to it. There is no memory allocation and therefore no opportunity for an OOM. OOM only happens if you're mutating.

Manishearth commented 2 years ago

Yeah I did somewhat intend to note that in the "we can't handle all these cases anyway" but I forgot. I think that's fair; I was thinking about streaming deserialization but that's not a thing here anyway, so I guess you're right about OOMs being a red herring.

I still feel like we should strongly avoid imposing costs (even minor ones) on reads when there is a design tradeoff that makes mutations a bit more painful, regardless of the OOM business.

sffc commented 2 years ago

The way to get zero-cost is to use an array or a struct. If you're using VZV, you are already incurring some cost. If it takes an additional 10% on a VZV read operation, and we get a big benefit as a result (infallible VZV that automatically sizes and scales itself), that's an acceptable tradeoff for me.

Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a u8-backed VZV. But, for some locales, perhaps those that use non-BMP code points, the month names may exceed the capacity of a u8-backed VZV. With a generic parameter, we would need to force all locales to adopt a u16 VZV, costing data size for everyone.

Manishearth commented 2 years ago

If you're using VZV, you are already incurring some cost

Sure, but people who are considering VZV will be making a tradeoff between:

The second option is almost always better, but to me it's still worth optimizing.

Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a u8-backed VZV. But, for some locales, perhaps those that use non-BMP code points, the month names may exceed the capacity of a u8-backed VZV. With a generic parameter, we would need to force all locales to adopt a u16 VZV, costing data size for everyone.

I don't understand how this is relevant: I'm quite happy with a variable size VZV, what I'm pushing back against is making the length itself an independent variable size; I'd rather it be set as u32. I see it as vanishingly rare that someone will want a VZV with more than u32s worth of elements.

sffc commented 2 years ago

I don't understand how this is relevant: I'm quite happy with a variable size VZV, what I'm pushing back against is making the length itself an independent variable size; I'd rather it be set as u32. I see it as vanishingly rare that someone will want a VZV with more than u32s worth of elements.

Ok. I was taking the position that either VZV is fully fixed (both max length and index size) or fully flexible (both max length and index size). It sounds like you're also okay with an intermediate solution, where max length is fixed but index size is flexible.

The problem with the intermediate solution is that I think it still means we need fallible APIs. If we need fallible APIs, then the benefit of flexible indexes goes down substantially, since one of the main goals was to have infallible APIs.

sffc commented 2 years ago

We can make a read-efficient varint-like representation for the length, similar to UTF-8. I think Postcard already has something like this available; maybe we could leverage it.

Manishearth commented 2 years ago

Yes, Postcard has something like this already. I guess we could try that, I'm still worried about performance and the complexity inhibiting optimizations.

sffc commented 2 years ago

Are we okay as a group making 2^32 a hard cap on the number of elements contained in the data structure? And what do we do if we exceed it?

  1. Panic?
  2. Return an error and make everything fallible?
  3. Drop the element and silently refuse to insert it?
Manishearth commented 2 years ago

I am okay with (1) and (2) here, with (3) okay if it's a panic on debug. I personally am fine with panicking on mutation ops but I know we want to reduce panics so (2) is also okay for me. But we should definitely get feedback from the group.

sffc commented 2 years ago

It may be worth noting that if you somehow end up with 2^32 elements in your VZV, there may be an architectural problem you should be solving instead.

(3) with debug panic seems doable.

I have this issue marked discuss-priority.

sffc commented 2 years ago

Discussion:

sffc commented 2 years ago

This is fixed by using FlexZeroVec (#1443)

sffc commented 2 years ago

FlexZeroVec is implemented, but we still need to try using it in VarZeroVec.

sffc commented 2 years ago

For 1.0, I want to save FZV-compatible bytes into the postcard files, but we don't need to use FZV yet. This lets us use it with a configurable feature in a compatible way.

sffc commented 2 years ago

At a high level, there are 3 paths:

  1. A feature flag somewhere (either zerovec or icu_provider) that changes ALL VZV impls to use fast mode or slow mode, with appropriate metadata added to data requests
  2. Add a metadata byte to VZV so that the same data files can power both the fast and small VZV impls, but size-concisous consumers can build their own small files
  3. Make VZV fallible and generic and decide at each individual call site which one is most appropriate, expecting that in most cases we use a 16-bit version
sffc commented 2 years ago

Decision: pursue option 3. For 1.0, use 16-bit VZVs for most of ICU4X, and 32-bit VZVs for the parts that need it. Extend this to other types of VZVs down the road.

Manishearth commented 2 years ago

Followup: https://github.com/unicode-org/icu4x/issues/2312

Manishearth commented 2 years ago

Once #2306 lands we should probably split this into followups about:

Manishearth commented 2 years ago

Polish-blockers done in #2306

sffc commented 9 months ago

Something else to fix in the context of this issue: When Index16 overflows it says "Attempted to build VarZeroVec out of elements that cumulatively are larger than a u32 in size" which is wrong

Manishearth commented 2 weeks ago

Error is being fixed in https://github.com/unicode-org/icu4x/issues/1410

sffc commented 2 weeks ago

I think you mean #5522?

Manishearth commented 2 weeks ago

Yep!