unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.34k stars 174 forks source link

Data ownership for zero-copy data structs #667

Closed sffc closed 3 years ago

sffc commented 3 years ago

We use 's in data structs to allow them to borrow data via zero-copy deserialization. If data is statically allocated, we simply point to that pre-loaded buffer, as ICU4C does now. However, how do we handle ownership for dynamically loaded data, such as from a filesystem, RPC service, or network request?

Currently, we always return data structs that own their data ('s: 'static), due to #632. When we fix #632, we will need to have a story for how data providers that dynamically load data, like FsDataProvider, fit into the picture.

We also must think about how 'd, the lifetime of the data structs themselves, fits into the picture.

I see ~two~ three directions.

Option 1: FsDataProvider owns the data

Option 1 means FsDataProvider would lazy-load data into memory and then return pointers to it. 's equals the lifetime of the FsDataProvider.

This is challenging because a singleton FsDataProvider may have unbounded memory growth, since we cannot "delete" old data since we can't know if someone still has a reference to it. (See Option 3 for a solution based on reference counting.)

We could limit the scope of this problem by recommending or requiring that:

  1. FsDataProvider can load data for only one locale or component at a time
  2. You create a locally-scoped FsDataProvider instead of creating a global

The other problem with this option is that it basically turns FsDataProvider into an (unbounded) data cache. I was hoping to keep the data cache as a separate component.

Option 2: Data struct owns the data

Option 2 means FsDataProvider would only ever return data that is owned by the data struct, such that the data provider is not tied to the lifetime of the data structs it returns.

This is challenging because after #632 is fixed, we would need a way to programmatically convert a data struct to owned ('s: 'static). We would need to make a trait, say IntoOwned, that converts data structs into 'static, and implement that trait on all data structs and all types that are encapsulated by data structs (like ZeroVec). We would want to make sure that IntoOwned is as efficient as possible, since this would be on a critical path in data loading.

Option 2 could be paired with a cache to help with performance and avoiding repeated loads of data. However, doing so would preclude the possibility of returning 'd (borrowing the data struct), because then we fall into the same trap as Option 1.

Option 3: Reference count the data

With this solution, we remove the 's lifetime from the data provider and "replace" it with a reference-counted buffer. @Manishearth explains how this could work in a reply farther down in this thread.

With reference counted buffers, data caches can safely drop data without invalidating downstream users of that data, because the downstream users will continue to retain an Rc to the data. Once the downstream users are dropped, the reference count will reach 0 and the buffers will be deallocated.

There are two locations where we can add the Rcs:

  1. Option 3.1: Add a variant to ZeroVec and friends which is a Rc<[u8]>
  2. Option 3.2: Store a single Rc<[u8]> alongside the data struct and use self-references

Option 3.1 has several problems, not the least of which is that it forces the Rcs to "infect" all fields of the data struct, which likely comes with a performance impact. The only real disadvantage of Option 3.2 is that self-references require a bit of unsafe code in Rust that may be difficult to reason about. I prefer Option 3.2 and taking the time to verify its correctness.

With this solution, we would also drop 'd, because calling .clone() on a data struct will be a cheap operation, since it won't require allocating any memory.

Summary

~Option 2 fits best with my mental model. However, we may want to support both Options 1 and 2, because they are useful in different cases. For example, a data provider that downloads a whole language pack may wish to adopt Option 1, whereas one that pulls from an RPC service may wish to use Option 2.~

UPDATE: I now think that Option 3.2 is a workable solution that solves our known problems.

CC @Manishearth @nciric

zbraniecki commented 3 years ago

My initial response is that neither of those options really work optimally unfortunately and I don't know if we can have an optional architecture, but I'd like to try. It may also be that the answer differs between Rust clients and FFI clients.

For Rust clients, in my ideal world DataProvider owns data that is loaded. The problem of data growth is real and would be addressed either by FIFO cache, or some Rc/Arc.

The reason I would prefer that is that I imagine clients wanting to "just" load DateTimeFormat, or NumberFormat and use it. Such formatter constructor would then ask DataProvider for data and on the first request, the provider loads the data, and on the consecutive the data is already there. In result I'd expect Provider to return borrowed Cow of the data to the constructor and each DateTimeFormat and NumberFormat instance carries that data around for as long as the provider is alive.

This is also important for runtime processing. I imagine that we may load unparsed patterns, and parse them on the first load, and the consecutive loads would get parsed patterns unless they got GCed out.

I can see a claim that when DTF or NF` go out of scope they decrease the rc count in the provider and if it goes to zero the provider may clean up data from oldest to newest as needed.

The story may not cater well to the FFI scenario where we may need to return owned data over the FFI and allow it to live in the instance.

So, seems like I'm also nudging in the direction of option1+2, but I'm a bit stronger on option 1 than you are.

sffc commented 3 years ago

The problem of data growth is real and would be addressed either by FIFO cache, or some Rc/Arc.

Option 1 is incompatible with a FIFO cache on the FsDataProvider layer, because the data provider cannot know whether a downstream formatter is retaining references to data. Rc/Arc might work; I just don't have a design for it yet. I think it would require that the DataProvider .load() function returns something with an Rc/Arc in it. Also, I feel like if you need Rc/Arc, you can probably come up with an architecture that doesn't require it.

each DateTimeFormat and NumberFormat instance carries that data around for as long as the provider is alive.

What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?

This is also important for runtime processing. I imagine that we may load unparsed patterns, and parse them on the first load, and the consecutive loads would get parsed patterns unless they got GCed out.

+1. This is an easier problem to solve, and it is compatible with both Option 1 and Option 2. I don't want to conflate that problem with the problem of raw data ownership (e.g., strings, ZeroVecs).

zbraniecki commented 3 years ago

What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?

When writing that sentence my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the &dp is longer than the lifetime of the DTF instance. Not sure if this is viable.

sffc commented 3 years ago

What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?

When writing that sentence my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the &dp is longer than the lifetime of the DTF instance. Not sure if this is viable.

Okay. This is actually already the case; DateTimeFormat has a 'd lifetime which is tied to the DataProvider argument.

sffc commented 3 years ago

@Manishearth: An option 3 would be to put a Vec<u8> in the data request, and data is owned by that buffer. But in that case it might make more sense to just return owned data. Going in this direction could allow the caller to do interesting things like using an arena or their own reference counting.

sffc commented 3 years ago

2021-04-22: The only option really off the table is Option 1 without reference counting.

nciric commented 3 years ago

We just finished 2yr long project of removing something like Option 1 from our code base - it would be best if we could avoid cache + reference/pointers combination. Major reason is that it's not really a cache, because you can't evict data.

Idea where we have locale based provider is better, but it's just a thin wrapper around general data provider. Users can still shoot themselves in the foot by creating large number of locale based providers, but in that case it's user mistake.

I would fast forward couple iterations and land on a hybrid model, where in some cases we return a reference to a large dataset (like timezone names), and in other cases we return a value (date patterns). Returning a value works great with caching, and we can implement efficient cache eviction rules in that case.

This approach would require having two types of data providers - reference and value based. We also need to denote, possibly in data, which data set is to be returned as reference and which as value. It can also be combined with locale data providers to limit the data growth.

The question is - are we more concerned with performance hit from complicating data provider logic or potential memory inflation due to user error?

Manishearth commented 3 years ago

I and @sffc have been discussing Rc-based approaches all day, and most of it has been him suggesting approaches and me saying the approach won't work well :smile:

Recap

Firstly, a recap of existing approaches so that we have a list in one place, and their downsides

  1. Option 1: FsDataProvider owns data
    • We have to deal with 'd and 's lifetimes everywhere
    • FsDataProvider can never throw away anything, if you want to manage the lifecycle of that data, make new FsDataProviders
      1. Option 2: Data structs own data
    • Extra parsing/cloning, or Rcing with a cache
    • Still need to care about 's if we're trying to be zero copy
      1. Option 3.1 (new): Have an RcCow (RcZeroVec, etc) which can be owned or Rc, use instead of Cow in data structs
    • This is pervasive and needs us to update everything, and needs self referential lifetimes
      1. Option 3.2 (new, good): Have some way of carrying around data structs along with an Rc to their backing buffer
    • This also needs self referential lifetimes in a complicated way

I won't discuss option 3.1 as much since option 3.2 is better in every way, but we did discuss ideas in that space today. @sffc may write some more about the various options available to us later.

Design of Option 3.2

On to the actual design of Option 3.2, which I may later turn into a design doc to check in.

The rough idea

The typical data struct will look something like this:

struct MyDataStruct<'s> {
   cow: Cow<'s, str>,
   vec: ZeroVec<'s, u32>
   integer: u64,
   // ... 
}

This struct can contain fully owned data, but it may also contain data that has been zero-copy deserialized from a u8 buffer, containing references to that buffer.

Ideally, we'd be able to carry around a struct like

struct SharedMyDataStruct {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: MyDataStruct<'self>
}

such that the MyDataStruct borrows from the rc. We can then cache this buffer in the data provider as necessary. The data provider can also cache MyDataStruct objects alongside the buffer to avoid reparsing.

Of course, 'self doesn't exist in Rust.

Transforming lifetimes

Now, we can implement SharedMyDataStruct using unsafe code to not need 'self. Ultimately this just needs to behave like a MyDataStruct<'self> with a tagged-on destructor. This is easy enough:

struct SharedMyDataStruct {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: MyDataStruct<'static> // the 'static is a lie
}

impl SharedMyDataStruct {
   fn get<'a>(&'a self) -> &'a MyDataStruct<'a> { // note the lifetime!
       // transmute the lifetime of &self.data
   }
}

It's not safe to return MyDataStruct<'_> with lifetime other than 'a there (since we do not statically know the lifetime of the backing buffer, we just know that it is at least 'a)

This is all well and good. Except this can't be done generically -- we will have many data struct types and it would be a huge pain to write this struct for each of them. What we want is:

struct SharedData<T> {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: T<'static> // the 'static is a lie
}

impl<T> SharedData<T> {
   fn get<'a>(&'a self) -> &'a T<'a> {..}
   fn get_mut<'a>(&'a mut self) -> &'a mut T<'a> {..}
}

// some kind of `Deserialize` impl that can construct this object zero-copy?

except of course this isn't expressible, not without GATs. In current Rust there is no way to generically talk about "T, but with a different lifetime" -- T: 'a bounds T but doesn't swap lifetimes.

Build our own GATs

We can, however, manually build this scaffolding:

unsafe trait LifetimeTransform<'a>: 'static {
   type Output: 'a; // MUST be `Self` with the lifetime swapped
   // used by `SharedData::get()`
   fn transform(&'a self) -> &'a Self::Output;
   fn transform_mut(&'a mut self) -> &'a mut Self::Output;
   // used for deserialization. Safety constraint: `Self` must be
   // destroyed before the data `Self::Output` was deserialized
   // from is
   unsafe fn make(Self::Output) -> Self;
}

unsafe impl<'a> LifetimeTransform<'a> for MyDataStruct<'static> {
  type Output = MyDataStruct<'a>
  // impls are transmutes
}

The trait is unsafe to implement since we're imposing an additional constraint on what we expect Output to be. It might be possible to keep the trait safe and just require unsafe in the implementations, but it's easier to err on the side of caution here.

SharedData becomes:

struct SharedData<T> {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: T<'static> // the 'static is a lie
}

impl<T: for<'a> LifetimeTransform<'a>> SharedData<T> {
   fn get<'a>(&'a self) -> &'a T::Output {..}
   fn get_mut<'a>(&'a mut self) -> &'a mut T::Output {..}
}

// some kind of `Deserialize` impl that can construct this object zero-copy using `LifetimeTransform::make()`
// where you zero-copy deserialize a `T`, `LifetimeTransform::make()` it, and wrap it in a `SharedData<T>` with
// the correct `Rc`

Make it pleasant

Firstly we can swap the manual impl of LifetimeTransform with #[derive(LifetimeTransform)]. This custom derive will just use transmutes, but will also assert that all of the child fields have LifetimeTransform implementations that conform to the constraints.

We do care about patching, and SharedData<T> is patchable on its own. We'd make DataPayload<T> be a thin wrapper around SharedData<T>.

We can allow for non-Rc-backed data by simply making SharedData<T> use a nullable rc field. We can also support Arc by making that field be an enum.

We may eventually need a #[derive(IntoFullyOwned)] custom derive that can deep-copy-on-write to construct fully 'static owned data structs.

Major benefits

Lingering concerns

Is get_mut() actually safe? I have this feeling it should return Pin<&mut T::Output>, but I can't figure out a reason why. It should be possible to write to the T, the borrowed data can only be replaced with owned data, not overwritten.

sffc commented 3 years ago

Fixed by #745