Closed sffc closed 3 years ago
My initial response is that neither of those options really works optimally, unfortunately, and I don't know if we can have an optimal architecture, but I'd like to try. It may also be that the answer differs between Rust clients and FFI clients.
For Rust clients, in my ideal world the DataProvider owns the data that is loaded. The problem of data growth is real and would be addressed either by a FIFO cache or some Rc/Arc.
The reason I would prefer that is that I imagine clients wanting to "just" load DateTimeFormat or NumberFormat and use it. Such a formatter constructor would then ask the DataProvider for data; on the first request the provider loads the data, and on consecutive requests the data is already there.
As a result, I'd expect the provider to return a borrowed `Cow` of the data to the constructor, and each `DateTimeFormat` and `NumberFormat` instance carries that data around for as long as the provider is alive.
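A minimal sketch of that model, with all names hypothetical (this is not the actual ICU4X API): the provider owns the loaded data and hands the constructor a borrowed `Cow`, so the formatter's lifetime is tied to the provider's.

```rust
use std::borrow::Cow;

// Hypothetical provider that owns its loaded data.
struct Provider {
    decimal_pattern: String, // imagine this was lazily loaded from disk
}

impl Provider {
    // Returns a borrowed Cow; the data stays owned by the provider.
    fn load_decimal_pattern(&self) -> Cow<'_, str> {
        Cow::Borrowed(&self.decimal_pattern)
    }
}

// The formatter borrows from the provider, so it cannot outlive it.
struct NumberFormat<'d> {
    pattern: Cow<'d, str>,
}

impl<'d> NumberFormat<'d> {
    fn try_new(provider: &'d Provider) -> Self {
        NumberFormat {
            pattern: provider.load_decimal_pattern(),
        }
    }
}
```

The borrow checker then enforces the "as long as the provider is alive" part: dropping the provider while a `NumberFormat` still exists is a compile error.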
This is also important for runtime processing. I imagine that we may load unparsed patterns and parse them on the first load, and consecutive loads would get parsed patterns unless they got GCed out.
I can see a claim that when `DTF` or `NF` go out of scope, they decrease the refcount in the provider, and if it goes to zero the provider may clean up data from oldest to newest as needed.
The story may not cater well to the FFI scenario where we may need to return owned data over the FFI and allow it to live in the instance.
So, it seems like I'm also nudging in the direction of Option 1 + 2, but I'm a bit stronger on Option 1 than you are.
> The problem of data growth is real and would be addressed either by FIFO cache, or some Rc/Arc.
Option 1 is incompatible with a FIFO cache on the FsDataProvider layer, because the data provider cannot know whether a downstream formatter is retaining references to data. Rc/Arc might work; I just don't have a design for it yet. I think it would require that the DataProvider `.load()` function return something with an Rc/Arc in it. Also, I feel like if you need Rc/Arc, you can probably come up with an architecture that doesn't require it.
> each `DateTimeFormat` and `NumberFormat` instance carries that data around for as long as the provider is alive.
What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?
> This is also important for runtime processing. I imagine that we may load unparsed patterns, and parse them on the first load, and the consecutive loads would get parsed patterns unless they got GCed out.
+1. This is an easier problem to solve, and it is compatible with both Option 1 and Option 2. I don't want to conflate that problem with the problem of raw data ownership (e.g., strings, ZeroVecs).
> What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?
When writing that sentence, my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the `&dp` be longer than the lifetime of the `DTF` instance.
Not sure if this is viable.
> > What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?
>
> When writing that sentence my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the `&dp` is longer than the lifetime of the `DTF` instance. Not sure if this is viable.
Okay. This is actually already the case; `DateTimeFormat` has a `'d` lifetime which is tied to the DataProvider argument.
@Manishearth: An option 3 would be to put a `Vec<u8>` in the data request, so the data is owned by that buffer. But in that case it might make more sense to just return owned data. Going in this direction could allow the caller to do interesting things like using an arena or their own reference counting.
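A rough sketch of that direction (all names hypothetical, not the actual ICU4X API): the request carries a caller-owned buffer, so the caller decides how the bytes are allocated and managed (plain `Vec`, arena, reference counting, ...).

```rust
// Hypothetical "option 3" request: the caller supplies the buffer that
// will own the loaded bytes.
struct DataRequest {
    key: String,
    buffer: Vec<u8>,
}

fn load_into(req: &mut DataRequest) {
    // A real provider would read the resource for `req.key` from disk
    // or the network; here we just fabricate some bytes.
    req.buffer.extend_from_slice(req.key.as_bytes());
}
```

The provider never retains the data, so its lifetime is fully decoupled from the data's.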
2021-04-22: The only option really off the table is Option 1 without reference counting.
We just finished a two-year-long project of removing something like Option 1 from our code base; it would be best if we could avoid the cache + reference/pointer combination. The major reason is that it's not really a cache, because you can't evict data.
The idea where we have a locale-based provider is better, but it's just a thin wrapper around the general data provider. Users can still shoot themselves in the foot by creating a large number of locale-based providers, but in that case it's a user mistake.
I would fast-forward a couple of iterations and land on a hybrid model, where in some cases we return a reference to a large data set (like time zone names), and in other cases we return a value (date patterns). Returning a value works great with caching, and we can implement efficient cache-eviction rules in that case.
This approach would require having two types of data providers: reference-based and value-based. We would also need to denote, possibly in the data, which data set is to be returned as a reference and which as a value. It can also be combined with locale data providers to limit data growth.
The question is: are we more concerned with the performance hit from complicating the data provider logic, or with potential memory inflation due to user error?
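The hybrid model above could be sketched with a payload type like the following (hypothetical types, not an actual ICU4X design): large data sets are handed out by reference (cheap to share, hard to evict), while small ones are returned by value (copied out, so the cache behind them can be evicted freely).

```rust
use std::rc::Rc;

// Hypothetical hybrid payload for a provider that mixes reference-based
// and value-based data.
enum Payload {
    Reference(Rc<str>), // e.g. time zone names: large, shared data set
    Value(String),      // e.g. date patterns: small, cacheable, evictable
}

fn as_str(payload: &Payload) -> &str {
    match payload {
        Payload::Reference(s) => s,
        Payload::Value(s) => s,
    }
}
```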
@sffc and I have been discussing `Rc`-based approaches all day, and most of it has been him suggesting approaches and me saying the approach won't work well :smile:
Firstly, a recap of existing approaches, so that we have a list in one place, and their downsides:

- Option 1: `FsDataProvider` owns data
  - `'d` and `'s` lifetimes everywhere
  - `FsDataProvider` can never throw away anything; if you want to manage the lifecycle of that data, make new `FsDataProvider`s
- Option 3: `Rc`-ing with a cache; still needs `'s` if we're trying to be zero copy
  - Option 3.1: an `RcCow` (`RcZeroVec`, etc.) which can be owned or `Rc`, used instead of `Cow` in data structs
  - Option 3.2: data structs hold an `Rc` to their backing buffer

I won't discuss Option 3.1 as much, since Option 3.2 is better in every way, but we did discuss ideas in that space today. @sffc may write some more about the various options available to us later.
On to the actual design of Option 3.2, which I may later turn into a design doc to check in.
The typical data struct will look something like this:
```rust
struct MyDataStruct<'s> {
    cow: Cow<'s, str>,
    vec: ZeroVec<'s, u32>,
    integer: u64,
    // ...
}
```
This struct can contain fully owned data, but it may also contain data that has been zero-copy deserialized from a `u8` buffer, containing references to that buffer.
Ideally, we'd be able to carry around a struct like
```rust
struct SharedMyDataStruct {
    rc: Rc<[u8]>, // or some kind of reference counting
    data: MyDataStruct<'self>,
}
```
such that the `MyDataStruct` borrows from the `rc`. We can then cache this buffer in the data provider as necessary. The data provider can also cache `MyDataStruct` objects alongside the buffer to avoid reparsing.
Of course, `'self` doesn't exist in Rust.
Now, we can implement `SharedMyDataStruct` using unsafe code to not need `'self`. Ultimately this just needs to behave like a `MyDataStruct<'self>` with a tagged-on destructor. This is easy enough:
```rust
struct SharedMyDataStruct {
    rc: Rc<[u8]>, // or some kind of reference counting
    data: MyDataStruct<'static>, // the 'static is a lie
}

impl SharedMyDataStruct {
    fn get<'a>(&'a self) -> &'a MyDataStruct<'a> { // note the lifetime!
        // transmute the lifetime of &self.data
    }
}
```
It's not safe to return a `MyDataStruct<'_>` with a lifetime other than `'a` there (since we do not statically know the lifetime of the backing buffer; we just know that it is at least `'a`).
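As a concrete, runnable illustration of the pattern, here is a minimal sketch with a single borrowed `&str` standing in for a full zero-copy data struct (the type and method names are hypothetical):

```rust
use std::rc::Rc;

struct SharedStr {
    // Borrows from `rc`; the 'static is a lie.
    data: &'static str,
    rc: Rc<[u8]>,
}

impl SharedStr {
    fn new(rc: Rc<[u8]>) -> Self {
        let s: &str = std::str::from_utf8(&rc).expect("valid UTF-8");
        // SAFETY: the heap buffer behind `rc` never moves, `rc` lives
        // exactly as long as `self`, and `get()` only ever re-borrows
        // `data` at the lifetime of `&self`.
        let s: &'static str = unsafe { std::mem::transmute::<&str, &'static str>(s) };
        SharedStr { data: s, rc }
    }

    fn get<'a>(&'a self) -> &'a str { // note the lifetime!
        self.data
    }
}
```

Note that the `Rc`'s heap allocation does not move when the `Rc` handle itself is moved into the struct, which is what makes the self-reference sound here.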
This is all well and good. Except this can't be done generically -- we will have many data struct types and it would be a huge pain to write this struct for each of them. What we want is:
```rust
struct SharedData<T> {
    rc: Rc<[u8]>, // or some kind of reference counting
    data: T<'static>, // the 'static is a lie
}

impl<T> SharedData<T> {
    fn get<'a>(&'a self) -> &'a T<'a> { .. }
    fn get_mut<'a>(&'a mut self) -> &'a mut T<'a> { .. }
}

// some kind of `Deserialize` impl that can construct this object zero-copy?
```
except of course this isn't expressible, not without GATs. In current Rust there is no way to generically talk about "`T`, but with a different lifetime" -- a `T: 'a` bound constrains `T` but doesn't swap lifetimes.
We can, however, manually build this scaffolding:
```rust
unsafe trait LifetimeTransform<'a>: 'static {
    type Output: 'a; // MUST be `Self` with the lifetime swapped

    // used by `SharedData::get()`
    fn transform(&'a self) -> &'a Self::Output;
    fn transform_mut(&'a mut self) -> &'a mut Self::Output;

    // used for deserialization. Safety constraint: `Self` must be
    // destroyed before the data `Self::Output` was deserialized
    // from is
    unsafe fn make(output: Self::Output) -> Self;
}
```
```rust
unsafe impl<'a> LifetimeTransform<'a> for MyDataStruct<'static> {
    type Output = MyDataStruct<'a>;
    // impls are transmutes
}
```
The trait is `unsafe` to implement since we're imposing an additional constraint on what we expect `Output` to be. It might be possible to keep the trait safe and just require unsafe in the implementations, but it's easier to err on the side of caution here.
`SharedData` becomes:
```rust
struct SharedData<T> {
    rc: Rc<[u8]>, // or some kind of reference counting
    data: T, // `T` is the `'static` form of the data struct; the 'static is a lie
}

impl<T: for<'a> LifetimeTransform<'a>> SharedData<T> {
    fn get<'a>(&'a self) -> &'a <T as LifetimeTransform<'a>>::Output { .. }
    fn get_mut<'a>(&'a mut self) -> &'a mut <T as LifetimeTransform<'a>>::Output { .. }
}

// some kind of `Deserialize` impl that can construct this object zero-copy
// using `LifetimeTransform::make()`: zero-copy deserialize a `T::Output`,
// `LifetimeTransform::make()` it, and wrap it in a `SharedData<T>` with
// the correct `Rc`
```
Firstly, we can swap the manual impl of `LifetimeTransform` with `#[derive(LifetimeTransform)]`. This custom derive will just use transmutes, but will also assert that all of the child fields have `LifetimeTransform` implementations that conform to the constraints.
We do care about patching, and `SharedData<T>` is patchable on its own. We'd make `DataPayload<T>` be a thin wrapper around `SharedData<T>`.
We can allow for non-`Rc`-backed data by simply making `SharedData<T>` use a nullable `rc` field. We can also support `Arc` by making that field an enum.
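That field could look something like the following (a hypothetical sketch; the enum and function names are not from the actual design):

```rust
use std::rc::Rc;
use std::sync::Arc;

// Hypothetical backing-buffer field for `SharedData<T>`: fully owned data
// needs no buffer at all, and we can pick Rc or Arc per use case.
enum Backing {
    None,           // data is fully owned; nothing extra to keep alive
    Rc(Rc<[u8]>),   // single-threaded reference counting
    Arc(Arc<[u8]>), // thread-safe reference counting
}

fn strong_count(backing: &Backing) -> usize {
    match backing {
        Backing::None => 0,
        Backing::Rc(rc) => Rc::strong_count(rc),
        Backing::Arc(arc) => Arc::strong_count(arc),
    }
}
```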
We may eventually need a `#[derive(IntoFullyOwned)]` custom derive that can deep-copy-on-write to construct fully owned (`'static`) data structs.
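For a struct with a single `Cow` field, such a derive would effectively expand to something like this (a hypothetical sketch; the method name `into_fully_owned` is made up for illustration):

```rust
use std::borrow::Cow;

struct MyDataStruct<'s> {
    cow: Cow<'s, str>,
}

impl<'s> MyDataStruct<'s> {
    // Hypothetical expansion of #[derive(IntoFullyOwned)]: deep-copy any
    // borrowed field so that no reference to the backing buffer remains.
    fn into_fully_owned(self) -> MyDataStruct<'static> {
        MyDataStruct {
            cow: Cow::Owned(self.cow.into_owned()),
        }
    }
}
```

`Cow::into_owned` only copies when the data was actually borrowed, which is what makes this "deep copy *on write*".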
This gets rid of `'s` and `'d`! No more gnarly lifetimes! We also get `Arc` support if necessary.

One open question: is `get_mut()` actually safe? I have this feeling it should return `Pin<&mut T::Output>`, but I can't figure out a reason why. It should be possible to write to the `T`; the borrowed data can only be replaced with owned data, not overwritten.
Fixed by #745
We use `'s` in data structs to allow them to borrow data via zero-copy deserialization. If data is statically allocated, we simply point to that pre-loaded buffer, as ICU4C does now. However, how do we handle ownership for dynamically loaded data, such as from a filesystem, RPC service, or network request?

Currently, we always return data structs that own their data (`'s: 'static`), due to #632. When we fix #632, we will need to have a story for how data providers that dynamically load data, like FsDataProvider, fit into the picture.

We also must think about how `'d`, the lifetime of the data structs themselves, fits into the picture.

I see ~~two~~ three directions.
### Option 1: FsDataProvider owns the data

Option 1 means FsDataProvider would lazy-load data into memory and then return pointers to it; `'s` equals the lifetime of the FsDataProvider.

This is challenging because a singleton FsDataProvider may have unbounded memory growth: we cannot "delete" old data, since we can't know whether someone still has a reference to it. (See Option 3 for a solution based on reference counting.)

We could limit the scope of this problem by recommending or requiring certain usage patterns.

The other problem with this option is that it basically turns FsDataProvider into an (unbounded) data cache. I was hoping to keep the data cache as a separate component.
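To make the "unbounded cache" problem concrete, here is a minimal sketch of Option 1 (hypothetical types; the real FsDataProvider is more involved): the provider lazily loads blobs and then owns them forever, handing out references tied to itself.

```rust
use std::collections::HashMap;

// Hypothetical sketch of Option 1: every blob ever served stays resident,
// because a caller may still hold a reference into `loaded`.
struct FsDataProvider {
    loaded: HashMap<String, Vec<u8>>,
}

impl FsDataProvider {
    fn load(&mut self, key: &str) -> &[u8] {
        self.loaded
            .entry(key.to_string())
            // a real provider would read the file for `key` from disk here
            .or_insert_with(|| key.as_bytes().to_vec())
    }
}
```

Nothing in this design lets `loaded` shrink safely, which is exactly the eviction problem described above.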
### Option 2: Data struct owns the data

Option 2 means FsDataProvider would only ever return data that is owned by the data struct, such that the data provider is not tied to the lifetime of the data structs it returns.

This is challenging because after #632 is fixed, we would need a way to programmatically convert a data struct to owned (`'s: 'static`). We would need to make a trait, say `IntoOwned`, that converts data structs into `'static`, and implement that trait on all data structs and all types that are encapsulated by data structs (like ZeroVec). We would want to make sure that `IntoOwned` is as efficient as possible, since this would be on a critical path in data loading.

Option 2 could be paired with a cache to help with performance and avoiding repeated loads of data. However, doing so would preclude the possibility of returning `'d` (borrowing the data struct), because then we fall into the same trap as Option 1.

### Option 3: Reference count the data
With this solution, we remove the `'s` lifetime from the data provider and "replace" it with a reference-counted buffer. @Manishearth explains how this could work in a reply farther down in this thread.

With reference-counted buffers, data caches can safely drop data without invalidating downstream users of that data, because the downstream users will continue to retain an `Rc` to the data. Once the downstream users are dropped, the reference count will reach 0 and the buffers will be deallocated.

There are two locations where we can add the `Rc`s:

- Option 3.1: put the `Rc<[u8]>` inside the fields of the data struct
- Option 3.2: store the `Rc<[u8]>` alongside the data struct and use self-references

Option 3.1 has several problems, not the least of which is that it forces the `Rc`s to "infect" all fields of the data struct, which likely comes with a performance impact. The only real disadvantage of Option 3.2 is that self-references require a bit of unsafe code in Rust that may be difficult to reason about. I prefer Option 3.2, and taking the time to verify its correctness.

With this solution, we would also drop `'d`, because calling `.clone()` on a data struct will be a cheap operation, since it won't require allocating any memory.

### Summary
~Option 2 fits best with my mental model. However, we may want to support both Options 1 and 2, because they are useful in different cases. For example, a data provider that downloads a whole language pack may wish to adopt Option 1, whereas one that pulls from an RPC service may wish to use Option 2.~
UPDATE: I now think that Option 3.2 is a workable solution that solves our known problems.
CC @Manishearth @nciric