unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.38k stars 176 forks source link

Should compiled data constructors be on the owned or borrowed type? #5440

Open sffc opened 2 months ago

sffc commented 2 months ago

This came up in https://github.com/unicode-org/icu4x/pull/5413#discussion_r1724947448.

On types that have an owned and a borrowed variant, should the compiled data constructors build the owned or the borrowed variant? For example, should it be:

// Option 1: all constructors on owned type
Foo::new() -> FooBorrowed<'static>
Foo::default() -> Foo
Foo::try_new_unstable(...) -> Result<Foo>

or

// Option 2: constructor on borrowed type; Default impl available
FooBorrowed::new() -> FooBorrowed<'static>
Foo::default() -> Foo
Foo::try_new_unstable(...) -> Result<Foo>

or

// Option 3: constructors on both types
FooBorrowed::new() -> FooBorrowed<'static>
Foo::new() -> Foo
Foo::default() -> Foo
Foo::try_new_unstable(...) -> Result<Foo>

I think Foo::default() -> Foo should always be available because that's the way to get an actual Foo from compiled data, and clients might need an actual Foo if the object could be created multiple different ways.

I don't have a super strong opinion between options 1, 2, and 3.

@Manishearth @zbraniecki @hsivonen @robertbastian @nekevss

Manishearth commented 2 months ago

I have previously been unhappy that the properties constructors only let you get static borrowed types, and would prefer that we have cheap ways to get owned types compatible with ones loaded from data structs. It's kinda annoying to have this split in types.

But it also seems to work well. It's unclear where we should be drawing the line.

sffc commented 2 months ago

For the properties types, my Default solution won't work, because there is not a single constructor for those types. So we should probably come up with a general solution that doesn't depend on Default.

Here's another option

FooBorrowed::new() -> FooBorrowed<'static>
FooBorrowed::static_to_owned(self) -> Foo

Foo::try_new_unstable(...) -> Result<Foo>
Foo::as_borrowed(&self) -> FooBorrowed
sffc commented 2 months ago

Here's another question. In 2.0, should the types be named Foo and FooBorrowed, or should they be named FooOwned (or FooData) and Foo?

Manishearth commented 2 months ago

Hmm, if the borrowed types are the main ones it does make sense to have them be the primary ones.

But the borrowed types are the main ones only for singleton types. Are we okay with there being Foo<'static>/FooOwned (singleton) and Foo (and potentially FooBorrowed<'a>) for non-singletons? I think that's fine, we would rarely be doing FooBorrowed anyway for non-singletons.

sffc commented 2 months ago

Collator is the main example of a type that would be locale-dependent but might benefit from a Borrowed version.

We can make databake capable of returning the right thing even with a fallible try_new on the borrowed type. It's more work, though, and I don't really want to block 2.0 on that.

It's also more work, but since it seems like we probably want to generally move in this direction, we can just consistently apply this pattern everywhere:

pub struct FooFormatterData {
    payload: DataPayload<DataStructV1Marker>
}

pub struct FooFormatter<'a> {
    payload: &'a DataStructV1<'a>
}

impl FooFormatterData {
    pub fn get(&self) -> FooFormatter { ... }
}

impl<'a> FooFormatter<'a> {
    pub fn clone_with_data(&self) -> FooFormatterData { ... }
}

impl FooFormatter<'static> {
    pub fn static_to_owned(&self) -> FooFormatterData { ... }
}

Maybe for 2.0 we stick with the current naming, and in 3.0 we can consider switching things around, only because this wasn't on the table initially for 2.0 and we can't keep adding things to that milestone.

I'm pretty sure that, names aside, we can do this as an incremental migration. We can create FixedDecimalFormatterBorrowed and with whatever functions it needs, and the owned type continues to have direct formatting functions. Then in 3.0 we can remove the format functions from the owned type and do the project-wide renaming.

In hindsight this design seems pretty obvious and I don't know why we didn't consider it sooner.

sffc commented 2 months ago

@robertbastian prefers keeping the four constructors together in the docs. @robertbastian therefore prefers:

// Always:
Foo::new() -> FooBorrowed<'static>
Foo::try_new_unstable(...) -> Result<Foo>
Foo::as_borrowed() -> FooBorrowed

@sffc points out:

  1. It might be necessary for FooBorrowed to have its own constructor if the type Foo is feature-gated
  2. I think it is slightly strange/unconventional for a constructor on T to return something that is not T

@robertbastian says we could add FooBorrowed::new() if Foo is feature-gated.

Manishearth commented 2 months ago

You never want to add a function that only shows up when a feature is disabled, then global feature resolution will break code relying on the disabled crate.

sffc commented 2 months ago

One thing @robertbastian said which I agree with is that Foo::new() -> Foo is a bad function because it unnecessarily wraps the static ref in the enum (DataPayload) which leads to less efficient code.

I think we need FooBorrowed::new() since Foo could be feature-gated in some cases.

As noted previously, I think we should stick with the Foo/FooBorrowed names in 2.0, and possibly revisit them in 3.0.

So I think the choices are:

// Option 1 (2024-09-03): Foo::new() and FooBorrowed::new()
Foo::new() -> FooBorrowed
Foo::try_new_unstable() -> Foo
// Always have this function available:
FooBorrowed::new() -> FooBorrowed

// Option 2 (2024-09-03): Foo::new() and optional FooBorrowed::new()
Foo::new() -> FooBorrowed
Foo::try_new_unstable() -> Foo
// Add this function only when Foo actually gets feature-gated:
FooBorrowed::new() -> FooBorrowed

// Option 3 (2024-09-03): FooBorrowed::new() only
Foo::try_new_unstable() -> Foo
FooBorrowed::new() -> FooBorrowed

Either way, I think we should always have FooBorrowed::static_to_owned().

I can see pros and cons of all these choices so I won't weigh in yet. I'll see what others think.

Manishearth commented 2 months ago

Not opposed to 1 or 3. Opposed to 2, we shouldn't have any APIs that disappear when you enable a feature gate, that gets very brittle in larger deptrees. I guess we could do some fancy doc(hidden) stuff but most people are unlikely to browse ICU4X docs without --all-features..

No strong opinion between 1 and 3, provided 3 has sufficient documentation pointing things to each other.

sffc commented 2 months ago

I think the intent with 2 is that we create the function in code at the same time as we create the feature that gates the owned type, and otherwise we don't have the function available.

Manishearth commented 2 months ago

Ah! Okay, that's acceptable to me. No strong opinion.

hsivonen commented 2 months ago

My preference would be to give the primary name to the borrowed version, because it's the version that baked-data code will use exclusively e.g. with the normalizer, and the version that will have more usage lines of code even in apps that don't use baked data.

I prefer the constructors for the borrowed types to be on the borrowed types themselves, because e.g. with the normalizer, putting the constructors on the owned types would create a strange reason to have to refer to the owned type in baked-data code when there would be no need to even refer to the owned type in baked-data code if the constructor was on the borrowed type.

I'm currently working on making the collator have a borrowed variant, and the collator is a bit of a special case, because it both has obviously statically-known data (the normalizer data and the root collation) as well as data that gets decided at run time (even if choosing from various baked options).

robertbastian commented 2 months ago

I prefer the constructors for the borrowed types to be on the borrowed types themselves, because e.g. with the normalizer, putting the constructors on the owned types would create a strange reason to have to refer to the owned type in baked-data code when there would be no need to even refer to the owned type in baked-data code if the constructor was on the borrowed type.

This is why we have discussed having the constructor on both. Having it only on the borrowed type means the rustdoc for the owned type (which I consider the main entry type right now) would only show 3/4 constructors. If it is needed on the borrowed type, e.g. because the owned type is cfg-ed, then we can have it on both.

Manishearth commented 2 months ago

(which I consider the main entry type right now)

A discussion I had with @sffc the other day was whether we should be changing this. For singletons I definitely feel like we should be doing FooOwned and Foo, with the borrowed version being the main entry type.

For non-singletons the discussion gets more interesting: do we think most users are going to be using compiled data? if so, the borrowed version still could be the primary entry point with the nicer name, and FooOwned becomes "nice helper type if you need Yoke behavior"

robertbastian commented 2 months ago

What would be cool is if there was only a single type (borrowed), the compiled constructor would return Foo<'static>, and the owned constructors would return something like Owned<Foo<'static>>, which would deref to Foo<'static>.

Manishearth commented 2 months ago

That's kind of what Yoke can do.

It wouldn't be able to deref to Foo<'static> though.

sffc commented 2 months ago

I don't think there's generally a way to make Owned<Foo<'static>> because

  1. There could be multiple DataPayload fields and some fields that are not DataPayload
  2. Might require map_project because the borrowed type might smaller than the data struct

My position is that I think we should move toward having FooData (or FooOwned) and Foo<'a> everywhere, even on non-singleton types, but not in 2.0 because that is a really big change.

A middle ground would be to use this naming scheme only for types that currently have owned and borrowed variants, but I think this middle ground is bad because:

  1. Creates two highly arbitrary classes of types, when really all ICU4X types have generally the same behavior, just some have the Borrowed optimization applied to them
  2. We might migrate more things to be owned/borrowed in 2.x, and now we have a patchwork of things with different names and different meanings

To resolve this thread, I am happy with "Option 2 (2024-09-03)"

sffc commented 2 months ago

Observation: if we have Foo::new() -> Foo and then we migrate Foo to have a borrowed variant, we won't be able to change the function signature in 2.x. We'd need to deprecate that function and create Foo::new_borrowed() or something.

hsivonen commented 2 months ago

I think it's unfortunate if we don't end up switching from FooBorrowed and Foo to Foo and FooOwned (or FooData) before 2.0, but also I'm not volunteering to make the change.

sffc commented 1 month ago

I don't see a clear consensus on this thread and it seems we are not on the same page so I tagged it with discuss-priority

sffc commented 1 month ago

@zbraniecki should weigh in here, too.

sffc commented 1 month ago

Some discussion:

robertbastian commented 3 weeks ago

Discussion with @sffc @zbraniecki @robertbastian

In the short term:

LGTM: (@sffc) @robertbastian (@zbraniecki)

Could consider for later:

LGTM: @sffc @robertbastian

Manishearth commented 3 weeks ago

LGTM on both. Would really like to make compiled static data be the happy path