unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 175 forks source link

Remove Default impls that load data #5554

Open sffc opened 1 month ago

sffc commented 1 month ago

Currently we have a number of impl Default on classes that have a new() function that load data. The data is usually singleton but doesn't need to be (see SentenceSegmenter for instance).

This is bad because:

  1. The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.
  2. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.
  3. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.
  4. It's also very confusing and misleading to allow anyone to write code such as that in #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

We never had these impls until we added compiled_data constructors, and then we added them only because Clippy was telling us to. I don't think we ever really had a thorough discussion about it.

I hope this is a no-brainer and non-controversial.

@Manishearth @robertbastian @zbraniecki @hsivonen

robertbastian commented 1 month ago
  1. The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.

That's not what that says. It says

A trait for giving a type a useful default value.

Sometimes, you want to fall back to some kind of default value, and don’t particularly care what it is. This comes up often with structs that define a set of options

"This comes up often with structs that define a set of options", but is in no way limited to that. Compiled data is reasonable default data, that's why we're publishing it.

  1. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

So? Some APIs are gated, and that includes impls.

  1. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

It only exists when new has no arguments. In that case we're in the singleton path and there is no work.

  1. It's also very confusing and misleading to allow anyone to write code such as that in https://github.com/unicode-org/icu4x/pull/5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

Call sites can be written using Foo::default() instead. This applies the same way to all types implementing Default: you can write &mut Default::default() when you need to pass a &mut Vec<T>, that's not a reason for Vec not to implement Default.

robertbastian commented 1 month ago

The issue title misrepresents what's happening. There is never any "data loading" in a Default impl. Even in SentenceSegmenter which you use as an example, there is not:

impl Default for SentenceSegmenter {
    fn default() -> Self {
        Self::new()
    }
}

impl SentenceSegmenter {
    pub fn new() -> Self {
        Self {
            payload: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_SENTENCE_BREAK_DATA_V2_MARKER,
            ),
            payload_locale_override: None,
        }
    }
}
Manishearth commented 1 month ago

I don't think I agree with any of these points.

  • The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.

I don't agree with this. I've seen Default for all kinds of types, and the docs seem to agree with that idea, they just give "bag-of-stuff" structs as a prominent example.

  • The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

Seems fine. A lot of things are gated on compiled_data. I don't see Default as special. If people are using the Default impl and it gets disabled, in a different world they'd be calling ::new() and having it be disabled.

  • impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

There's no expectation of Default to be cheap.

  • It's also very confusing and misleading to allow anyone to write code such as that in Concept: TimeZone + FormattableTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

This is a rather subjective style choice and i don't see a need to force it on our clients. They can always call Normalizer::default() if they wish to be explicit (as can we). I'm generally not a fan of blocking off ways of writing code for clients because of our own style preferences.

sffc commented 1 month ago

Ok so it's not noncontroversial

I'm reasonably convinced by the counter-arguments on points 1. On point 4, I'm mainly concerned when it is used for passing these things as unnamed positional arguments, but we can avoid that by not adding such APIs. I'm not convinced by the counter-arguments on points 2 and 3, and there are more I'd like to add:

  1. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

I'll add to this: the fact that it is gated on compiled_data means that it is not a universal fixed default! Like, I don't want people's Default::default() to break when they enable --no-default-features. Default is sometimes used in bounds for other functions, too. Programmers can write code in a style that makes it difficult for them to break out of using the Default impl for a particular type.

  1. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

I used SentenceSegmenter as an example because it has two fields, one singleton and one non-singleton. Currently the non-singleton has no data in und, but if it did, it would need to be assembled. I recall seeing some code landing in maybe Collator or Normalizer that loaded und data from a key with non-singleton data in a const constructor.

To respond to @Manishearth's "There's no expectation of Default to be cheap": my expectation is that it is cheap. For example, it is used in bounds on things like constructing an array and other operations that where the impl is called a large number of times. Is your argument an "in my experience" argument or "this is a documented convention" argument?

New points:

  1. These Default impls are not stable and are not in fact universal default values. They depend on data version, producing different behavior under cargo update. The fact that they only exist under compiled_data could be used as evidence that we believe that these values are not universal.
  2. Not all types have a pub fn new(). There's nothing "special" separating types that support pub fn new() and pub fn new(&DataLocale). It is inconsistent to have these impls on some types but not others. We could add impl Default that loads data for "und" on all other types to add consistency, but that drives home the point that these should not be considered universally default values.
  3. We never approved as a group of adding the Default impls; we only approved adding pub const fn new() functions. So I would really like to take this discussion from the angle of "we don't have these impls; are they motivated to add?" Other than the clippy lint, what is the motivation?
sffc commented 1 month ago

On the point 7: we have generally been conservative with adding impls to our types, and I think it has served us well. We have a very large API surface, and every impl adds cost. The only impl we agreed to universally add has been impl Debug. We add other impls such as PartialEq, Ord, Hash, Clone, ... only when independently motivated. That yardstick of "independently motivated" should apply here, too.

Manishearth commented 1 month ago

I'll add to this: the fact that it is gated on compiled_data means that it is not a universal fixed default! Like, I don't want people's Default::default() to break when they enable --no-default-features. Default is sometimes used in bounds for other functions, too. Programmers can write code in a style that makes it difficult for them to break out of using the Default impl for a particular type.

Right, but if they're doing that they presumably need some default, without Default they'd be calling ::new().

"There's no expectation of Default to be cheap": my expectation is that it is cheap.

Okay, you should not hold on to that expectation because it is not one followed by the ecosystem.

My argument is one based on the ecosystem, and the fact that it is not documented as being cheap. I would consider this something that needs to be explicitly documented were it an expectation.

Worth noting that our Default impls still are pretty cheap, baked data loading is extremely fast. They're just not necessarily entirely devoid of runtime behavior.

5. These Default impls are not stable and are not in fact universal default values. They depend on data version, producing different behavior under cargo update. The fact that they only exist under compiled_data could be used as evidence that we believe that these values are not universal.

I don't consider this to be a requirement of the Default trait either.

  • Not all types have a pub fn new(). There's nothing "special" separating types that support pub fn new() and pub fn new(&DataLocale). It is inconsistent to have these impls on some types but not others. We could add impl Default that loads data for "und" on all other types to add consistency, but that drives home the point that these should not be considered universally default values.

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

7. We never approved as a group of adding the Default impls; we only approved adding pub const fn new() functions. So I would really like to take this discussion from the angle of "we don't have these impls; are they motivated to add?" Other than the clippy lint, what is the motivation?

I can provide the motivation for the clippy lint: Default impls enable a whole bunch of convenient functions like mem::take and stuff. It's annoying when you're working with a type that could have a Default but chooses not to.

I definitely do not think this is a case in which the clippy lint is wrong. My position is that deviating from clippy here is something that needs justification, not the new impl.

sffc commented 1 month ago

I can provide the motivation for the clippy lint: Default impls enable a whole bunch of convenient functions like mem::take and stuff. It's annoying when you're working with a type that could have a Default but chooses not to.

Example developer journey: they are using postcard data. They apply mem::take to a field involving a property value or some other type implementing Default. Everything compiles. Then two months later they realize that they should disable the compiled_data feature. But doing so causes mem::take to fail to compile. They don't fully understand why so they leave it alone. In the end, they are linking both compiled data and postcard data, all implicitly.

Manishearth commented 1 month ago

Then two months later they realize that they should disable the compiled_data feature. But doing so causes mem::take to fail to compile.

This ... seems like exactly what we want to happen? They now are forced to consider where their data should come from. That's good. That should exactly be a factor of the decision to disable the feature.

I don't really think "they don't fully understand so they leave it alone" is that likely. It doesn't seem that confusing. We could add a "enabled by feature foo" tag on the impl if we really want to improve this experience.

sffc commented 1 month ago

This ... seems like exactly what we want to happen? They now are forced to consider where their data should come from. That's good. That should exactly be a factor of the decision to disable the feature.

My point is that they should be forced to consider where their data comes from up front by using an API explicitly documented as "with compiled data", which was key to us agreeing to have the compiled data functions have short, convenient names.

sffc commented 1 month ago

We could add a "enabled by feature foo" tag on the impl if we really want to improve this experience.

The problem with docs on trait impls is that they are not discoverable. Often, Default is used implicitly, as you noted with mem::take, so these docs won't get read often, and by the time they do, the programmer has already accidentally linked the compiled data implicitly.

Manishearth commented 1 month ago

The problem with docs on trait impls is that they are not discoverable

I mean, if you get a "missing Default impl" when you play with features I don't really see why it would be hard to figure out what happened. The docs on the trait impl are just an added bonus.

I'm not convinced of the sketched developer story being particularly onerous.

the programmer has already accidentally linked the compiled data implicitly.

I think that's fine? For singletons if someone is using a type they almost certainly need the data and 99% of users will get the data from the compiled data; the main benefit of loading data from a file for singletons is if it's some large program that wishes to lazily load everything. I can see a potential additional benefit of wishing to load the latest data.

I want most of our singleton users to be able to just derive(Default) where necessary.

People who care about data to that granularity are going to be caeful about bounds anyway.

Just out of curiosity, this is the current list of gated Default impls:

``` [15:10:32] मanishearth@manishearth-glaptop2 ~/dev/icu4x ^_^ $ rg -U "compiled_data.*\nimpl Default for" components/normalizer/src/uts46.rs 38:#[cfg(feature = "compiled_data")] 39:impl Default for Uts46MapperBorrowed<'static> { 141:#[cfg(feature = "compiled_data")] 142:impl Default for Uts46Mapper { components/locale/src/directionality.rs 40:#[cfg(feature = "compiled_data")] 41:impl Default for LocaleDirectionality { components/locale/src/expander.rs 212:#[cfg(feature = "compiled_data")] 213:impl Default for LocaleExpander { components/normalizer/src/properties.rs 51:#[cfg(feature = "compiled_data")] 52:impl Default for CanonicalCompositionBorrowed<'static> { 122:#[cfg(feature = "compiled_data")] 123:impl Default for CanonicalComposition { 196:#[cfg(feature = "compiled_data")] 197:impl Default for CanonicalDecompositionBorrowed<'static> { 463:#[cfg(feature = "compiled_data")] 464:impl Default for CanonicalDecomposition { 553:#[cfg(feature = "compiled_data")] 554:impl Default for CanonicalCombiningClassMapBorrowed<'static> { 644:#[cfg(feature = "compiled_data")] 645:impl Default for CanonicalCombiningClassMap { components/segmenter/src/grapheme.rs 134:#[cfg(feature = "compiled_data")] 135:impl Default for GraphemeClusterSegmenter { components/segmenter/src/sentence.rs 114:#[cfg(feature = "compiled_data")] 115:impl Default for SentenceSegmenter { components/locale/src/canonicalizer.rs 199:#[cfg(feature = "compiled_data")] 200:impl Default for LocaleCanonicalizer { components/casemap/src/titlecase.rs 207:#[cfg(feature = "compiled_data")] 208:impl Default for TitlecaseMapper { components/casemap/src/closer.rs 43:#[cfg(feature = "compiled_data")] 44:impl Default for CaseMapCloser { components/casemap/src/casemapper.rs 41:#[cfg(feature = "compiled_data")] 42:impl Default for CaseMapper { components/timezone/src/windows_tz.rs 41:#[cfg(feature = "compiled_data")] 42:impl Default for WindowsTimeZoneMapper { components/timezone/src/zone_offset.rs 19:#[cfg(feature = "compiled_data")] 20:impl Default for ZoneOffsetCalculator { components/timezone/src/ids.rs 96:#[cfg(feature = "compiled_data")] 97:impl Default for TimeZoneIdMapper { 468:#[cfg(feature = "compiled_data")] 469:impl Default for TimeZoneIdMapperWithFastCanonicalization { components/timezone/src/metazone.rs 21:#[cfg(feature = "compiled_data")] 22:impl Default for MetazoneCalculator { [15:10:45] मanishearth@manishearth-glaptop2 ~/dev/icu4x ^_^ $ rg -U "compiled_data.*\nimpl Default for" components/normalizer/src/properties.rs 51:#[cfg(feature = "compiled_data")] 52:impl Default for CanonicalCompositionBorrowed<'static> { 122:#[cfg(feature = "compiled_data")] 123:impl Default for CanonicalComposition { 196:#[cfg(feature = "compiled_data")] 197:impl Default for CanonicalDecompositionBorrowed<'static> { 463:#[cfg(feature = "compiled_data")] 464:impl Default for CanonicalDecomposition { 553:#[cfg(feature = "compiled_data")] 554:impl Default for CanonicalCombiningClassMapBorrowed<'static> { 644:#[cfg(feature = "compiled_data")] 645:impl Default for CanonicalCombiningClassMap { components/normalizer/src/uts46.rs 38:#[cfg(feature = "compiled_data")] 39:impl Default for Uts46MapperBorrowed<'static> { 141:#[cfg(feature = "compiled_data")] 142:impl Default for Uts46Mapper { components/segmenter/src/grapheme.rs 134:#[cfg(feature = "compiled_data")] 135:impl Default for GraphemeClusterSegmenter { components/segmenter/src/sentence.rs 114:#[cfg(feature = "compiled_data")] 115:impl Default for SentenceSegmenter { components/locale/src/directionality.rs 40:#[cfg(feature = "compiled_data")] 41:impl Default for LocaleDirectionality { components/locale/src/expander.rs 212:#[cfg(feature = "compiled_data")] 213:impl Default for LocaleExpander { components/locale/src/canonicalizer.rs 199:#[cfg(feature = "compiled_data")] 200:impl Default for LocaleCanonicalizer { components/casemap/src/titlecase.rs 207:#[cfg(feature = "compiled_data")] 208:impl Default for TitlecaseMapper { components/casemap/src/casemapper.rs 41:#[cfg(feature = "compiled_data")] 42:impl Default for CaseMapper { components/casemap/src/closer.rs 43:#[cfg(feature = "compiled_data")] 44:impl Default for CaseMapCloser { components/timezone/src/ids.rs 96:#[cfg(feature = "compiled_data")] 97:impl Default for TimeZoneIdMapper { 468:#[cfg(feature = "compiled_data")] 469:impl Default for TimeZoneIdMapperWithFastCanonicalization { components/timezone/src/windows_tz.rs 41:#[cfg(feature = "compiled_data")] 42:impl Default for WindowsTimeZoneMapper { components/timezone/src/metazone.rs 21:#[cfg(feature = "compiled_data")] 22:impl Default for MetazoneCalculator { components/timezone/src/zone_offset.rs 19:#[cfg(feature = "compiled_data")] 20:impl Default for ZoneOffsetCalculator { ```
sffc commented 1 month ago

I think the fundamental disagreement is: we agree as a TC that offering developers a way to switch data sources is fundamental, but we don't agree on the degree to which opting into specific data sources can and should impact developer ergonomics.

I have taken the position that, given its importance to ICU4X's value proposition, we should generally favor being explicit about data sources, even if it takes a hit on developer ergonomics.

The decision to name constructors .new() instead of .new_with_compiled_data() was contrary to that position, but I agreed to it based on two notions: (1) as the first experience with ICU4X and the primary entrypoint into our APIs, this is a uniquely high-impact area for developer ergonomics, and (2) we document the constructors as saying "with compiled data" in the first line, and always link to the docs page explaining what that means.

If I had to guess, @Manishearth is likely to take the position that Default impls are, indeed, a high-impact area for developer ergonomics. I'm happy to debate that point, but, from my perspective, it is moot, because we don't have an alternative on the table for developers making explicit decisions about data sources. With .new(), we were able to move "with compiled data" from the function name to the docs, which was an acceptable though less-than-ideal alternative. We don't have that luxury with Default because its docs are not discoverable and because it is often used implicitly.

sffc commented 1 month ago

Based on the above, I could agree to this policy:

  1. By default, we don't add Default impls that rely on compiled data, because doing so causes compiled data to be used implicitly, going against the value proposition of the project
  2. A Default impl could be added if one of the following cases can be made:
    • The ICU4X class in question uses data that is uniquely static in nature, such that a typical user would choose to use compiled data for it even if they use dynamic data in other parts of their application. (Potential example: stable properties, maybe other properties-driven things like normalizer)
    • The ICU4X class in question is uniquely commonly used in contexts where a Default impl is required for ergonomics. (I can't think of examples of this right now)
Manishearth commented 1 month ago

TLDR: I think the proposed policy works but I think we should talk about what we think of as singletons here a bit more


I think the fundamental disagreement is: we agree as a TC that offering developers a way to switch data sources is fundamental, but we don't agree on the degree to which opting into specific data sources can and should impact developer ergonomics.

Okay, thanks for sketching this out, and for the proposed policy.

I think I see the disagreement here, and it's subtler and in two parts.

ICU4X's initial design didn't include baked data, and as such it was extremely important we gave people a lot of control over switching data sources. Of course, control over data is a core value proposition, but the way in which that control works has changed over time.

Specifically, with the advent of baked data, we've gone out of our way to make baked data easy to use. We could have made baked data need a BakedProvider, but instead we gave everyone convenient constructors, to the extent of sometimes using special Borrowed types to make the baked experience even better. Now, this choice does not necessarily imply that we wish to make baked "as easy to use as possible", there can of course be limits, but it was a signal.

The impression I had, at least from the way prior discussions have gone and especially points made by @robertbastian around the topic of baked data, was that for singletons at least we no longer considered control over data to be so central to the design, mostly because "it makes sense". Rather, we consider it crucial that there be some way to have control over data, and implicitly recognized that people who want control over data for singletons were an ultimately niche (but important!) use case.

This was the position I was coming from when recently discussing https://github.com/unicode-org/icu4x/issues/5440, where my preference was that the "main entry type" for singletons be the baked, borrowed one, with the owned version having the qualified name because it was more niche.

Anyway, hopefully that explains where I stand on "offering developers a way to switch data is fundamental". I think it's fundamental, as you said, but I think specifically for singletons, it is a niche use case. I was also under the impression that this was where everyone else roughly stood, at the very least @robertbastian but I think I've expressed opinions based on this in the past and not received pushback from others. It's potentially worth reaffirming precisely where we stand there. I'm not opposed to a position that data loading is non-niche for every type, that's just not my default (heh) position right now

Secondly, and here is where it gets a bit interesting, I think perhaps we've been dancing around the topic of what is truly a "singleton". Previously in the discussion you brought up SentenceSegmenter and I said:

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

which I was me considering SentenceSegmenter to not be a singleton, though not saying so explicitly perhaps. It's possible that a lot of our types with Default fall into a similar bucket! As I mention later below, I think timezone stuff may be similar.

In your new proposal, you say this:

The ICU4X class in question uses data that is uniquely static in nature, such that a typical user would choose to use compiled data for it even if they use dynamic data in other parts of their application. (Potential example: stable properties, maybe other properties-driven things like normalizer)

I think, to me, this is where I'm coming from when I'm talking about "singletons", both previously in this thread and in the first half of this post. True singletons, to me, are ones where there is only one possible "best" value, which means runtime data loading is going to be extremely rare. The main use cases for those are:

I feel like properties, casemapper, and the auxiliary normalizer types fit in this . The main normalizer types have a split NFC/NFKC constructor, they're ineligible. The properties set types also are not eligible, because we type erase them and have a generic constructor. Locale directionality probably fits here too, and ... maybe segmenters? Probably not? I'm not sure about that one but curious as to what others think.

I think timezone stuff doesn't really fit this: I think "load timezone from system tz files" is going to be a very common use case.

There are more in the list I posted, but this should be a rough sketch of my position. I suspect a productive next step for us is to agree on the policy you stated and work on figuring out what it actually means for something to be a "true singleton" (yes, I recognize that this is a confusing distinction I have potentially just made up, I'd love to hear terminology ideas)

sffc commented 1 month ago

Some discussion:

sffc commented 1 month ago

Here's another point. A Default impl could be one of two things: with-default-data or without-data. We have a small number of classes, such as LocaleFallbacker, that work in both modes.

The argument "it has a new function so clearly there is a default value" is really precarious to apply here because I still stand by my position that ::new() in a pure-principles world would have been named ::new_with_compiled_data() but due to other factors in play we decided on ::new(), but not because it was always considered to be the default value.

Manishearth commented 1 month ago

it has a new function so clearly there is a default value

I think this is a misrepresentation of my position, I don't think you're arguing against something anyone else here agrees with. I explicitly said early in the discussion:

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

I think those are some cases where we should perhaps not have a fn new() in the first place (as you say), but even if we do it is one where I agree there is no single sensible Default.

robertbastian commented 2 weeks ago

Two positions:

More discussion:

Types Shane thinks should not be defualt:

What if: for these types we should call the functions new_something or have options on the new().

(@sffc and @manishearth like this)

Conclusion:

On a case by case bases, if a thing is in a situation where it is:

we do not provide a Default. We also strongly consider not providing a new(/* no arguments */). We may instead provide a new_foo() or a new(options).

Concrete next steps for 2.0:

Agreed: @Manishearth @sffc (@robertbastian) @zbraniecki

Additional action items:

Discussion: