Data key thoughts - Githubissues

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

https://icu4x.unicode.org

Other

1.39k stars 180 forks source link

Data key thoughts #4991

Open robertbastian opened 6 months ago

robertbastian commented 6 months ago

Current architecture:

pub trait DataMarker { ... }
pub trait DynamicDataProvider<M: DataMarker> { load(&self, DataKey, DataRequest) }

pub trait KeyedDataMarker: DataMarker { const KEY: DataKey; }
pub trait DataProvider<M: KeyedDataMarker> { load(&self, DataRequest) }

Data keys do not appear anywhere in the DataProvider paths, they are only used in DynamicDataProvider
- BufferProvider/AnyProvider are functionally DynamicDataProvider<BufferMarker/AnyMarker>
Conceptually, they can be seen as dyn KeyedDataMarker
Users never see data keys, in the API bounds they see KeyedDataMarkers
- Yet, we make them select data keys in datagen

Proposal 1: Align data marker names with data provider names

pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataKey, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const KEY: DataKey; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }

Proposal 2: Get rid of "data key" terminology

Instead, call it what it really is: an erased DataMarker (new terminology)

pub trait DynamicDataMarker { ... }
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

pub trait DataMarker: DynamicDataMarker { const INFO: DataMarkerInfo; }
pub trait DataProvider<M: DataMarker> { load(&self, DataRequest) }

This also includes replacing any "data key" terminology in datagen by "data marker", as we'll have DataMarkerInfos on the API

Proposal 3: `DataMarkerInfo.path` becomes `DataMarkerInfo.name`

To me the utility of the path is not clear (it probably originated from FS provider)
Instead of the paths we store the name of the DataMarker (new terminology) in the tagged string.
If a client sees AndListV1Marker on the API, they can tell datagen to generate "AndListV1Marker" (currently they have to figure out that it's "list/and@1")
In error messages, the marker name is more useful than the path, because mapping this in reverse is basically impossible for clients (and I struggle with this as well)
Blob provider doesn't use the path
Baked provider uses it for macro names, but it should have just used the marker names
FS provider uses this for directory structure, but that's not a functional requirement; data can live at AndListV1Marker/en.json instead of list/and@1/en.json.
We might want to use full paths for the name (icu_list::provider::AndListV1Marker), FS provider can then do icu_list/AndListV1Marker/en.json

Proposal 4: Split `DataMarkerInfo`'s fields (maybe)

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}

Having IS_SINGLETON closer to the type system could allow further optimisations
- Might require pub trait SingletonDataMarker: DataMarker instead
The NAMEs can be optimised out if nothing uses them. The hash would live inside the name, so blob provider would still use the name, and static slicing would still work. However, in non-dynamic-data-provider binaries without debug logging, the names can be removed
In the dynamic code path, we'd need to replace the DataKeyInfo parameter by three parameters, so that's a downside to proposal

sffc commented 6 months ago

Initial thoughts:

OK with Proposal 1
This is a somewhat radical change to the mental model, but I can buy into it because I've never been a huge fan of the term "key", and you're right that "data marker" and "data key" refer to very similar concepts
I agree with the problems you laid out, but I also feel that the directory-like structure has been useful to us. I think if we did Proposal 3 we should do the variant where we use the fully qualified marker name, maybe eliding the "provider" since it will be the same for all markers. icu_list/AndListV1Marker
I think I do prefer inlining the fields to the DataMarker trait directly.

sffc commented 6 months ago

We could have both inlined const fields and a helper DataMarkerInfo

pub trait DataMarker: DynamicDataMarker { 
  const NAME: DataMarkerName;
  const IS_SINGLETON: bool;
  const FALLBACK_CONFIG: LocaleFallbackConfig;
}

#[non_exhaustive]
pub struct DataMarkerInfo {
  pub name: DataMarkerName;
  pub is_singleton: bool;
  pub fallback_config: LocaleFallbackConfig;
}

impl DataMarkerInfo {
  pub fn from_marker<M: DataMarker>() -> Self { /* ... */ }
}

pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }

But I guess if you do that you might as well just make DataMarkerInfo a const field, as in Proposal 3.

sffc commented 6 months ago

@Manishearth - Overall I like merging the names. But the "Marker" name is overloaded.
@sffc - Marker is a common name that's used for the type-system level objects in Rust. I think there's some risk of overloading "marker" but I'm not too worried. THere are ways to make clear whether a thing is a DataMarker or some other type of marker.
@sffc - Maybe we can talk about the path structure.
@robertbastian - https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/registry.rs
@sffc - The Rust types already have a hierarchy. It would be nice to reflect that hierarchy in the string representation. But we can probably cut off the icu:: and the ::provider:: and then snake case the rest.
@robertbastian - I don't like modifying the rust path, because if you then get such a string in an error message, you still have to reverse the mapping, even if it's a more regular mapping than now
@robertbastian - We can use the Rust stringify! macro if we keep the exact same path. It also makes them more searchable.
@robertbastian - There's also https://github.com/unicode-org/icu4x/issues/4193

Summary of additional discussion later:

@robertbastian values converging on a single identifier. The dueling identifiers (Rust type name and string path) harm understandability, usability, and searchability.
@sffc is favorable toward using a single identifier. @Manishearth is neutral or slightly favors having two identifiers since they service different use cases.
@sffc values hierarchy and brevity. The hierarchy enables use cases such as filtering and diagnostics, and it makes it easier and less error-prone to write custom datagen logic. The brevity directly impacts binary size.
@sffc puts a higher value on these properties than on the Rust names and would prefer names such as DateTime_Year_Chinese_V1 which are both hierarchical and brief, despite not being conventional Rust names. @sffc argues that these are weird types that only appear in trait bounds so it is fine to establish a new paradigm for how we name them. "V1Marker" is already a paradigm; we're swapping one for another.
@robertbastian and @Manishearth put a higher value on Rust conventional type names. @robertbastian points out that filtering can still be implemented using custom application logic to parse the camel case. @Manishearth acknowledges that filtering gets more difficult.
@robertbastian proposed icu::datetime::data::year::ChineseV1 as the identifier. @sffc is potentially okay with this compromise, since it has hierarchy and filterability, but is disappointed that it sacrifices brevity.

Manishearth commented 6 months ago

But the "Marker" name is overloaded

to expand on this a little bit (but not to rabbithole too deep because I don't care about "Marker" strongly): I find Marker to be a good default, but it's nice when it can be replaced with a useful concept name instead. I think we may have options here.

one thing I proposed in that discussion that isn't included here was encoding the heirarchy in the trait, so you can have something like:

impl DataMarker for DaySymbolsV1Marker {
   const COMPONENT = "datetime";
   const SUBCOMPONENT = Some("symbols");
}

sffc commented 5 months ago

Notes from the scratch pad:

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: CollatorOptions
) -> Result<Self, DataError>
where
    D: DataProvider<CollationSpecialPrimariesV1Marker> + DataProvider<CollationDataV1Marker> + DataProvider<CollationDiacriticsV1Marker> + DataProvider<CollationJamoV1Marker> + DataProvider<CollationMetadataV1Marker> + DataProvider<CollationReorderingV1Marker> + DataProvider<CanonicalDecompositionDataV1Marker> + DataProvider<CanonicalDecompositionTablesV1Marker> + ?Sized,

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: CollatorOptions
) -> Result<Self, DataError>
where
    D: DataProvider<Collation_SpecialPrimaries_V1> + DataProvider<Collation_Data_V1> + DataProvider<Collation_Diacritics_V1> + DataProvider<Collation_Jamo_V1> + DataProvider<Collation_Metadata_V1> + DataProvider<Collation_Reordering_V1> + DataProvider<Collator_CanonicalDecomposition_Data_V1> + DataProvider<Collation_CanonicalDecomposition_Tables_V1> + ?Sized,

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
    D: DataProvider<DecimalSymbolsV1Marker> + ?Sized

pub fn try_new_unstable<D>(
    provider: &D,
    locale: &DataLocale,
    options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
    D: DataProvider<Decimal_Symbols_V1> + ?Sized

Collation_CanonicalDecomposition_Tables_V1
TimeZone_IanaMapper_ToBcp47_V2 or TimezoneIanamapperTobcp47V1Marker
TimeZone_IanaMapper_ToIana_V1
TimeZone_Metazone_Period_V1
TimezoneIanaToBcp47V1, TimezoneBcpToIanaV1, icu::timezone::provider::iana_mapper::ToIanaV1Marker

Current: DangiDatePatternV1Marker + “datetime/patterns/dangi/date@1” Shane Proposes: DateTime_Patterns_Date_Dangi_V1 Rob Proposes: DatetimePatternsDateDangiV1Marker, icu::datetime::provider::patterns::date::DangiV1Marker

where: P: DataProvider<patterns::date::DangiV1Marker> + DataProvider<patterns::date::GregorianV1Marker> ...

where: P: DataProvider<DateTime_Patterns_Date_Dangi_V1> + DataProvider<DateTime_Patterns_Date_Gregorian_V1> ...

icu::datetime::provider::patterns::date::DangiV1Marker

datetime/patterns/date/DangiV1Marker/en.json

datetime/patterns/date/dangi@1/en.json

sffc commented 2 months ago

Discussion:

@Manishearth Users typically don't use the string key, but it has its use cases and might show up in errors from time to time. I think the status quo is fine. I don't think it's a good idea to make our data loading path depend on the Rust path. I want to be able to move things around in code without changing the data path. I think we've made decent choices in each case.
@hsivonen Do the string keys get stored as the full path somewhere, so if the string key is even longer, we take a binary size hit?
@robertbastian In the latest proposal, ... we use the hash more. The places these show up are in error messages and in fs.
@robertbastian The reason I proposed this is that it is confusing to have to IDs for the same thing. They see in docs ListV1Marker but in datagen they need to select list/and@1. I fixed this by accepting both formulations in datagen. But they still see the path3 in errors.
@sffc We could/should do an audit of our marker and path names. We've landed them but never really took a holistic view. Then we could establish a convention between the Rust IDs and the string IDs.
@sffc If datagen accepts marker names, then the marker names are effectively stable.
@robertbastian Doesn't seem 2.0 critical, except for the audit of the paths.
@robertbastian I think we should consider anything around markers to be unstable.

Some proposals:

Proposal A: Consider marker names to be stable. They can't be renamed in minor releases. They are accepted in datagen.
Proposal B: Consider marker names to be unstable. They can be renamed as we like. The paths remain stable. They are accepted in datagen because if you use the marker names, you are subject to their stability policy.
Proposal C: Same as Proposal B but reject marker names in datagen.

More discussion:

@robertbastian We've never needed to rename a marker. I don't see why we should make our UX worse to reserve this possibility.
@sffc I'm leaning toward Proposal B but not actually renaming anything until we do a proper audit, and in some future release we can do that audit. If we do it in say 2.2, then the 2.2 release notes say so.
@robertbastian People don't usually read point release notes. Things just shouldn't break. If the audit is not important enough for 2.0, maybe it's just not important
@sffc Data slicing and data providers are part of the ICU4X value proposition, so we should make them nice. If we had more resources, I would like to do it in 2.0, but I agree it's a lower priority.
@Manishearth If you're using them, you're using them. No one looks at them holistically other than us.
@robertbastian We should at least do an audit in datetime because they live 5 traits away and the documentation should explain how to figure out which markers you'll need

Conclusion:

Don't rename markers in 2.x. Accept them in datagen.
Could revisit in 3.0 but it might not be a high priority even then.
Perform an audit on neo datetime markers.
Add this auditing to the graduation checklist. For now it is a vague requirement; could evolve into something more specific in the future.

LGTM: @sffc @robertbastian @Manishearth

sffc commented 6 days ago

I am assigned this issue for 2.0 to do an audit of datetime markers. [^1]

The audit, whether in icu_datetime or elsewhere, brings back the data marker naming discussion where we are at an impasse (the lack of an option that has no veto); see https://github.com/unicode-org/icu4x/issues/5276. I do not know how I am expected to perform the audit without having a consensus on how to name things.

I would like to therefore propose the following rules which largely encode the status quo but make a few changes that better enforce consistency and circumvents the casing controversy by avoiding types that would trigger it, like we did for datetime field sets.

Thing	Description	Naming Rule	Example: icu_decimal	Example: icu_datetime
Data struct	A data struct that is the `Yokeable` of a data marker	Based on the data it holds and/or the context in which it is used, with version number, without crate namespace	SymbolsV1	MonthNamesV1
Helper type	A struct or enum that is a field of a data struct but is not itself the `Yokeable` of a data marker	Based on the data it holds and/or the context in which it is used, without version number, with optional crate namespace	DecimalSymbolStrs	PackedPluralElements
Data marker	A non-exhaustive unit struct used in `try_new_unstable` function bounds	Based on the associated data struct. No crate namespace. Suffixed with "Marker". May add an infix before the version number if there are multiple markers using the same data struct. Usually exported directly from the `provider` module in the crate.	SymbolsV1Marker	MonthNamesGregorian V1Marker
Data path	A string associated with a data marker which forms the basis for the hash key and filesystem paths	Directly derived from the data marker name by converting it to snake case, changing `V1Marker` to `@1`, and adding the crate name prefix without `icu_`. If there is an infix (which we can detect by comparing the data struct name to the data marker name), the infix is separated by a slash instead of an underscore in the path.	decimal/symbols@1	datetime/month_names/ gregorian@1
Bounds in ctors	The type that is written in the bound of `try_new_unstable` and shows up in docs	If the marker is in the same crate as the ctor, import it. If the marker is in a different crate, use the fully qualified path.	icu_decimal::provider:: SymbolsV1Marker (fully qualified only if cross-crate)	icu_datetime::provider:: MonthNamesGregorian V1Marker (fully qualified only if cross-crate)
Datagen arguments	How data markers can be specified as arguments to icu4x-datagen	Path names only. No marker names.	decimal/symbols@1	datetime/month_names/ gregorian@1

There's a lot to digest in that table, but I think we can only have a productive discussion holistically.

I'll highlight a few key changes in this proposal:

Marker names no longer have a crate prefix. This is done to circumvent the casing controversy.
As a result, marker names are no longer accepted in icu4x-datagen, rolling back https://github.com/unicode-org/icu4x/pull/5060
A handful of types will undergo name changes in this proposal

Obviously @robertbastian isn't here to give feedback on this, but I don't know any other solution that is actionable. The alternative is "drop the datetime audit and leave everything as-is to revisit in 3.0", acknowledging that the status quo is a bit of a mess but that the TC was unable to reach consensus.

[^1]: There aren't that many data markers. The vast majority are the ones in the datetime crate, which we already slated for the audit, and the properties markers, which are generated by macros. So, I'm volunteering to do the audit for all crates in 2.0.

Manishearth commented 6 days ago

I'm in favor of this. I don't necessarily think we need to require qualification for cross crate stuff. Currently because of provider it will lead to long icu_decimal::provider::SymbolsV1 imports which isn't ... great. But it's ultimately fine.

sffc commented 6 days ago

I'm in favor of this. I don't necessarily think we need to require qualification for cross crate stuff. Currently because of provider it will lead to long icu_decimal::provider::SymbolsV1 imports which isn't ... great. But it's ultimately fine.

Part of my thinking was:

We don't have that many cross-crate bounds
The cross-crate bounds only show up in try_new_unstable<P>(...) where P: ..., not in try_new() with crate::provider::Baked: ...
It's kind-of nice that cross-crate is more explicit since it is more likely to require special handling, such as the user needing to fork between multiple bake providers

sffc commented 6 days ago

Thought: we could make the infixed marker names be MonthNamesV1GregorianMarker. Then the path could always be derived directly from the marker name and vice-versa, without having to know the data struct name.

icu_foobar::provider::HelloWorldV1Marker <=> foobar/hello_world@1

icu_foobar::provider::HelloWorldV1LoremIpsumMarker <=> foobar/hello_world/lorem_ipsum@1

In other words, it would be possible to write a fn that maps from type_name::<M>() to the path name and vice-versa. I think this is one of the things @robertbastian would have liked to have seen in a solution here.

Manishearth commented 6 days ago

@sffc "we don't have many cross crate bounds" sells me. And yeah I don't mind the explicitness too much.

Thought: we could make the infixed marker names be MonthNamesV1GregorianMarker. Then the path could always be derived directly from the marker name and vice-versa, without having to know the data struct name.

Yeah this makes a lot of sense to me! I think we prefixed it because we always did, but I actually don't see a strong reason to prefix.

sffc commented 6 days ago

Yeah this makes a lot of sense to me! I think we prefixed it because we always did, but I actually don't see a strong reason to prefix.

Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?

The rule for naming a data marker would be: [Struct]V[Version][Infix]Marker

If we do [Struct][Infix]V[Version]Marker, it is not as easy to figure out what part of the name is Struct and which part is Infix, which matters for the path conversion.

Manishearth commented 6 days ago

Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?

Oh, I see. Hmmmmm. I'd rather have it always be V1Marker but I'm not that picky about it.

sffc commented 6 days ago

Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?

Oh, I see. Hmmmmm. I'd rather have it always be V1Marker but I'm not that picky about it.

I had the thought, "isn't it nice to search for V1Marker and find everything", but then I realized that there are V2Marker, V3Marker, etc. With this new naming convention, you can still grep for V\d\w*Marker.

sffc commented 6 days ago

Also, a possible advantage of this naming scheme is that the data struct name is fully a substring of the data marker name. If the data struct is SymbolsV1, you either stuff Marker or InfixMarker at the end of it, but the prefix will always be the data struct name.

unicode-org / icu4x

Data key thoughts #4991

Proposal 1: Align data marker names with data provider names

Proposal 2: Get rid of "data key" terminology

Proposal 3: DataMarkerInfo.path becomes DataMarkerInfo.name

Proposal 4: Split DataMarkerInfo's fields (maybe)

Proposal 3: `DataMarkerInfo.path` becomes `DataMarkerInfo.name`

Proposal 4: Split `DataMarkerInfo`'s fields (maybe)