Open robertbastian opened 6 months ago
Initial thoughts:
icu_list/AndListV1Marker
DataMarker
trait directly.We could have both inlined const fields and a helper DataMarkerInfo
pub trait DataMarker: DynamicDataMarker {
const NAME: DataMarkerName;
const IS_SINGLETON: bool;
const FALLBACK_CONFIG: LocaleFallbackConfig;
}
#[non_exhaustive]
pub struct DataMarkerInfo {
pub name: DataMarkerName;
pub is_singleton: bool;
pub fallback_config: LocaleFallbackConfig;
}
impl DataMarkerInfo {
pub fn from_marker<M: DataMarker>() -> Self { /* ... */ }
}
pub trait DynamicDataProvider<M: DynamicDataMarker> { load(&self, DataMarkerInfo, DataRequest) }
But I guess if you do that you might as well just make DataMarkerInfo
a const field, as in Proposal 3.
icu::
and the ::provider::
and then snake case the rest.stringify!
macro if we keep the exact same path. It also makes them more searchable.Summary of additional discussion later:
DateTime_Year_Chinese_V1
which are both hierarchical and brief, despite not being conventional Rust names. @sffc argues that these are weird types that only appear in trait bounds so it is fine to establish a new paradigm for how we name them. "V1Marker" is already a paradigm; we're swapping one for another.icu::datetime::data::year::ChineseV1
as the identifier. @sffc is potentially okay with this compromise, since it has hierarchy and filterability, but is disappointed that it sacrifices brevity.But the "Marker" name is overloaded
to expand on this a little bit (but not to rabbithole too deep because I don't care about "Marker" strongly): I find Marker to be a good default, but it's nice when it can be replaced with a useful concept name instead. I think we may have options here.
one thing I proposed in that discussion that isn't included here was encoding the heirarchy in the trait, so you can have something like:
impl DataMarker for DaySymbolsV1Marker {
const COMPONENT = "datetime";
const SUBCOMPONENT = Some("symbols");
}
Notes from the scratch pad:
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: CollatorOptions
) -> Result<Self, DataError>
where
D: DataProvider<CollationSpecialPrimariesV1Marker> + DataProvider<CollationDataV1Marker> + DataProvider<CollationDiacriticsV1Marker> + DataProvider<CollationJamoV1Marker> + DataProvider<CollationMetadataV1Marker> + DataProvider<CollationReorderingV1Marker> + DataProvider<CanonicalDecompositionDataV1Marker> + DataProvider<CanonicalDecompositionTablesV1Marker> + ?Sized,
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: CollatorOptions
) -> Result<Self, DataError>
where
D: DataProvider<Collation_SpecialPrimaries_V1> + DataProvider<Collation_Data_V1> + DataProvider<Collation_Diacritics_V1> + DataProvider<Collation_Jamo_V1> + DataProvider<Collation_Metadata_V1> + DataProvider<Collation_Reordering_V1> + DataProvider<Collator_CanonicalDecomposition_Data_V1> + DataProvider<Collation_CanonicalDecomposition_Tables_V1> + ?Sized,
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
D: DataProvider<DecimalSymbolsV1Marker> + ?Sized
pub fn try_new_unstable<D>(
provider: &D,
locale: &DataLocale,
options: FixedDecimalFormatterOptions
) -> Result<Self, DataError>
where
D: DataProvider<Decimal_Symbols_V1> + ?Sized
Collation_CanonicalDecomposition_Tables_V1
TimeZone_IanaMapper_ToBcp47_V2
or TimezoneIanamapperTobcp47V1Marker
TimeZone_IanaMapper_ToIana_V1
TimeZone_Metazone_Period_V1
TimezoneIanaToBcp47V1
, TimezoneBcpToIanaV1
, icu::timezone::provider::iana_mapper::ToIanaV1Marker
Current: DangiDatePatternV1Marker
+ “datetime/patterns/dangi/date@1”
Shane Proposes: DateTime_Patterns_Date_Dangi_V1
Rob Proposes: DatetimePatternsDateDangiV1Marker
, icu::datetime::provider::patterns::date::DangiV1Marker
where: P: DataProvider<patterns::date::DangiV1Marker> + DataProvider<patterns::date::GregorianV1Marker> ...
where: P: DataProvider<DateTime_Patterns_Date_Dangi_V1> + DataProvider<DateTime_Patterns_Date_Gregorian_V1> ...
icu::datetime::provider::patterns::date::DangiV1Marker
datetime/patterns/date/DangiV1Marker/en.json
datetime/patterns/date/dangi@1/en.json
Discussion:
ListV1Marker
but in datagen they need to select list/and@1
. I fixed this by accepting both formulations in datagen. But they still see the path3 in errors.Some proposals:
More discussion:
Conclusion:
LGTM: @sffc @robertbastian @Manishearth
I am assigned this issue for 2.0 to do an audit of datetime markers. [^1]
The audit, whether in icu_datetime or elsewhere, brings back the data marker naming discussion where we are at an impasse (the lack of an option that has no veto); see https://github.com/unicode-org/icu4x/issues/5276. I do not know how I am expected to perform the audit without having a consensus on how to name things.
I would like to therefore propose the following rules which largely encode the status quo but make a few changes that better enforce consistency and circumvents the casing controversy by avoiding types that would trigger it, like we did for datetime field sets.
Thing | Description | Naming Rule | Example: icu_decimal | Example: icu_datetime |
---|---|---|---|---|
Data struct | A data struct that is the Yokeable of a data marker |
Based on the data it holds and/or the context in which it is used, with version number, without crate namespace | SymbolsV1 | MonthNamesV1 |
Helper type | A struct or enum that is a field of a data struct but is not itself the Yokeable of a data marker |
Based on the data it holds and/or the context in which it is used, without version number, with optional crate namespace | DecimalSymbolStrs | PackedPluralElements |
Data marker | A non-exhaustive unit struct used in try_new_unstable function bounds |
Based on the associated data struct. No crate namespace. Suffixed with "Marker". May add an infix before the version number if there are multiple markers using the same data struct. Usually exported directly from the provider module in the crate. |
SymbolsV1Marker | MonthNamesGregorian V1Marker |
Data path | A string associated with a data marker which forms the basis for the hash key and filesystem paths | Directly derived from the data marker name by converting it to snake case, changing V1Marker to @1 , and adding the crate name prefix without icu_ . If there is an infix (which we can detect by comparing the data struct name to the data marker name), the infix is separated by a slash instead of an underscore in the path. |
decimal/symbols@1 | datetime/month_names/ gregorian@1 |
Bounds in ctors | The type that is written in the bound of try_new_unstable and shows up in docs |
If the marker is in the same crate as the ctor, import it. If the marker is in a different crate, use the fully qualified path. | icu_decimal::provider:: SymbolsV1Marker (fully qualified only if cross-crate) |
icu_datetime::provider:: MonthNamesGregorian V1Marker (fully qualified only if cross-crate) |
Datagen arguments | How data markers can be specified as arguments to icu4x-datagen | Path names only. No marker names. | decimal/symbols@1 | datetime/month_names/ gregorian@1 |
There's a lot to digest in that table, but I think we can only have a productive discussion holistically.
I'll highlight a few key changes in this proposal:
Obviously @robertbastian isn't here to give feedback on this, but I don't know any other solution that is actionable. The alternative is "drop the datetime audit and leave everything as-is to revisit in 3.0", acknowledging that the status quo is a bit of a mess but that the TC was unable to reach consensus.
[^1]: There aren't that many data markers. The vast majority are the ones in the datetime crate, which we already slated for the audit, and the properties markers, which are generated by macros. So, I'm volunteering to do the audit for all crates in 2.0.
I'm in favor of this. I don't necessarily think we need to require qualification for cross crate stuff. Currently because of provider
it will lead to long icu_decimal::provider::SymbolsV1
imports which isn't ... great. But it's ultimately fine.
I'm in favor of this. I don't necessarily think we need to require qualification for cross crate stuff. Currently because of
provider
it will lead to longicu_decimal::provider::SymbolsV1
imports which isn't ... great. But it's ultimately fine.
Part of my thinking was:
try_new_unstable<P>(...) where P: ...
, not in try_new() with crate::provider::Baked: ...
Thought: we could make the infixed marker names be MonthNamesV1GregorianMarker
. Then the path could always be derived directly from the marker name and vice-versa, without having to know the data struct name.
icu_foobar::provider::HelloWorldV1Marker
<=> foobar/hello_world@1
icu_foobar::provider::HelloWorldV1LoremIpsumMarker
<=> foobar/hello_world/lorem_ipsum@1
In other words, it would be possible to write a fn that maps from type_name::<M>()
to the path name and vice-versa. I think this is one of the things @robertbastian would have liked to have seen in a solution here.
@sffc "we don't have many cross crate bounds" sells me. And yeah I don't mind the explicitness too much.
Thought: we could make the infixed marker names be MonthNamesV1GregorianMarker. Then the path could always be derived directly from the marker name and vice-versa, without having to know the data struct name.
Yeah this makes a lot of sense to me! I think we prefixed it because we always did, but I actually don't see a strong reason to prefix.
Yeah this makes a lot of sense to me! I think we prefixed it because we always did, but I actually don't see a strong reason to prefix.
Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?
The rule for naming a data marker would be: [Struct]V[Version][Infix]Marker
If we do [Struct][Infix]V[Version]Marker
, it is not as easy to figure out what part of the name is Struct
and which part is Infix
, which matters for the path conversion.
Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?
Oh, I see. Hmmmmm. I'd rather have it always be V1Marker but I'm not that picky about it.
Just highlighting that my last reply above was to put the infix between the "V1" and the "Marker"?
Oh, I see. Hmmmmm. I'd rather have it always be V1Marker but I'm not that picky about it.
I had the thought, "isn't it nice to search for V1Marker and find everything", but then I realized that there are V2Marker, V3Marker, etc. With this new naming convention, you can still grep for V\d\w*Marker
.
Also, a possible advantage of this naming scheme is that the data struct name is fully a substring of the data marker name. If the data struct is SymbolsV1
, you either stuff Marker
or InfixMarker
at the end of it, but the prefix will always be the data struct name.
Current architecture:
DataProvider
paths, they are only used inDynamicDataProvider
BufferProvider
/AnyProvider
are functionallyDynamicDataProvider<BufferMarker/AnyMarker>
dyn KeyedDataMarker
KeyedDataMarkers
Proposal 1: Align data marker names with data provider names
Proposal 2: Get rid of "data key" terminology
DataMarker
(new terminology)DataMarkerInfo
s on the APIProposal 3:
DataMarkerInfo.path
becomesDataMarkerInfo.name
path
is not clear (it probably originated from FS provider)DataMarker
(new terminology) in the tagged string.AndListV1Marker
on the API, they can tell datagen to generate"AndListV1Marker"
(currently they have to figure out that it's"list/and@1"
)AndListV1Marker/en.json
instead oflist/and@1/en.json
.icu_list::provider::AndListV1Marker
), FS provider can then doicu_list/AndListV1Marker/en.json
Proposal 4: Split
DataMarkerInfo
's fields (maybe)IS_SINGLETON
closer to the type system could allow further optimisationspub trait SingletonDataMarker: DataMarker
insteadNAME
s can be optimised out if nothing uses them. The hash would live inside the name, so blob provider would still use the name, and static slicing would still work. However, in non-dynamic-data-provider binaries without debug logging, the names can be removedDataKeyInfo
parameter by three parameters, so that's a downside to proposal