unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Should we adopt a more consistent naming convention for traits and marker types? #5276

Closed sffc closed 1 month ago

sffc commented 3 months ago

A lot of my time in rewriting the API in icu_datetime has been spent in figuring out the best name for traits and marker types. This is especially challenging when the best name for a trait happens to also be the best name for the struct implementing the trait!

But, I was thinking, why do we let ourselves have this problem? It would be very nice if we could adopt a naming convention that simply avoids clashes.

Some languages with interfaces use the I prefix to indicate that the name is an interface. We're kind-of doing this currently with the Marker suffix on marker types.

Idea for traits: should be an adjective or adjective phrase:

For markers, I'm still very much in favor of introducing a convention of using title snake case. Why?

  1. Marker types are unlike any other concept: they are not structs as they have no data; they are not enums as they have no variants; they are not traits as they are concrete types; and there are often very many marker types representing variations of the same type of thing. Therefore, conventions that apply to other concepts do not necessarily apply to marker types de novo.
  2. Marker types benefit from a built-in taxonomy system.*
  3. The way that we are using marker types (a whole bunch of symbols implementing traits with associated types) is novel enough in Rust that we are in a position to introduce a convention for their names.
  4. icu::datetime::Formatter<Gregorian, Month_Day_Hour_DayPeriod> is clear. To me, it is also visually appealing.

* What I mean by "built-in taxonomy system" is that the identifiers benefit from being able to group together multiple tokens into a single identifier. This applies to both of the primary use cases we have for markers:

  1. In data markers, they represent a hierarchical grouping:
    • DateTime_Names_Gregory_Year
    • DateTime_Names_Gregory_Month
    • DateTime_Patterns_Gregory_Time
    • DateTime_Patterns_IslamicCivil_SemanticSkeleta
  2. In field set markers, they represent sets of things:
    • Era_Year_Month ("July 2024 AD")
    • Year_Month_Day ("July 22, 2024")
    • Month_Day_DayPeriod ("July 22 in the afternoon")
    • Weekday_DayPeriod_Hour_Minute ("Monday at 6:32 in the evening")

Title snake case has the distinction of being the only naming convention that allows for groups of multi-word tokens. It is camel case for the individual words and snake case for the grouping. This is an objective benefit of this casing convention over others (form follows function).

Putting various casing and naming conventions head to head:

  1. Formatter<Gregorian, Month_Day_Hour_DayPeriod> (clear, no ambiguity)
  2. Formatter<Gregorian, MonthDayHourDayPeriod> (contains ambiguity)
  3. Formatter<Gregorian, MonthDayHourDayperiod> (but "day period" is two words everywhere else)
  4. Formatter<Gregorian, MonthDayHourPeriod> (rename "day period" to "period" because of this limitation, but could conflict with other potential concepts such as a proposed "year period" that includes quarters and trimesters)
  5. Formatter<Gregorian, MDHP> (if you can't win, just abbreviate everything)
  6. Formatter<Gregorian, Mdhp> (I really don't like this one but I'll list it anyway)

I honestly see only advantages of option 1.

I know @robertbastian and @Manishearth have expressed skepticism of this idea, but I do not have a clear record of their arguments. The only position I can recall, and I apologize in advance for probably misrepresenting it, is that it is novel and raises questions when people see it in the docs. (My responses: novel yes, but I argue that novel is okay as stated above; raises questions is also okay because this is a novel concept where prior assumptions about its behavior should be thrown out.) I hope we can at least document those arguments in this thread.

Also seeking input from @echeran, @hsivonen, @zbraniecki, @markusicu, @eggrobin, and anyone else.

Manishearth commented 3 months ago

Rust already has trait naming conventions: https://rust-lang.github.io/rfcs/0344-conventions-galore.html#trait-naming, which suggests transitive verbs, if not verbs then nouns, if not nouns, only then adjectives.

Put differently:

-able trait names are actually not that good in this model: and Queryable sounds extremely Java-y and not very Rusty to me. Writeable sounds fine mostly because we're avoiding clash with Write, but otherwise I do not consider that a shining example of trait naming.

DataProvider seems fine, it's a noun. A better name might be Load named as a verb after its one method but that sounds too generic. There may be better verbs to use here, like Provide or DataProvide or DataLoad. (I think mentioning "Data" is good).

TrieValue is also great in my book. It's a noun, and is great at describing what it does.

I don't see any reason for us to diverge from this by default.

This is especially challenging when the best name for a trait happens to also be the best name for the struct implementing the trait!

But, I was thinking, why do we let ourselves have this problem? It would be very nice if we could adopt a naming convention that simply avoids clashes.

I think we're ending up in this situation when we have a trait and a single struct that is the primary implementor. That is an unusual situation to be in! We should avoid that, and when we can't I think it's fine to be weird about naming on a case by case basis. I do not think changing the naming conventions for the rest of the project to be counter to Rust naming conventions is an appropriately sized reaction to having this problem for a couple types.

Can you provide specific examples of where this has happened? I recall it happening but I can't recall examples and without examples we might be talking in the air.

I think the rule of "use an adjective phrase when this happens" or "use Is and Has when this happens" is a good way to have a default convention for these cases, and I think we've used that trick in the past.

A lot of my time in rewriting the API in icu_datetime has been spent in figuring out the best name for traits and marker types. This is especially challenging when the best name for a trait happens to also be the best name for the struct implementing the trait!

N.b it appears we are talking about multiple kinds of markers here, which I think is worth highlighting specifically.

In general I would like us to not overload "marker" when naming such types and traits (I want "marker" in ICU4X naming to only mean data markers), though I do recognize that it is a standard way to reference this Rust patterns.

The only position I can recall, and I apologize in advance for probably misrepresenting it, is that it is novel and raises questions when people see it in the docs. (My responses: novel yes, but I argue that novel is okay as stated above; raises questions is also okay because this is a novel concept where prior assumptions about its behavior should be thrown out.) I hope we can at least document those arguments in this thread.

My position is that it is sufficiently novel as to be surprising and distracting, in the sense that it draws a lot of attention in the code. Experienced programmers in any language get used to mentally sight-reading code, and syntax highlighting/casing conventions form a part of this process. Every time I have seen code that introduces brand new casing conventions my mental parser short circuits and tunnels in on the identifier. This still happens even when I am fully familiar with the novel casing convention and what it means. I've heard similar things from other people. (Particularly, I'm prone to defaulting to snake case identifiers if I'm not paying attention to the language I'm writing, and people have often told me how jarring it is to see that in languages that are strictly camelcase)

I'd want to see a really good reason to break this expectation and surprise our users. I don't really think "we have many options but none of them is great" is sufficient, this is just not-great in a different way, and a far more novel and surprising one.

I think Option 3 MonthDayHourDayperiod is pretty fine. These are a mess of types that I don't actually expect people to be typing out themselves: "whether P in Period is capital" is not the only question one must answer to be able to type these from memory, you also have to remember the order of the components, which has logic to it but is not necessarily straightforward either.

Put differently: I don't think the predictability of these identifiers is super important. I think their readability is, and as such the lack of consistency with DayPeriod isn't that important there.

form follows function

(I don't really think "form follows function" is particularly useful in a naming context)

robertbastian commented 3 months ago

The only position I can recall, and I apologize in advance for probably misrepresenting it, is that it is novel and raises questions when people see it in the docs. (My responses: novel yes, but I argue that novel is okay as stated above; raises questions is also okay because this is a novel concept where prior assumptions about its behavior should be thrown out.) I hope we can at least document those arguments in this thread.

I think it also raises the question of whether we're doing Rust correctly. The only crates I've seen that don't follow naming conventions are C wrappers, and those often have horrible Rust APIs, are unsafe, etc.

Manishearth commented 3 months ago

Yeah. ICU4X does not do that much weird stuff that we should need to be a special snowflake in these ways. I'm always a bit wary when I find crates that seem to be treating themselves as a special snowflake when it comes to Rust norms: typically their reasons for doing so seem to entirely be inside baseball and irrelevant to me as a user, and it makes me worried as to what other less-obvious norms they are deviating from.

The "surprise" aspect I mentioned is a regular experience of mine reading code interacting with C wrappers, and that's not even a case where there are new naming styles introduced: they just use extant Rust naming styles in the "wrong place" (typically: snake case names for types).

sffc commented 3 months ago

I don't see any reason for us to diverge from this by default.

Ok

We could explore DataLoad or DataProvide, but if DataProvider is conformant with the Rust RFC, then I think let's leave it alone for 2.0 purposes.

Can you provide specific examples of where this has happened? I recall it happening but I can't recall examples and without examples we might be talking in the air.

There are cases where we are getting close in icu_datetime. For example, FieldSet is a struct, but ConstFieldSet is a trait; I was slightly more comfortable with the HasConstFieldSet and IntoRuntimeFieldSet naming, but those were not popular with the rest of the group. DateTimeInput would be a struct, but DateTimeInput is currently the name of the trait we have in 1.0 (the trait is changing to GetField or similar in 2.0).

My position is that it is sufficiently novel as to be surprising and distracting, in the sense that it draws a lot of attention in the code. Experienced programmers in any language get used to mentally sight-reading code, and syntax highlighting/casing conventions form a part of this process. ...

Thanks for verbalizing this.

I see this argument as an "in my experience" type argument. I can't say that I share this experience in mentally parsing code, which could amount to differences in background. In my experience, the most important naming convention for visually parsing code is that types start with capital letters and variables/functions start with lowercase letters, with various exceptions like scream-case constants (and the C# programming language).

I was first introduced to title snake case in my work with Unicode properties, and while it was a bit strange at first, I have come to appreciate it.

I'd want to see a really good reason to break this expectation and surprise our users. I don't really think "we have many options but none of them is great" is sufficient, this is just not-great in a different way, and a far more novel and surprising one.

Setting aside "in my experience" arguments, I do still hold the position that the reasons I listed in the OP amount to a "good" reason to adopt the proposed naming convention (I would also subjectively say "really good"). I never stated "we have many options but none of them is great" as justification. My reasons are the 1/2/3/4 list in the OP, reproduced here:

  1. Marker types are unlike any other concept, to the point that they deserve their own casing convention
  2. Marker types benefit from a built-in taxonomy system
  3. Given how widely we use them, ICU4X in particular is in a position to introduce the convention
  4. icu::datetime::Formatter<Gregorian, Month_Day_Hour_DayPeriod> is crystal clear (and, subjectively to me, visually appealing)
sffc commented 3 months ago

This is a sub-point on Reason 2, but it might deserve being its own reason: I honestly find camel case hard to read. I squint at my screen to figure out where the capital letters are that separate the token into words. For single nouns and noun phrases, this is not super important, because the whole token expresses a single concept. However, for marker types, they do not represent a single concept! Field set marker types are a bag of things. The collection is crucial to understanding the type.

As a little exercise: how quickly can you can down this list and tell me which ones include the hour?

In my experience, when I read that list, I can clearly see that these are symbols that are composed of individual tokens. I can quickly see how many tokens are in the list, and I can quickly identify what those tokens are.

Now try the same exercise if we use a camel case convention:

At least for me, in my experience, I see the second list as 6 symbols. Secondarily, each symbol appears to be in camel case, so it probably has multiple words, but, based on my experience, I can assume that the individual words are unimportant for understanding the purpose of the type.

Manishearth commented 3 months ago

There are cases where we are getting close in icu_datetime. For example, FieldSet is a struct, but ConstFieldSet is a trait; I was slightly more comfortable with the HasConstFieldSet and IntoRuntimeFieldSet naming, but those were not popular with the rest of the group. DateTimeInput would be a struct, but DateTimeInput is currently the name of the trait we have in 1.0 (the trait is changing to GetField or similar in 2.0).

Yeah I think DateTimeInput is the best example of this and that's a very niche situation to be in IMO, we shouldn't decide our overall naming practice by that. I think it's good to come up with a set of rules for when that happens next, and that would be something like:

I see this argument as an "in my experience" type argument

I think it's worth highlighting that it is not just my experience but others' experience too, there.

I was first introduced to title snake case in my work with Unicode properties, and while it was a bit strange at first, I have come to appreciate it.

To be clear, I'm quite used to this convention at this point because of working with Unicode properties, I'm just not used to it in Rust, which is what I'm worried about.

I never stated "we have many options but none of them is great" as justification.

Yes, that is me paraphrasing what I consider the strongest justification here. I find the other points to be weaker. I'll address them here:

Marker types are unlike any other concept, to the point that they deserve their own casing convention

I really don't see this at all. At least in my experience in Rust marker types are not that different, and I don't see why they should be visually distinct. From a user perspective I rarely care that much whether something is a marker type: if it appears in param/return lists I need to worry about it, if it doesn't, then I don't. During the course of the datetime design we have discussed turning various marker types into structs and maybe vice versa. They're not that different.

Furthermore, the strong benefits of this design are only really for the subset of marker types that mark a "bag of stuff", which is in our case largely just the fieldset types. I find the justification here even less applicable for data markers: it's nice for dealing with them from codegen, but those aren't even ones people need to bother parsing too much!

Marker types benefit from a built-in taxonomy system

I am entirely unconvinced of this.

Given how widely we use them, ICU4X in particular is in a position to introduce the convention

I do not think ICU4X is anywhere near as influential as being able to do this successfully. This is not a statement on the current popularity of ICU4X either, rather that as an internationalization library there is a limit to how influential we can be here. Libraries that have succeeded in introducing new convention this way are extremely few.

icu::datetime::Formatter<Gregorian, Month_Day_Hour_DayPeriod> is crystal clear (and, subjectively to me, visually appealing)

This I agree with, but I'm not yet convinced that this being crystal clear is that important to optimize for.

However, for marker types, they do not represent a single concept! Field set marker types are a bag of things

This ties in to my earlier point about this being just about "bag of stuff" marker types, not marker types in general. If your primary argument is that marker types are a different type of thing that we should introduce a new convention for, the benefits of that convention should work for all markers, not just ones like FieldSet.

As a little exercise: how quickly can you can down this list and tell me which ones include the hour?

To be clear: I totally agree that Foo_Bar scans easier when it comes to "find out what fields are there". I'm just not convinced that that is super important. My expectation of their use will be people copy pasting them from docs and then not looking too closely at them. I don't want people looking too closely at them, they really should blend into the background.


I suspect besides the difference in experience, the two core value differences we have here are:

sffc commented 3 months ago

On trait names: I generally strongly prefer general solutions that avoid clashes ("just work"), but we can fall back to a case-by-case bikeshed given that the referenced Rust RFC is not clash-free.


On marker types:

I suspect besides the difference in experience, the two core value differences we have here are:

I'm not convinced that these being scannable is that important I'm not convinced that marker types are so special as to need a separate taxonomy, and I think that justification clashes a bit with the justification about readability benefits for "bag of stuff" markers when most Rust markers aren't "bag of stuff".

I'll address the first point below.

On the second point: the two biggest categories of marker types are field set markers and data markers. The elements in the field set markers are the fields being displayed, and in data markers they are the fully qualified path to the data marker, which shows up in directory paths. I acknowledge that the justification of "bag of things" as the mental model for field set markers is the stronger of the two, so maybe let's take this one step at a time: first for the field marker types in isolation, and then, only after we can agree on what to do there, revisit the more general problem.


For field set markers: I'll start with contrasting pros and cons of title snake case and "smushed camel case" (which I currently find objectionable, but the least so of the alternatives I listed earlier):

  1. Title snake case: Month_Day_Hour_DayPeriod
    • Pro: Signifies that the type is something special (not constructible: usable for type system hints only). (Point 1)
    • Pro: Indicates that the type signifies a bag of stuff. (Point 2)
    • Pro: Emphasizes that the type is an important and unique piece of the ICU4X API that should be learnt. (Point 3)
    • Pro: Crystal clear. (Point 4)
    • Con: Novel casing convention that is "surprising and distracting" to some users.
  2. Smushed camel case: MonthDayHourDayperiod
    • Pro: Less "risky".
    • Pro: Doesn't require type name lint suppressions.
    • Con: Difficult to read / scan in code files and docs.
    • Con: Novel casing system in the sense that two-word tokens are titlecased as a single segment.

I personally find this to be a slam dunk: I highly value the clarity and readability, and I believe it is worth taking a risk to advance that objective. We can break down why:

  1. Everything we can do to make neo datetime formatter more readable and clear is critical, given that neo datetime formatter is not very easy to learn, even for ICU4X-TC contributors.
  2. The best mental model for field sets is "a bag of fields that are turned on for this formatter". Title snake case indicates that it is a bag of fields, assisting with the mental model.
  3. The users of ICU4X DateTime formatting are likely to be self-selected and have fewer alternatives to achieve their localization needs, so I feel comfortable taking a slightly higher level of risk with our audience. (This applies to many of the design decisions we've made that have gone slightly against the grain of what's most common in Rust. It doesn't apply to our utility crates like ZeroVec and DataBake.)
Manishearth commented 3 months ago

On trait names: I generally strongly prefer general solutions that avoid clashes ("just work"), but we can fall back to a case-by-case bikeshed given that the referenced Rust RFC is not clash-free.

I'm not convinced "always use adjectives" will "just work" in edge cases, much like the Rust RFC rules do mostly "just work" except when we really need a struct and trait to have similar naming, which to me is very much an edge case in Rust. So I don't think that's a general solution either.

I acknowledge that the justification of "bag of things" as the mental model for field set markers is the stronger of the two, so maybe let's take this one step at a time: first for the field marker types in isolation, and then, only after we can agree on what to do there, revisit the more general problem.

(I think this is where I derive the "we have many options but none of them is great" justification, fwiw. I don't see our justifications expanding beyond the fieldset markers. I can see a different set of justifications for data markers, but again, they don't generalize)

Con: Novel casing system in the sense that two-word tokens are titlecased as a single segment.

I don't find that super novel, we've already had Metazone vs MetaZone, this is a common enough thing I see in the Rust community where people decide in different ways. It's not one where there's a single consensus answer, but that's good for us, it means that either casing choice is more or less acceptable.

Pro: Emphasizes that the type is an important and unique piece of the ICU4X API that should be learnt. (Point 3)

Pro: Signifies that the type is something special (not constructible: usable for type system hints only). (Point 1)

I actually don't want us to call attention to this. This isn't that worth emphasis. It's worth learning, but there's a lot of things about this API that are equally important.

The fact that it isn't constructible isn't that important to me: whether or not it's constructible is only important when you get to the point where an API seems to want to ask for an instance.

There are times where we have "it's not important how to construct this, it's just a type system hint" types that aren't markers as well! TypedDateFormatter<Cal> doesn't need Cal during construction, from its point of view it may as well be a marker type, all that it does is enforce type safety and drive appropriate data loading. Which is what the Fieldset markers do here as well! Our calendar types end up being non-markery due to a different set of APIs, ones which may get used here but do not mandatorily get used here, and the fact that they are or aren't markers is completely irrelevant for tis API.

If anything, this makes me more convinced that treating these as a special snowflake is a bad idea: I think it is worth having this type of fluidity between types that are "important for type system hints" between them being dataful or just zero sized markers. I think we already had some moments during the design of these APIs where what initially was a marker+struct pair got collapsed into a single struct, and I think we've gone in the other direction too. This reveals to me that conceptually whether or not it is a marker isn't really that important! It's only contextually important whether or not something needs to be constructed, and that is clear enough from the APIs you use to get the parametrized type.

Con: Novel casing convention that is "surprising and distracting" to some users.

I'll go a step further and say that it can be bad for our reputation. As you mention we do go against the Rust grain a couple times and that's often in well-justified scenarios, definitely not ones where I see people getting annoyed as us for it. I still do not see the justification here as particularly strong, and I think us choosing to do this will annoy users and come off as us being a bit arrogant with considering ourselves to be super special (I've seen other crates get talked about negatively for treating themselves as special in weird ways). I really don't think we're super special here, marker types are a well-established pattern across Rust.

Because of this I find the bar for doing this to be quite high, I do think that ultimately, Month_Day_Hour_DayPeriod is clearer than MonthDayHourDayperiod, I just don't think that that delta in clarity is anywhere close to sufficient to justify doing this. I basically find MonthDayHourDayperiod to be clear enough.

sffc commented 3 months ago

we've already had Metazone vs MetaZone

Metazone wasn't a casing decision; we actually decided that "metazone" was one word. You might be thinking of "time zone", where we decided to keep the module name icu::timezone with the justification that module names should use concatenated lowercase.

I actually don't want us to call attention to this. This isn't that worth emphasis. It's worth learning, but there's a lot of things about this API that are equally important.

For clients, the field set markers are the most important and unique part of the new API. The rest is standard ICU4X: constructors, format functions, compiled data, data providers, etc.

There are times where we have "it's not important how to construct this, it's just a type system hint" types that aren't markers as well! TypedDateFormatter doesn't need Cal during construction, from its point of view it may as well be a marker type, all that it does is enforce type safety and drive appropriate data loading. Which is what the Fieldset markers do here as well! Our calendar types end up being non-markery due to a different set of APIs, ones which may get used here but do not mandatorily get used here, and the fact that they are or aren't markers is completely irrelevant for tis API. ... I think we already had some moments during the design of these APIs where what initially was a marker+struct pair got collapsed into a single struct, and I think we've gone in the other direction too. This reveals to me that conceptually whether or not it is a marker isn't really that important!

I've done my homework in trying to make these marker types also be concrete types. We did that for calendar systems. But I'm convinced in my multiple failed attempts that these field set marker types really are just marker types. They are in fact special marker types in the sense that they are compile-time collections of things.

Note that I'm not proposing changing this convention for the IsRuntimeComponents type parameters, only for the marker types that exist for the purpose of implementing HasConstComponents.

I'll go a step further and say that it can be bad for our reputation. ...

This type of argument seems to be based on personal experience. I just find it very hard to believe that using title snake case for date field marker types would rise to the level of reputational loss.


Would it change your opinion if we shipped type aliases for the common cases, which is something I kind-of wanted to do anyway? For example,

pub type DateFormatter = Formatter<Year_Month_Day>;
pub type DateTimeFormatter = Formatter<Year_Month_Day_Hour_Minute>;
// ...

Then the field set marker types wouldn't be the first thing people see. They would be something they see when they want to do something that the default type parameters don't do.

sffc commented 2 months ago

I talked this over with @echeran and @kartva today.

I presented the following question: We have a type icu::datetime::Formatter<Foo>. When Foo is a field set, what should it be in code? Assume all symbols are in a module and could be referenced via their module (such as icu::datetime::fieldsets::Weekday).

Option Weekday Full Year + Month
1 Weekday FullYear_Month
2 Weekday[Marker] FullYearMonth[Marker]
3 Weekday[Marker] FullyearMonth[Marker]
4 E YM

Doing my best to summarize the initial reactions:

Then the conversation continued:

(this summary was approved by @echeran and @kartva)

sffc commented 2 months ago

On the trait naming: the other example I almost forgot to mention but which we've definitely hit problems is in the data provider constellation. BufferProvider is a trait but BlobDataProvider is a struct. We could have just as reasonably done the opposite assignment of names.

Manishearth commented 2 months ago

I think the Rust trait scheme would have unambiguously given BufferProvide and BlobDataProvide for traits. We didn't follow it, and what we chose instead is still acceptable, but "verbs first, then nouns, then adjectives" would still have been unambiguous here: I don't think that's evidence against it. It is seldom the case that a struct can be named like a verb.

Manishearth commented 2 months ago

Would it change your opinion if we shipped type aliases for the common cases, which is something I kind-of wanted to do anyway? For example,

Not really, no, since the long names would still show up relatively commonly. It's slightly better, though.

Metazone wasn't a casing decision; we actually decided that "metazone" was one word. You might be thinking of "time zone", where we decided to keep the module name icu::timezone with the justification that module names should use concatenated lowercase.

I was thinking of both, but yes.

For clients, the field set markers are the most important and unique part of the new API. The rest is standard ICU4X: constructors, format functions, compiled data, data providers, etc.

The traits that work with these are equally important. The GetField stuff as well. There's a lot of weird stuff about this API.

But regardless of that, I don't think picking an entirely novel casing convention is an appropriate way to call out attention in the first place.

I've done my homework in trying to make these marker types also be concrete types. We did that for calendar systems. But I'm convinced in my multiple failed attempts that these field set marker types really are just marker types. They are in fact special marker types in the sense that they are compile-time collections of things.

Oh, I have no doubt that these are and should remain markers. My point is that the fact that we have in the past had reasonable arguments for having them not be markers, and other markers got un-markered similarly, tells me that markers are really not that special. I keep hearing the argument that they're special here, and I simply don't see any evidence for it.

This type of argument seems to be based on personal experience. I just find it very hard to believe that using title snake case for date field marker types would rise to the level of reputational loss.

Yes, it's personal experience, but I think it's important! I'm happy to ask other community members who have been around a while to weigh in and say what they would feel about a library that made a choice like this.


For what it's worth, I'm not against YMD etc being chosen instead, if conciseness is valued sufficiently I'm happy to go along with that. I'm mostly against the snake case one, and amongst the rest I find YearMonthDayperiod > YMDp > YearMonthDayPeriod but could be okay with any of them.

Manishearth commented 2 months ago

Oh, if we did end up going the YMD route I think we very very much should do the aliases you mentioned, though.

zbraniecki commented 2 months ago

My soft (0.5 in Apache scale) preference is: YMD > YearMonthDayPeriod, Year_Month_DayPeriod.

Here is my mental model: I consider those markers as an outlier not only in Rust ecosystem but also in ICU4X. I hope we will not need them often, and only the most advanced and complex API will need them - DateTimeFormat. Maybe one or two more, but I really hope not. DTF happens to be the really sensitive one, with tons of people having strong preferences, both users and designers. It's tricky and that's why Shane's work is monumental. He doesn't just rewrite a component, or a component that is hot. He is rewriting the touchpoint of Software and Internationalization. NumberFormat may be used more often, but interacted with much less. Same for Locale.

This means that the community will judge us by how ICU4X DateTimeFormat "feels". And we should aim to make the ergonomic API... well, ergonomic.

Ok, so let's look at the example of the API call:

let length = SkeletonLength::Medium;
let locale = locale!("en-u-ca-hebrew");

let formatter = Formatter::<YearMonthDayMarker>::try_new(&locale.into(), length).unwrap();

I understand we can provide convenience types, but the trick with them is that the pit of success with them is narrow. One step away and you're in a new world with marker types and weirdness - weirdness engineers didn't sign up for. It's a hermetic knowledge most devs don't want to have. They came here to "format a date in an i18n way" because in this week's sprint they have 10 tasks and this is one of them.

They are displeased to realize they have choices which ask them to make decisions. They want "just the simple thing" more often than not. Very few come excited to see the breadth of options they have in front of them. And those that do are already aware how challenging the problem is, they have battle scars to show, they came ready for a week of exploration, not 30 minutes. We don't have to help them as much.

Getting back to the Joe Average Developer. We may provide them the alias, and let's say (judgement call) that 50% of it'll work. And 50% of time their UX comes back asking for to do 1-1 some Figma design that shows different format in en-US. They realize that instead of an alias, they now have to do the whole hermetic mumbo jumbo and they feel frustrated - they're punished for trying to do the right thing.

I actually don't think they'll care if Option 1 or Option 2. They won't spend time parsing DayPeriod vs Dayperiod or analyzing if DayDayPeriod is weirder than DayDayperiod or not. They'll just feel it's challenging. And in Option1, it's also weird.

Shane points out that it's a good thing - It's a new concept, new syntax communicates it. I disagree. For me it's as if I wanted to drive a car and the manufacturer (ekhm, Tesla) moved the gear shifter to a different location because "in Tesla it serves a different role, it's worth pointing it out". Cool, but I'm not trying to learn the Advanced Tesla, I'm trying to drive my rental to my destination. Enough of the analogy. Thanks for still reading :)

There are two solutions I can see:

  1. We can trick the generics
  2. We can write a macro for them

1. Trick generics.

let formatter = Formatter::try_new("en-US", Medium, YMD).unwrap();

Now we're talking! But the reality is that I'm hopefully not going to be hardcoding the locale, so no value in passing it manually. It comes from somewhere:

let formatter = Formatter::try_new(loc, Medium, YMD).unwrap();

I understand the need for locale, I get that I set the length to Medium, and the third one is weird, right? Yes, it's weird because I had to look up the list of available ones, and pick one based on some helper table that matches this keyword to some example that closely resembles my Figma.

I argue, and here's probably the crux of my argument hidden in a long mental model description, that the reader other than ICU4X dev won't usually care what components are in the marker. They won't read it much. They will just look for XYZ and match it visually to "January 5th 2024" being more what they mean than EWP being "Jan 5th 24 12:30pm".

And we can give it to them with something like `fn try_new(loc: &Locale, length: Length, skeleta: M)

This may be tricky, but it's doable.

2. Macro

If they care or value having it visually represented, they'd like something like that:

let formatter = Formatter::try_new(loc, Medium, { Year, Month, Day }).unwrap();

which is close to JS elaborate form:

let formatter = new Formatter(loc, {
  year: "long",
  month: "2-digit",
  day: "2-digit"
});

so... why not a macro? And if we do a macro, then we are not constrained to anything but dev ergonomics. We can encode it internally according to Rust rules, and noone will see it except in debugging, and in debugging, good luck if you're not familiar with ICU4X internals.

Manishearth commented 2 months ago

Worth noting, macro-ing the constructor doesn't necessarily make the type easier to read.

Hmm, if we're discussing things like macro-ified {Year, Month, Day}, then it's worth mentioning that without any macro we can totally do (Year, Month, Day) tuples. These aren't as concise but are quite readable, and scan nicely.

They do have some downsides:

I'm not sure I really want to advocate this, but I'm not super against exploring such routes.

I argue, and here's probably the crux of my argument hidden in a long mental model description, that the reader other than ICU4X dev won't usually care what components are in the marker

This I agree with; overall I think from the perspective of a user building a UI understanding exactly what is in these types is not that important. Important enough when they're making the choice (and looking at docs), but not important enough that it should stand out or really even be that easy to mentally parse when later reading the code. I really think that once you've picked which skeleton you want that choice should melt in to the background, it is not worth that choice being super loudly visible in the code once written.

Manishearth commented 2 months ago

I think Zibi's arguments make me more amenable to YMD as well, especially with his comments about what the reader of the code will care about.

@zbraniecki do you have ideas as to how the YMD scheme will deal with things like FullYear and DayPeriod?

Manishearth commented 2 months ago

Here's the full list of components I can find:

We don't necessarily need to have full type alias coverage for the zone types, but it seems like the main issue is going to be Day vs DayPeriod, we could just use Pd/Periodday. That scans better: the more important thing is up front, but allows us to retain Periodyear for later.

@sffc is MonthDayPeriodday unscannable for you? I think putting Period first makes it way clearer as to what's going on, and it will abbreviate much nicer if we decided to go with MDPd

sffc commented 2 months ago

So much activity! Thank you! Responding in chronological order

For clients, the field set markers are the most important and unique part of the new API. The rest is standard ICU4X: constructors, format functions, compiled data, data providers, etc.

The traits that work with these are equally important. The GetField stuff as well. There's a lot of weird stuff about this API.

But regardless of that, I don't think picking an entirely novel casing convention is an appropriate way to call out attention in the first place.

I don't agree that the other traits are equally important. The field set marker types implement about 7 traits each. GetField is not really that complicated of a trait, especially if we consolidate it into the marker type via the parameterized Input struct (#5269). I'm trying to keep most of the complexity in one place, which is the marker types and the traits they implement.

I don't think I've used "call out attention" as an argument one way or another in this thread. I do hold the position that the casing convention emphasizes that these are sets of things, which is different.

Oh, I have no doubt that these are and should remain markers. My point is that the fact that we have in the past had reasonable arguments for having them not be markers, and other markers got un-markered similarly, tells me that markers are really not that special. I keep hearing the argument that they're special here, and I simply don't see any evidence for it.

OK, so you're saying markers as a concept are not really special.

One could say that trait implementations are not really special. When we changed markers to non-markers, it was moving the trait impl from a "fake" struct/enum to a "real" struct/enum. The fact that you can freely move trait impls around between things is a really cool and powerful feature of the Rust programming language. And, the fact that Rust lets you make arbitrary types that don't actually do anything other than provide hints to the compiler is quite special.

But, apart from whether or not marker types as a general concept are special or not, the field set marker types are special in their own right:

  1. They are compile-time sets of things. This is a good mental model that in my limited experience really makes people cling and grasp the concept of semantic skeleta.
  2. They implement 7+ traits and have 50+ associated types, including nested associated types. If that's not "special", I don't know what is.

Yes, it's personal experience, but I think it's important! I'm happy to ask other community members who have been around a while to weigh in and say what they would feel about a library that made a choice like this.

I value and trust your judgement/experience in these matters, and I'm usually inclined to take it at face value, but I feel this situation is sufficiently visible and impactful to the usability of the API that I would rather like to see some data before being forced to adopt what I consider a painfully inadequate approach such as MonthDayDayperiod.

My anecdotal interview with @echeran and @kartva did not raise any concerns about the unconventional case convention.

In TC39 we have a subcommittee that specializes in researching and gathering data on exactly these types of questions.

Getting back to the Joe Average Developer. We may provide them the alias, and let's say (judgement call) that 50% of it'll work. And 50% of time their UX comes back asking for to do 1-1 some Figma design that shows different format in en-US. They realize that instead of an alias, they now have to do the whole hermetic mumbo jumbo and they feel frustrated - they're punished for trying to do the right thing.

I actually don't think they'll care if Option 1 or Option 2. They won't spend time parsing DayPeriod vs Dayperiod or analyzing if DayDayPeriod is weirder than DayDayperiod or not. They'll just feel it's challenging. And in Option1, it's also weird.

Shane points out that it's a good thing - It's a new concept, new syntax communicates it. I disagree. For me it's as if I wanted to drive a car and the manufacturer (ekhm, Tesla) moved the gear shifter to a different location because "in Tesla it serves a different role, it's worth pointing it out". Cool, but I'm not trying to learn the Advanced Tesla, I'm trying to drive my rental to my destination. Enough of the analogy. Thanks for still reading :)

...

I argue, and here's probably the crux of my argument hidden in a long mental model description, that the reader other than ICU4X dev won't usually care what components are in the marker. They won't read it much. They will just look for XYZ and match it visually to "January 5th 2024" being more what they mean than EWP being "Jan 5th 24 12:30pm".

...

overall I think from the perspective of a user building a UI understanding exactly what is in these types is not that important. Important enough when they're making the choice (and looking at docs), but not important enough that it should stand out or really even be that easy to mentally parse when later reading the code. I really think that once you've picked which skeleton you want that choice should melt in to the background, it is not worth that choice being super loudly visible in the code once written.

Of all the things in this thread, this is exactly the type of situation I totally agree with, but I reach the exact opposite conclusion! My position is that "compile-time sets of things" is a useful, intuitive, and accurate mental model. I claim devs should be able to understand from reading and writing code exactly what the formatter they found is supposed to do. FullYear_Month_Day_DayPeriod is clear, understandable, and malleable. You can see that it specifies 4 types of fields. You can change it to Month_Day_DayPeriod and it just works. You can change it to FullYear_Month_Day_DayPeriod_Hour and you might get a compile error that your input type doesn't support GetField<Hour>: neat, and helpful! You can try changing it to FullYear_Day and realize that symbol doesn't exist so it must not be a valid set. Both FullYearMonthDayDayPeriod and FullyearMonthDayDayperiod fail miserably at these tasks.

(Year, Month, Day) tuples

I'm open to exploring the tuple path. I think it's even more trait madness to make it work, but it has a nice outcome.

I think Zibi's arguments make me more amenable to YMD as well, especially with his comments about what the reader of the code will care about.

I'm open to exploring the ultra-abbreviated YMD path. It doesn't have all the same benefits as the approach I've been advocating for, but I'm drawn to its conciseness and readability for the common cases.

Here's the full list of components I can find:

The fields are changing. I have a proposal. The currently proposed list is:

And I intend this to be able to be extended to include:

@sffc is MonthDayPeriodday unscannable for you? I think putting Period first makes it way clearer as to what's going on, and it will abbreviate much nicer if we decided to go with MDPd

I don't find there to be a substantial difference between MonthDayDayperiod and MonthDayPeriodday.

sffc commented 2 months ago

Exploration on ultra-abbreviated, based on the set of fields in the current spec draft:

We are limited to the 26 letters. We could potentially use lowercase letters as modifiers on the letter that precedes them. I want to avoid the same letter repeated multiple times in a row because that has implications about field lengths that I want to avoid. Unlike classical skeleta and patterns, we don't have to be unique in each symbol, so long as the full identifier works out.

With that in mind, a potential mapping could be:

Field Letter Comments
FullYear Yx Year with something special about it. "x" seems like a good letter.
Year Y
Month M
Day D
Weekday W Or maybe the LDML letter "E"? "W" means Week but Weekday is more common; Week could be "K"
Hour H
Minute M Always follows "H" so it won't be ambiguous with the Month "M"
Second S
ZoneGeneric Zg Or use "v" which is the CLDR letter?
ZoneGenericShort Zgs Or introduce a different lowercase letter?
ZoneSpecific Zs
Zone... Z... Follow a similar pattern as above

These result in the following.

Day Field Sets:

Calendar Period Field Sets:

Time Field Sets:

Time Zone Field Sets:

General Composite Field Sets:

We could also add some pre-computed composite field sets, but maybe we don't need them with this model:

Manishearth commented 2 months ago

I'm open to exploring the tuple path. I think it's even more trait madness to make it work, but it has a nice outcome.

I'll respond to the rest later, but, to be clear, I think it has no additional trait madness. Specifically, we don't ever write any code generic over (T, U) where T: DayLike, U: Whatever, just that where we have YearMonthDay right now we instead have (Year, Month, Day).

It does bring up the qeustion of what we do with composites, though, because you can't have a tuple on the LHS of a type alias.

Manishearth commented 2 months ago

@sffc your potential mapping should probably include the aspirational field types to be complete, yes?

sffc commented 2 months ago

One potential concern with YMD is that it might more strongly signify that there is an order involved, since I've seen UIs that have enumerations allowing you to select "YMD, MDY, or DMY".

Maybe not a huge deal since "MDY" and "DMY" wouldn't be valid identifiers.

robertbastian commented 2 months ago

I like the macro idea. Something like field_set!{Year, Month, Day} that evaluates to icu_datetime::...::YMD but is actually readable.

Manishearth commented 2 months ago

Maybe not a huge deal since "MDY" and "DMY" wouldn't be valid identifiers.

Yeah, in general I think the needs of this API is that it should be readable, not necessarily writeable.

I like the macro idea. Something like field_set!{Year, Month, Day} that evaluates to icu_datetime::...::YMD but is actually readable.

Also in favor if we can make it work in all positions, macros are finicky.

zbraniecki commented 2 months ago

I'm also in favor of a macro.

sffc commented 2 months ago

Proposal:

LGTM: @Manishearth @sffc @robertbastian @zbraniecki

sffc commented 2 months ago

We have alignment on the datetime field set type names. I'm leaving the issue open because there were other things in the OP which we don't have alignment on, yet.

sffc commented 1 month ago

Fieldsets are tracked in #1317 and #4991 tracks the data markers