w3c / string-meta

How to add direction and language metadata to strings
https://w3c.github.io/string-meta/

Should we expect all strings to have direction metadata? #66

Open r12a opened 2 years ago

r12a commented 2 years ago

Let's consider a practical scenario where we have a message file containing all the 2,000 natural language strings needed for an Arabic translation of an application's UI and error messages. Let's imagine that the message set contains 10 strings which would produce the wrong direction if first-strong heuristics were applied, or if all strings were expected to have an RTL base direction (e.g. Arabic strings that begin with Latin characters, MAC addresses, untranslated strings, etc.).

My understanding of what we say in string-meta is that it should be possible to associate direction metadata with all strings in a string set. However, we don't require or expect every string to have direction metadata. We do, however, expect every string that differs from the default to have direction metadata explicitly assigned to it.

This applies if the message set as a whole has a way to set a default direction for all strings. In this case, this would probably be a file-wide field near the top of the file setting the default base direction for all strings to RTL. Strings that shouldn't have an RTL base direction must each be labelled for direction (LTR), to override the default setting.

I think that we also say this for strings when there is no default declared at the top of the file. This is on the assumption that, in the absence of direction information, the consumer will use first-strong heuristics to determine the direction. Again, any string that won't produce the correct result via those heuristics will need to have direction metadata associated with it.
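A minimal sketch of the first-strong heuristic referred to here, for illustration only. The function name `first_strong_direction` is hypothetical, and the sketch is a simplification: it just scans for the first character whose Unicode bidi class is strong (L, R or AL) and does not skip text inside isolates as the full Unicode Bidirectional Algorithm does.

```python
import unicodedata


def first_strong_direction(text):
    """Return 'ltr' or 'rtl' based on the first strong character, or None if there is none."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic letters)
            return "rtl"
    return None                          # only digits, punctuation, spaces, etc.


# The heuristic gets this Arabic string right, because digits and spaces are not strong.
print(first_strong_direction("85 بوصة SmartTV"))
# An Arabic sentence that starts with a Latin word is the kind of string that needs explicit metadata.
print(first_strong_direction("HTML هي لغة ترميز"))
```

Strings of the second kind are the ones that would need per-string metadata under the approach described above.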

While it is not a problem if every string is labelled for direction, I think that the reason for not requiring that is as follows: if every string in the resource has to be labelled, that has to be done by human intervention. A machine is not capable of identifying all 10 strings that should have an LTR base direction. (If a machine could do that, we wouldn't need direction metadata anyway, because the consumer would be able to simply apply the appropriate heuristics.) Therefore, correct labelling requires human intervention. It seems to me that requiring 1,990 strings in a set of 2,000 to be explicitly labelled by hand is too much to ask. Labelling just the 10 strings that would produce incorrect results, however, is achievable and essential.

Note also that requiring every string to have direction metadata explicitly assigned would make a resource-wide rule or field that sets the default direction redundant.

aphillips commented 2 years ago

I mostly agree with what you said above, but I want to call out that machines are not just limited in their ability to identify which strings need individual metadata but also in their ability to decide when that data is needed.

In designing (for example) a resource file format, a good optimization is to have a file-level default direction so that only specifically directional strings need to include or override the value. In that case, within that file, metadata can be omitted where it does not differ from the file-level base direction. However, when the file is sent for translation, most translation systems break out each individual string resource into a "segment". The segments are sent to translators individually and thus each segment needs to include a slot for the base direction (so that machines aren't left guessing what the direction is at the end).

When the slot is unfilled "because it is using the default value", you have to know what the default value is. You're expecting humans to make these decisions ("this item has unusual directionality") and then have machines decide whether the metadata was meaningful. This can work in the example of a resource file I give above: the machine might only keep per-string direction metadata when assembling the Arabic translated file that does not match the file-level base direction (which is presumably rtl for the Arabic).
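The round trip described here might look something like the following sketch. The resource layout (a file-level `dir` field plus a `strings` map) and the function names `to_segments` and `from_segments` are assumptions made up for illustration; they are not taken from string-meta or from any particular translation system.

```python
def to_segments(resource):
    """Break a resource file into standalone segments, each with its direction slot filled in."""
    default_dir = resource["dir"]
    return [
        {"id": key, "value": entry["value"], "dir": entry.get("dir", default_dir)}
        for key, entry in resource["strings"].items()
    ]


def from_segments(segments, default_dir):
    """Reassemble a resource file, keeping per-string direction only where it differs from the default."""
    strings = {}
    for seg in segments:
        entry = {"value": seg["value"]}
        if seg["dir"] != default_dir:
            entry["dir"] = seg["dir"]
        strings[seg["id"]] = entry
    return {"dir": default_dir, "strings": strings}


# Hypothetical Arabic resource: one ordinary string, one MAC address that must stay LTR.
resource = {
    "dir": "rtl",
    "strings": {
        "greeting": {"value": "مرحبا"},
        "mac": {"value": "00:1A:2B:3C:4D:5E", "dir": "ltr"},
    },
}

segments = to_segments(resource)
rebuilt = from_segments(segments, "rtl")
```

The point of the sketch is that the default value has to travel with the process: `from_segments` can only decide which metadata is redundant because it is told what the file-level default is.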

Notice that a lot of formats work differently. We see a lot of specs that use "language maps" as a form of localization support. So we see stuff like:

"someValue": {
    "en": "85 inch SmartTV",
    "fr": "85 pouces SmartTV",
    "ar": "85 بوصة  SmartTV" // needs dir=rtl !!
}

And the need to set the direction in this case does not depend on the first-strong requirement. The consumer (let's say it's in an HTML context) needs to bidi isolate and set the direction on the ar string. Otherwise they will have to guess the direction from the language tag. The producer doesn't know who the consumer is or how they will consume the strings. So probably direction metadata needs to appear here?

So what I guess I'm saying is that the situation is more complex. We will tend to recommend that specifications:

If the Arabic translation in your comment doesn't have a file-level base direction, then, yeah, every string would have dir=rtl (with a few exceptions that have ltr) so that the consumer can just use the strings directly, for example in filling in a template (e.g. <p lang=$item.lang dir=$item.dir>$item.value</p>).
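As a rough illustration of that kind of template filling, here is a sketch that assumes each item arrives with its own `lang`, `dir` and `value` fields already populated. The item shape and the `render_item` name are hypothetical; the only library call used is `html.escape` from the Python standard library.

```python
from html import escape


def render_item(item):
    """Fill a <p> template from an item that carries its own language and direction."""
    return '<p lang="{lang}" dir="{dir}">{value}</p>'.format(
        lang=escape(item["lang"], quote=True),
        dir=escape(item["dir"], quote=True),
        value=escape(item["value"]),  # escape markup characters in the string itself
    )


item = {"lang": "ar", "dir": "rtl", "value": "85 بوصة SmartTV"}
print(render_item(item))
```

Because every item carries explicit metadata, the consumer does not have to guess the direction from the language tag or from the content.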

String-meta as it currently stands, I think, allows for reasonable omission of direction metadata when it is all the same or when it can be reasonably defaulted, and it would be crazy to expect every string to actually populate language and direction (so long as the language and direction are identified somewhere). But it should be possible to compute the language and direction for any string as if the string had local-to-the-string metadata, right?

r12a commented 2 years ago

My summary of this conversation is that we are not disagreeing with each other. It should always be possible to associate metadata with each string. It may not be necessary to actually associate a language/direction value to all strings in storage if resource-wide default metadata values can be associated with each string when the consumer tries to use it.

That may mean filling in the property values for each string during transmission, or it may, I suppose, mean sending some information about the expected default direction, which is then applied to each string as the consumer renders it (unless, of course, it already carries its own metadata).

So much for strings for which metadata has been set.

Speaking of direction, some sets of strings may either:

a. just have no direction metadata at all (particularly legacy but also most current implementations)
b. have per-string properties to express direction but only use them for strings that aren't correctly detected by first-strong heuristics. In other words, no resource-wide metadata.

In both cases we have a rule that says that consumers should use first-strong heuristics to determine the direction of each string if there is no metadata available. This should allow correct rendering for the majority of strings.

As we discussed in last week's telecon, passing or storing an auto value for each string is probably not necessary, given that we have the rule about applying first-strong heuristics. However, an auto value could perhaps be useful at the resource-wide, default level. If strings have been accumulated from sources which didn't provide direction metadata, it would not be appropriate to set a resource-wide default of LTR or RTL, and using auto may be useful to indicate explicitly that LTR or RTL defaults are not appropriate.
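Putting the rules from this thread together, a consumer's resolution order might be sketched as below. This is an illustration under stated assumptions, not something string-meta specifies in this form: the `resolve_direction` name, the entry layout, and the final `fallback` for strings with no strong characters are all hypothetical, and `first_strong` is the same simplified scan as before (it ignores isolates).

```python
import unicodedata


def first_strong(text):
    """Simplified first-strong scan: 'ltr', 'rtl', or None if no strong character is found."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def resolve_direction(entry, resource_default="auto", fallback="ltr"):
    """Per-string metadata wins, then an explicit resource-wide default, then first-strong."""
    direction = entry.get("dir") or resource_default
    if direction in ("ltr", "rtl"):
        return direction
    # 'auto' (or anything unrecognised): fall back to heuristics on the content.
    # The last-resort fallback for strings with no strong characters is an assumption.
    return first_strong(entry["value"]) or fallback
```

With a resource-wide default of `auto`, a string accumulated from an unlabelled source is resolved from its content, while any string that carries its own `dir` still overrides the heuristic.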