reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Tweak language content to address remaining checklist issues #136

Closed: fsteeg closed this 6 months ago

fsteeg commented 10 months ago

For details see #125 (where each item links to the relevant part of the W3C i18n docs).

awagner-mainz commented 10 months ago

I am sorry for being late to comment here. I am a bit worried that a use case of mine that has been possible to solve in past versions may end up no longer being so:

I have used a reconciliation service for a multilingual SKOS vocabulary to get not only the identifier for a concept, but also the preferredLabel in a language that I would specify, independently of the language of the data I had. This made it possible to normalize data where one field has been supplied in different languages. Say I have a field "subject matter" that contains both German and Danish values for the same concept. I would ask the service to return reconciliation results in English. The service happily found the relevant entries among the labels in the different languages it has for its concepts, and it returned the concept identifier and the preferredLabel in English, if available (and an empty label field if there was no English label). Besides the identifiers themselves, I could thus fill a new column in OpenRefine with the English preferred label for all the rows.
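To make the use case concrete, here is a minimal sketch of the service behavior described above. Everything in it is an assumption for illustration: the in-memory `VOCAB`, the `reconcile` function, the result shape and the sample labels are hypothetical and not taken from the spec or from any real service.

```python
# Hypothetical vocabulary: concept id -> {language tag: preferred label}.
VOCAB = {
    "c1": {"de": "Strafrecht", "da": "strafferet", "en": "criminal law"},
    "c2": {"de": "Völkerrecht", "da": "folkeret"},  # no English label
}


def reconcile(term, audience_lang):
    """Match `term` against labels in every language, answer in `audience_lang`."""
    needle = term.casefold()
    results = []
    for concept_id, labels in VOCAB.items():
        # The language of the incoming value is not needed for matching here.
        if any(label.casefold() == needle for label in labels.values()):
            # Empty label when the concept has none in the requested language.
            results.append({"id": concept_id, "name": labels.get(audience_lang, "")})
    return results


for value in ["Strafrecht", "strafferet", "folkeret"]:
    print(value, "->", reconcile(value, "en"))
```

The German and Danish values resolve to the same identifier, and the label column can be filled from `name`, which stays empty for concepts without an English label.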

If I want to reproduce this now, I will obviously set the Accept-Language header to "en", but what and where should I specify as the text processing language? In fact I want the query to compare the query term against text fields in more or less all the languages that the authority database has. I guess I have to set a text-processing language somewhere, be it only to avoid any one default language eventually defined by the authority data publisher.

Is it just me or is the W3C i18n Best Practices geared very much towards data publication rather than querying?

wetneb commented 10 months ago

How about not specifying any text processing language?

The Accept-Language: en header does not imply that the values you are supplying to the service are in the same language, I think.
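A client following this suggestion might look roughly like the sketch below, which builds (but does not send) a request carrying `Accept-Language: en` and no language on the query strings. The endpoint URL, the `build_request` helper and the payload shape (a `queries` form field holding keyed JSON queries) are assumptions for illustration; check the version of the spec your service implements.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the real service URL.
ENDPOINT = "https://example.org/reconcile"


def build_request(terms, audience_lang="en"):
    """Build a reconciliation request without any text-processing language."""
    # Assumed payload shape: {"q0": {"query": ...}, "q1": {"query": ...}}
    queries = {f"q{i}": {"query": term} for i, term in enumerate(terms)}
    body = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return Request(
        ENDPOINT,
        data=body,
        # Language of the intended audience only; says nothing about the data.
        headers={"Accept-Language": audience_lang},
        method="POST",
    )


req = build_request(["Strafrecht", "strafferet"])
# urllib stores header names capitalized as "Accept-language".
print(req.get_header("Accept-language"))
```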

awagner-mainz commented 10 months ago

> The Accept-Language: en header does not imply that the values you are supplying to the service are in the same language, I think.

But I want all results to supply the English label. Which labels will I get if I do not set the Accept-Language header?

> How about not specifying any text processing language?

The spec currently says: "If no explicit text-processing language is given, the metadata language (the language of the intended audience) provided first (see service definition) is considered the default text-processing language." If I did provide the "en" metadata language tag (see above), then that would make the reconciliation service consider only the English labels for matching, no?

fsteeg commented 10 months ago

> I am a bit worried that a use case of mine that has been possible to solve in past versions may end up no longer being so.

Anything that worked before should still be possible, since none of the language-related changes are mandatory. These are all SHOULD or MAY. Maybe we need to be clearer about that in the spec?

> If I did provide the "en" metadata language tag (see above), then that would make the reconciliation service consider only the English labels for matching, no?

No, it only means that the service should assume that the language of the intended audience is English (metadata language) and that the provided labels are in English (default text-processing language, if none is set). What the service does with that information, or if it needs it at all, is up to the service.

wetneb commented 10 months ago

> No, it only means that the service should assume that the language of the intended audience is English (metadata language) and that the provided labels are in English (default text-processing language, if none is set)

Maybe it makes sense to remove this last assumption, no? In the context of OpenRefine, I would expect that we set the Accept-Language header to the language the user has chosen for the interface (or to another language configured specifically for that service, if we have the UI for that), but that does not mean that the data they are working on is in that language. So I would prefer that services not assume that this header is a sensible text-processing language.

fsteeg commented 10 months ago

> Maybe it makes sense to remove this last assumption, no?

The reason for that was basically this requirement from the checklist (#125):

> If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource. -- #lang_mixing

So that's for the case of more than one language in the header, but it also addresses this:

> The specification should indicate how to define the default text-processing language for the resource as a whole. -- #lang_whole_res

The latter could be solved with a lang attribute in the manifest (like dir in #137), but from the former it seems like we would still have to address the case where we only have a header with more than one language.

I think we can just consider that default text-processing language as a hint. The service can use that information to process the passed values, but if it does not actually need info on the text-processing language, it can simply ignore the fact that there is a default text-processing language.
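As a rough illustration of treating the default as a hint, the sketch below resolves a text-processing language from an explicit value or, failing that, from the first tag in a multi-valued `Accept-Language` header, as in the rule quoted from the spec earlier in this thread. The function names are hypothetical, and reading "provided first" as list order (rather than highest `q` weight) is an assumption.

```python
def first_language(accept_language):
    """Return the first tag listed in an Accept-Language value, or None."""
    for item in accept_language.split(","):
        tag = item.split(";", 1)[0].strip()  # drop any ";q=..." weight
        if tag and tag != "*":
            return tag
    return None


def text_language_hint(explicit_lang, accept_language):
    """An explicit language wins; otherwise fall back to the first listed tag."""
    if explicit_lang:
        return explicit_lang
    return first_language(accept_language or "")


print(text_language_hint(None, "de, en;q=0.8"))
print(text_language_hint("da", "de, en;q=0.8"))
```

A service that matches across all languages can simply never call `text_language_hint`; the value is information about the strings, not an instruction.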

awagner-mainz commented 10 months ago

I am very sorry for not showing up in today's meeting. I was mistakenly under the impression that we would be meeting tomorrow. I apologize!

Having just seen the minutes of today's meeting, I think that no longer interpreting the intended audience language as the default text-processing language will definitely ease my worries. Everything else, like the pros and cons of presuming a text-processing language, how to reflect the intended audience language in the query results, or example scenarios and best practices, is maybe better discussed on a wiki page or something like that.

But as we are already discussing this: Would it make sense to also reconsider the sentence "The lang value MUST be a single well-formed [BCP 47] language tag." at the beginning of section 8.3? Why should a query not indicate that it intends the query term to be processed in two languages? Again, this is much more about query terms than other fields. (And I acknowledge that I can either send two queries for the same term with two different languages, or not specify a text-processing language in the request at all, thereby (hopefully, depending on the service) falling back to "all the languages".) Sorry if this is beating a dead horse.

fsteeg commented 9 months ago

> Why should a query not indicate that it intends the query term to be processed in two languages?

I think the main misunderstanding here is that the text-processing language is not an instruction of any kind that tells the service how or what to process, but information about the language that a specific string is in. A service can always decide not to care about the language of a given string and, for example, search for matches in all languages. To quote from the W3C docs:

> So we are, by necessity, talking about associating a single language with the text, or some range of text, within the resource. Whereas the intended audience can be speakers of more than one language, a specific range of text can only be in one language at a time. -- W3C: Types of language declaration

We should probably make clear in the spec what the text-processing language actually is. I am assigning myself and switching this to a draft PR (which seems to be no longer possible, probably because it has already been reviewed) to do that, to remove the default statement, and to add an alternative for setting the text-processing language globally.

thadguidry commented 9 months ago

"Text-processing" sounds so ambiguous. We should maybe say "String language" or "human language of the string represented". Ideally, and more formally (since we're describing an API spec), I would rather call it "byte string", since for example a single character encoded in UTF-8 can take up anywhere from 1 to 4 bytes.
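The byte-length remark can be checked with a small sketch; the `utf8_length` helper and the sample characters are chosen for illustration only.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Escapes are used so the example does not depend on the file's own encoding.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji
for ch in samples:
    print(f"U+{ord(ch):04X}", utf8_length(ch))
```

The encoded size depends only on the code point, so it says nothing about which human language a string is written in.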

fsteeg commented 7 months ago

Addressed the remaining issues here:

> "Text-processing" sounds so ambiguous. We should maybe say "String language" or "human language of the string represented".

I think we should stick with the terminology from the W3C best practice docs. I hope with the added definitions the ambiguity is gone and people reading the spec will understand what that is about.