reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Support for multi-lingual candidate names #138

Open saumier opened 1 year ago

saumier commented 1 year ago

As a service provider, I would like clients to be able to query in any language, and the service to return candidate names in one or more languages specified in the client request.

Use Case

A client is reconciling a place in Canada using the Artsdata.ca Reconciliation service with the name "Studio Azrieli".

Current solution (not ideal)

The service returns multiple entities, including K11-15 "National Arts Centre - Azrieli Studio" and K11-15 "Centre National des Arts - Studio Azrieli", which appear as separate candidates but have the same URI. This may look incorrect to the user because the same entity is listed twice. A user who doesn't notice the shared URI may mistake them for duplicate entities.

[Screenshot, 2023-09-14 at 9:52:11 AM]

Ideal solution

The service returns multiple entities but only a single K11-15 displaying both names "National Arts Centre - Azrieli Studio" and "Centre National des Arts - Studio Azrieli" together. Parameters can specify the languages the client would like to display.
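A minimal sketch of the merge described above, in Python. The row shape (`id`, `name`, `lang`), the `names` key and the `merge_by_id` helper are assumptions made for illustration; none of them come from the spec or from the Artsdata service.

```python
def merge_by_id(rows, languages=None):
    """Group per-language rows into one candidate per id.

    rows: iterable of dicts with the hypothetical keys "id", "name", "lang".
    languages: optional list of language codes requested by the client;
               None keeps every language.
    """
    merged = {}  # dicts keep insertion order, so candidate order is stable
    for row in rows:
        if languages is not None and row["lang"] not in languages:
            continue
        merged.setdefault(row["id"], []).append((row["lang"], row["name"]))
    # One entry per id, carrying all of its names.
    return [{"id": cid, "names": names} for cid, names in merged.items()]


rows = [
    {"id": "K11-15", "name": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"id": "K11-15", "name": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
candidates = merge_by_id(rows)              # a single K11-15 with both names
french_only = merge_by_id(rows, ["fr"])     # client asked for French only
```

How the client would pass the requested languages is left open here, since the issue does not define that parameter.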

fsteeg commented 11 months ago

So on the protocol level, would this mean to allow arrays of objects for candidate name and description?

"candidates": [
  {
    "id": "K11-15",
    "name": [
      {
        "str": "National Arts Centre - Azrieli Studio",
        "lang": "en"
      },
      {
        "str": "Centre National des Arts - Studio Azrieli",
        "lang": "fr"
      }
    ],
    ...
  }
]

wetneb commented 11 months ago

If we go down that route, I wonder if we should also add support for that for multiple names for properties (when returned in a property suggest response, or in a data extension response) or for types (when returned in a type suggest response, or in a reconciliation response as part of the reconciliation candidates). I guess it would make things look more uniform but I am not really sure about the use case. What do you think @saumier?

saumier commented 6 months ago

> If we go down that route, I wonder if we should also add support for that for multiple names for properties (when returned in a property suggest response, or in a data extension response) or for types (when returned in a type suggest response, or in a reconciliation response as part of the reconciliation candidates). I guess it would make things look more uniform but I am not really sure about the use case. What do you think @saumier?

Yes. Since the group is not recommending JSON-LD, then I think this is the next best approach.

I am building a bilingual website (en, fr) at kg.artsdata.ca that implements a client for the reconciliation API. The UI of this site can switch between English and French. When querying using the reconciliation API, a query string can be in any language. For example, I could query a Place using either "Studio Azrieli" or "Azrieli Studio". The response would return candidates including K11-15. With this new approach, the website could display the name and description in the UI language.

It would also be good to add support for property and type suggestions.

wetneb commented 6 months ago

Summary of our discussion on the monthly call of last month: we could either

Maybe there are other options?

We thought that it is worth bringing more attention to this issue from the broader community, to gather more feedback.

tfmorris commented 6 months ago

Unless the variable structure is backward compatible when the simple variant is used, I think it's better to be consistent and always use the array form, even for a single entry. I suspect that things have diverged enough that there's not a compatibility benefit.

thadguidry commented 6 months ago

I second @tfmorris's opinion. I like the consistency: when our API standards have a context that could be "one or many", we resort to the array form. (Mostly because the argument for a simpler JSON structure presupposes that JSON arrays of objects are complicated or noisy, when they really are not for developers and our 2024+ tooling.)

acka47 commented 5 months ago

Generally, this seems to be related to #52, as a solution to this issue will also resolve #52, won't it?

> Maybe there are other options?

I am late to the party (sorry) but am adding this for reference. Generally, I like the "language map" approach from JSON-LD (examples) for providing labels in multiple languages as it is simple, terse and easy to read. The example from https://github.com/reconciliation-api/specs/issues/138#issuecomment-1803585218 would look like this with language maps:

{
   "candidates":[
      {
         "id":"K11-15",
         "name":{
            "en":"National Arts Centre - Azrieli Studio",
            "fr":"Centre National des Arts - Studio Azrieli"
         }
      }
   ]
}

thadguidry commented 5 months ago

@acka47 If we went that route, we'd have to adopt a convention and document it. Namely, that the key should be an ISO 639-3 three-letter code? Hmm, what else?

wetneb commented 5 months ago

@acka47 I like the conciseness, but how would a service represent a name or description for which it does not know the language? (Use case: a tool like CSV-reconcile, which spins up a reconciliation service on arbitrary datasets, generally will not have access to this sort of information and shouldn't make up a language for the sake of fitting in.)

acka47 commented 5 months ago

> If we went that route, we'd have to adopt a convention and document it. That being the key should be an ISO 639-3 three letter code?

Yes, we could define it similar to JSON-LD like this: "keys must be strings representing [BCP47] language codes and the values must be a string."

> how would a service represent a name or description for which it does not know the language?

Good question. I guess for the other approach from https://github.com/reconciliation-api/specs/issues/138#issuecomment-1803585218 you would just omit the optional lang key. With the language map approach you would have to use und as the key (for "undetermined"), I guess.
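To make the difference concrete, here is a small Python sketch of how a service might emit a name of unknown language under each shape. The helper names `name_object` and `language_map_entry` are hypothetical, and using `und` as the fallback key is only the convention suggested in this comment, not something the spec defines.

```python
UNDETERMINED = "und"  # language subtag for "undetermined"


def name_object(text, lang=None):
    """Array-of-objects form: leave out the optional lang key when unknown."""
    obj = {"str": text}
    if lang is not None:
        obj["lang"] = lang
    return obj


def language_map_entry(text, lang=None):
    """Language-map form: every value needs a key, so fall back to "und"."""
    key = lang if lang is not None else UNDETERMINED
    return {key: text}


unknown_as_object = name_object("National Arts Centre - Azrieli Studio")
unknown_as_map = language_map_entry("National Arts Centre - Azrieli Studio")
```

With the object form, an absent language stays absent; with the map form, the service has to write a placeholder key.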

awagner-mainz commented 5 months ago

Would the array approach allow for multiple alias names in the same language whereas the map approach would not? That could be an argument for choosing the array approach. On the other hand, I am not sure we actually want to allow this?
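A short Python sketch of why the two shapes differ here, assuming language-map values are plain strings as in the example above. The function `to_language_map`, the `und` fallback key and the alias "Azrieli Studio" are illustrative assumptions only.

```python
def to_language_map(names):
    """Convert [{"str": ..., "lang": ...}, ...] to {lang: str}.

    Returns (language_map, dropped), where dropped lists the names that
    could not be kept because their language key was already taken.
    """
    language_map = {}
    dropped = []
    for entry in names:
        key = entry.get("lang", "und")  # assumed key for names without lang
        if key in language_map:
            dropped.append(entry["str"])
        else:
            language_map[key] = entry["str"]
    return language_map, dropped


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Azrieli Studio", "lang": "en"},  # hypothetical alias
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
language_map, dropped = to_language_map(names)
```

The second English name ends up in `dropped`: a map with one string per language key has no slot for it, whereas the array carries it without any special handling.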

fsteeg commented 5 months ago

Another aspect to consider for the lang field vs. language maps is that the field provides a general approach for all objects. To quote from the current draft:

All objects used in this protocol (entities, types, properties, queries, candidates, features, etc.) MAY declare an explicit text-processing language in a lang field.

fsteeg commented 5 months ago

> [...] I think it's better to be consistent and always use the array form [...]

To be clear, this is not only about array vs. non-array, but also object vs. string.

The common, simple case currently:

"name": "National Arts Centre - Azrieli Studio"

The common case in the unified syntax:

"name": [
  {
    "str": "National Arts Centre - Azrieli Studio"
  }
]

If this was the first and only place where we introduce optional structure (string or array of objects), I'd agree we might want to avoid that. But since we do the same thing in other places (e.g. property values), I feel like the much simpler common case is worth having the option.
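If both shapes were allowed, a client would need to accept either one. A minimal Python sketch of that normalization follows; `normalize_names` is a hypothetical helper, not part of any existing client library.

```python
def normalize_names(value):
    """Return a list of name objects for either accepted shape.

    Accepts a plain string (the simple case) or a list of objects that
    each carry at least a "str" key (the unified syntax).
    """
    if isinstance(value, str):
        return [{"str": value}]
    if isinstance(value, list):
        for entry in value:
            if not isinstance(entry, dict) or "str" not in entry:
                raise ValueError(f"not a name object: {entry!r}")
        return [dict(entry) for entry in value]  # copy, keep lang etc.
    raise TypeError(f"unsupported name value: {value!r}")


simple = normalize_names("National Arts Centre - Azrieli Studio")
unified = normalize_names([{"str": "National Arts Centre - Azrieli Studio"}])
```

Both calls produce the same list, which is the cost of the optional structure: one extra type check on the consumer side.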

saumier commented 5 months ago

> how would a service represent a name or description for which it does not know the language?

From JSON-LD https://www.w3.org/TR/json-ld/#example-102-indexing-languaged-tagged-strings-using-none-for-no-language

... the special index @none is used for indexing strings which do not have a language; this is useful to maintain a normalized representation for string values not having a datatype.

Example if there was no language for a name.

{
   "candidates":[
      {
         "id":"K11-15",
         "name":{
            "@none":"National Arts Centre - Azrieli Studio"
         }
      }
   ]
}

wetneb commented 5 months ago

I'm not really enthusiastic about any of the solutions, but the one that I find the least bad is @fsteeg's suggestion to use the existing language (+ text direction) mechanisms we have, and simply switch to this default syntax:

"name": [
  {
    "str": "National Arts Centre - Azrieli Studio"
  }
]

with the option to add lang and dir attributes at the same level as the str if needed, and to add more objects to the array. This also has the benefit of allowing multiple names to be returned in the same language (for alternate names, such as acronyms).
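On the client side, a bilingual UI could then choose which name to show. The Python sketch below assumes exact matching on the `lang` value, which is a simplification (it does not match `fr-CA` against `fr`, for instance); `pick_name` is a hypothetical helper.

```python
def pick_name(names, preferred_langs):
    """Pick one name object to display.

    Order of preference: the first entry matching each preferred language
    in turn, then the first entry without a lang, then the first entry.
    Returns None for an empty list.
    """
    for lang in preferred_langs:
        for entry in names:
            if entry.get("lang") == lang:
                return entry
    for entry in names:
        if "lang" not in entry:
            return entry
    return names[0] if names else None


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
shown_in_french_ui = pick_name(names, ["fr", "en"])
```

When several names share a language, this returns the first one, so the order in which a service lists alternate names would matter to such a client.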

wetneb commented 5 months ago

And I agree with @tfmorris on the preference to stick to the array form.

saumier commented 5 months ago

I also agree with @wetneb and @tfmorris to use an array of objects with the str attribute and optional lang and dir.

For the sake of comparison with other patterns, this somewhat resembles the keys @value, @language and @direction used in JSON-LD.

acka47 commented 5 months ago

I have no preference here but just felt that the language map approach should at least be discussed in this context. Thus, I am fine with an array of objects containing at least the str with optional lang and dir.

saumier commented 2 months ago

@wetneb My team has implemented an endpoint for the current draft spec and updated our branch of the test bench to support both v0.2 and v0.3 (draft).

Here are 2 screen grabs from our branch of the test bench: one showing our production reconciliation endpoint (v0.2), and one showing our test reconciliation endpoint (v0.3) with multi-lingual support that meets the needs of this use case. This is a work in progress.

v0.2 - current spec - showing Azrieli Studio returned 2 times with the same ID K11-15

[Screenshot, 2024-07-08 at 10:11:54 AM]

v0.3 - draft spec - showing Azrieli Studio entity combined in a single response with en and fr.

[Screenshot, 2024-07-08 at 10:18:29 AM]