Multilingual support - Githubissues

wetneb commented 3 years ago

At the moment, the names and descriptions represented in the reconciliation queries and responses do not come with any language information.

For natively multilingual data sources (such as Wikidata) it would be convenient if this restriction could be lifted. At the moment, this is handled by offering one reconciliation endpoint per language supported by Wikidata (such as https://wikidata.reconci.link/en/api, https://wikidata.reconci.link/it/api, and so on), but it would be nicer to have a single endpoint which would support all languages directly. (It is complicated to teach users to insert the language code in the URL.)

The lack of multilingual support was identified by a reviewer from the Ontology Matching 2020 workshop and earlier by @tfmorris in https://github.com/reconciliation-api/specs/pull/48#issuecomment-665802855.

There are other endpoints which also encode some "constants" in the base URL of the reconciliation service. For instance, the OpenCorporates endpoint does this not for languages but for jurisdictions (letting users match against companies from a single country). https://api.opencorporates.com/documentation/Open-Refine-Reconciliation-API

So perhaps the right way to address this is to provide a better way for services to receive global configuration options, which would encompass the use cases of both the Wikidata and OpenCorporates endpoints?

VladimirAlexiev commented 3 years ago

I posted a separate issue, #55, of which this one is an instance. I think we should allow language preferences, similar to Accept-Language and the Wikidata label service. See https://github.com/w3c/sparql-12/issues/13.

Below are draft requirements:

language is a special param with these requirements:

- The language param values should conform to BCP 47 and preferably be selected from the IANA Language Subtag Registry (or see the Google Sheet iana-lang-tags for easier access).
- Language matching should conform to RFC 4647 (Matching of Language Tags), as used in HTTP Accept-Language and SPARQL langMatches. E.g., "en" should match a name with lang tag "en-GB".
- The language param can take several lang tags separated with commas, in which case they are interpreted as preference order.
- The service should prefer matches in the specified language(s) but can also return matches in other languages.
- The service should return entity names (and descriptions) in the specified language(s), but can fall back to any other language if the entity has no name in the specified language(s).
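
A minimal sketch of how a service might apply these draft requirements, assuming the comma-separated `language` value proposed above and RFC 4647 basic filtering (a range matches a tag that is equal to it or that extends it with a `-` subtag). The function names and the `names` dictionary are hypothetical and only for illustration:

```python
def lang_matches(language_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering: is `tag` covered by `language_range`?"""
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":
        return True
    return tag == language_range or tag.startswith(language_range + "-")


def parse_language_param(value: str) -> list[str]:
    """Split a comma-separated `language` value into tags, in preference order."""
    return [part.strip() for part in value.split(",") if part.strip()]


def pick_name(names: dict[str, str], language_param: str) -> tuple[str, str]:
    """Return (lang_tag, name) for the most preferred language that has a name.

    Falls back to any available name when no preferred language matches.
    """
    for wanted in parse_language_param(language_param):
        for tag, name in names.items():
            if lang_matches(wanted, tag):
                return tag, name
    return next(iter(names.items()))


# Hypothetical entity names keyed by language tag.
names = {"fr": "Londres", "en-GB": "London"}
print(pick_name(names, "de, en"))
```

Here "de" has no name, so the next preference "en" is tried and matches the "en-GB" name; with only a "fr" name available, the fallback branch would return it.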

The Ontotext Platform allows more flexible lang specification, including negations, see this comment. But I think we don't need such advanced features?

wetneb commented 3 years ago

I think it makes sense to treat this as a special case of #55 indeed. Thanks for drafting these specs, that looks pretty neat. I guess we'd need to make progress on #55 first since we would rely on this notion of parameter.

wetneb commented 1 year ago

In our last call it was discussed that we could simply let the client specify their language using the standard HTTP header Accept-Language (which @VladimirAlexiev already mentions above - but I would not introduce a new language parameter for that: just use the HTTP header).

This header would control the language in which the names and descriptions of entities, properties and types should be returned.

This would have benefits:

- not reinventing the wheel: just rely on a standard feature of HTTP
- web-based reconciliation clients will have this header set by the browser directly, without the app needing to integrate it

Downsides:

- when formulating a query manually, it would not be possible to set the language explicitly in the URL itself

thadguidry commented 1 year ago

@wetneb Hmm, which approach allows greater control for the users who are ultimately behind those clients? What are the pros and cons for users who deal with multiple languages per project and work in batches between languages? Does one approach reduce user control considerably? Could a client's workflow be adapted to still provide good user control?

wetneb commented 1 year ago

Even web-based clients can control which value they send in the Accept-Language header (for instance with the fetch JS API) so I would say so!
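
The same idea applies to non-browser clients. Below is a rough sketch in Python that builds (but does not send) a reconciliation request with an explicit Accept-Language header. The endpoint URL and the helper name are hypothetical, and the form-encoded `queries` parameter is an assumption about how the client submits its queries:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_reconciliation_request(endpoint: str, queries: dict, languages: list[str]) -> Request:
    """Build a POST request whose Accept-Language header is chosen by the client."""
    # Assumption: queries are sent as a form-encoded `queries` parameter.
    body = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    headers = {"Accept-Language": ", ".join(languages)}
    return Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical endpoint; the language list would come from the user's choice in the client UI.
request = build_reconciliation_request(
    "https://example.org/reconcile",
    {"q0": {"query": "Hans-Eberhard Urbaniak"}},
    ["it", "en"],
)
# urllib normalises header names with str.capitalize(), hence the lowercase "l" here.
print(request.get_header("Accept-language"))
```

The point is only that the header value is under the client's control, independently of any browser or system default.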

VladimirAlexiev commented 1 year ago

@wetneb The Accept-Language page lists two considerations that align with @thadguidry's questions above:

- The content of Accept-Language is often out of a user's control (when traveling, for instance).
- A user may also want to visit a page in a language different from the user interface language.

So I think we should allow an explicit language parameter, and use Accept-Language as default.

One of the requirements needs to be modified. The original wording:

The language param can take several lang tags separated with commas, in which case they are interpreted as preference order.

would become:

The language param can take several lang tags separated with commas.
- In Accept-Language, each lang tag can have an optional ;q= quality value (also called "q-factor" or "weight"). Quality values are relative and the default is ;q=1.0
- Accept-Language lang tags are sorted by quality value using a stable sort
- The list (as given in language or sorted as per Accept-Language) is used as preference order

Example: assume this header:

Accept-Language: en;q=0.1, en-US, en-GB

and assume an entity has one of the following sets of labels. For each set, the service should select this label for display:

- en-GB: return en-GB
- en-GB, en-US: return en-US
- en, en-GB, en-US: return en-US
- en-NZ, fr: return en-NZ because it matches en
- fr: return fr as last fallback (as if *;q=0.01 was specified last)
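
The selection rules in this example can be sketched as follows. This is an illustrative Python sketch, not a reference implementation: the function names are hypothetical, header parsing is deliberately simplified (only a single `q` parameter is handled), and matching uses RFC 4647 basic filtering.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language ranges of an Accept-Language header in preference order."""
    weighted = []
    for item in header.split(","):
        lang_range, _, params = item.strip().partition(";")
        lang_range = lang_range.strip()
        if not lang_range:
            continue
        quality = 1.0  # default quality value
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                continue  # skip malformed entries
        if quality > 0:  # q=0 means "not acceptable"
            weighted.append((lang_range, quality))
    # list.sort is stable, so ranges with equal q keep their header order.
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return [lang_range for lang_range, _ in weighted]


def lang_matches(language_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering."""
    language_range, tag = language_range.lower(), tag.lower()
    return (
        language_range == "*"
        or tag == language_range
        or tag.startswith(language_range + "-")
    )


def select_label(label_tags: list[str], header: str) -> str:
    """Pick the language tag of the label to display; fall back to the first label."""
    for wanted in parse_accept_language(header):
        for tag in label_tags:
            if lang_matches(wanted, tag):
                return tag
    return label_tags[0]


header = "en;q=0.1, en-US, en-GB"
for tags in (["en-GB"], ["en-GB", "en-US"], ["en", "en-GB", "en-US"], ["en-NZ", "fr"], ["fr"]):
    print(tags, "->", select_label(tags, header))
```

Run against the five label sets listed above, this selects en-GB, en-US, en-US, en-NZ and fr respectively.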

wetneb commented 1 year ago

> So I think we should allow an explicit language parameter, and use Accept-Language as default.

As soon as you introduce an explicit language parameter, you are then expecting that the reconciliation client makes use of it to let the user select a language.

But if the reconciliation client does let the user pick a language, then why can't it just pass on this language to the server with a header instead of a GET/POST parameter?

Reconciliation clients, even web-based, will be able to set such a header independently of the browser's defaults.

So I would rather stick with a single, standard way to define the language.

eroux commented 1 year ago

I'm starting to implement the reconciliation API and this is one of the key issues I'm struggling a bit with (without really being blocked). In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. Perhaps this should be a separate query but since the title of this one is "Multilingual support", I thought this could be relevant. Thanks for this wonderful API BTW!

wetneb commented 1 year ago

@eroux thanks for chiming in! If that parameter was supplied in a header, would that make it any more difficult for you to rely on it?

eroux commented 1 year ago

The parameter for the expected language of the results can be in a header, yes (in fact Accept-Language seems very standard for that), no problem.

The parameter for the queries should be with the queries, I think, perhaps something like:

{
  "q1": {
    "query": "Hans-Eberhard Urbaniak",
    "query_lang": "en"
  }
}

fsteeg commented 1 year ago

> In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. [...] the parameter for the queries should be with the queries I think

This fits well with the internationalization guidelines from W3C, which recommend that specifications provide separate methods for expressing (1) the language of the intended audience vs. (2) the text-processing language for a specific text range (see https://www.w3.org/International/questions/qa-text-processing-vs-metadata). So for (1) we could take the HTTP header approach from https://github.com/reconciliation-api/specs/pull/108, and for (2) we could add optional language fields to the JSON.

This would probably make sense for all objects (queries, properties, property values, candidates, candidate types, features). The language could apply to all fields of that object (e.g. name and description of candidates), and to all contained objects (like all properties of queries), unless they override the container setting, e.g.:

{
  "queries": [
    {
      "query": "Deng Shuping",
      "lang": "en",
      "properties": [
        {
          "pid": "professionOrOccupation",
          "v": "art historian"
        },
        {
          "pid": "variantName",
          "v": "鄧淑蘋",
          "lang": "zh-Hant"
        }
      ]
    }
  ]
}

Here, the query (explicit) and the first properties.v (inherited) have text-processing language en, the second properties.v overrides it with zh-Hant.

(The default / override logic is also part of the W3C guidelines, see https://www.w3.org/TR/international-specs/#lang_inherit.)
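
The default / override behaviour described above could be resolved on the server roughly like this. It is a sketch under the assumption that the optional `lang` field is named and placed as in the JSON example; the function name is hypothetical:

```python
from typing import Optional


def effective_languages(query: dict, default: Optional[str] = None) -> list[tuple[str, str, Optional[str]]]:
    """Resolve the text-processing language of the query string and each property value.

    A property value inherits the query's `lang` unless it declares its own.
    """
    query_lang = query.get("lang", default)
    resolved = [("query", query["query"], query_lang)]
    for prop in query.get("properties", []):
        resolved.append((prop["pid"], prop["v"], prop.get("lang", query_lang)))
    return resolved


query = {
    "query": "Deng Shuping",
    "lang": "en",
    "properties": [
        {"pid": "professionOrOccupation", "v": "art historian"},
        {"pid": "variantName", "v": "鄧淑蘋", "lang": "zh-Hant"},
    ],
}
for field, text, lang in effective_languages(query):
    print(field, text, lang)
```

For the example query, the query string and the first property value resolve to en, and the second property value resolves to zh-Hant.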

eroux commented 1 year ago

I totally agree! Another way of encoding it would be the JSON-LD way:

{
  "pid": "variantName",
  "v": {
    "@value": "鄧淑蘋",
    "@language": "zh-Hant"
  }
}

wetneb commented 1 year ago

In today's call @fsteeg mentioned that we should not have to worry too much about JSON-LD to determine the format of our JSON: it should be possible to add the right JSON-LD context to map our JSON structure to RDF appropriately. So we are inclined to go for @fsteeg's JSON structure above.

We might also need to support passing along the language of an entity used as property value:

{
   "pid": "foo",
   "v" : [ {
        "id": "Q344",
        "name": "some entity",
        "lang": "en"
   } ]
}

VladimirAlexiev commented 1 year ago

Hi @wetneb ! Regarding your last commit (only reading its title, not the code):

- I understand Accept-Language, and we discussed it above.
- But Content-Language is suitable for a doc written in one language. I don't think reconciliation results fit that description: even the different matches of one query may carry different languages. Imagine this situation:
  - I query for "John Philips" and specify Accept-Language: bg, en;q=0.5
  - The server has two person items: Q12 John Philips (en) and Q23 Иван Филипов (bg) = John Philips (en)
  - The server should return this JSON (it's not the real response format, but you get the idea):
```
{
  "matches": [
    {"id": "Q12", "name": "John Philips", "lang": "en"},
    {"id": "Q23", "name": "Иван Филипов", "lang": "bg"}
  ]
}
```
  What's the Content-Language of this document? It's neither bg nor en, because it's mixed.

wetneb commented 1 year ago

Agreed, it's redundant with the inclusion of the language in the JSON payloads, which we want to do in another change. So I would just remove the Content-Language header.

fsteeg commented 1 year ago

> The server has two person items Q12 John Philips (en) and Q23 Иван Филипов (bg) = John Philips (en) [...] What's the Content-Language of this document? It's neither bg nor en, because it's mixed

The Content-Language can actually contain both languages, so here it could be bg, en. This is what the W3C refers to as the metadata language or language of the intended audience, which can be multiple languages (see Types of language declaration). To express which string is in which (single) language, we added a section on setting the text processing language (in #129), which seems to basically work like your example.
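
As a small illustration of that reading, a server could derive the Content-Language value from the per-candidate language tags. This is only a sketch: the `lang` field name follows the example quoted above, the function name is hypothetical, and sorting the tags is an arbitrary choice since the header does not require a particular order.

```python
def content_language(candidates: list[dict]) -> str:
    """Collect the distinct `lang` tags of all candidates into one header value."""
    tags = sorted({c["lang"] for c in candidates if c.get("lang")})
    return ", ".join(tags)


candidates = [
    {"id": "Q12", "name": "John Philips", "lang": "en"},
    {"id": "Q23", "name": "Иван Филипов", "lang": "bg"},
]
print("Content-Language:", content_language(candidates))
```

For the two candidates in the example this yields the value bg, en.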

wetneb commented 1 year ago

@VladimirAlexiev what do you think about @fsteeg's understanding of Content-Language above? If that sounds good to you we'd merge the PR #108.

reconciliation-api / specs

Multilingual support #52