Open wetneb opened 3 years ago
I posted a separate issue #55 of which this one is an instance.
I think we should allow language preferencing, similar to Accept-Language and the Wikidata label service. See https://github.com/w3c/sparql-12/issues/13.
Below are draft requirements:
language is a special param with these requirements:
Accept-Language and SPARQL langMatches. E.g. "en" should match a name with lang tag "en-GB".
The Ontotext Platform allows more flexible lang specification, including negations, see this comment. But I think we don't need such advanced features?
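To make the matching rule concrete, here is a minimal sketch of prefix-based matching in the spirit of SPARQL langMatches. The function name `lang_matches` is only illustrative and not part of any draft; the `*` handling follows the SPARQL convention that `*` matches any non-empty tag.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Return True if `tag` equals `lang_range` or extends it with subtags."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        # "*" matches any non-empty language tag
        return tag != ""
    # "en" matches "en" and "en-GB", but not "eng"
    return tag == lang_range or tag.startswith(lang_range + "-")
```

Note that the match is one-directional: a requested "en" accepts a label tagged "en-GB", but a requested "en-GB" does not accept a label tagged "en".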
I think it makes sense to treat this as a special case of #55 indeed. Thanks for drafting these specs, that looks pretty neat. I guess we'd need to make progress on #55 first since we would rely on this notion of parameter.
In our last call it was discussed that we could simply let the client specify their language using the standard HTTP header Accept-Language (which @VladimirAlexiev already mentions above, but I would not introduce a new language parameter for that: just use the HTTP header).
This header would control the language in which the names and descriptions of entities, properties and types should be returned.
This would have benefits:
Downsides:
@wetneb Hmm, which approach allows greater control for the users who are ultimately behind those clients? What are the pros and cons for users who deal with multiple languages per project and work in batches between languages? Does one approach reduce user control considerably? Could a client's workflow be adapted to still provide good user control?
Even web-based clients can control which value they send in the Accept-Language header (for instance with the fetch JS API) so I would say so!
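For non-browser clients the same idea might look like the sketch below, which builds (but does not send) a request carrying a user-chosen Accept-Language value. The endpoint URL and the helper name `build_reconciliation_request` are hypothetical, and the form-encoded `queries` parameter is an assumption about how the service receives queries.

```python
import json
import urllib.parse
import urllib.request


def build_reconciliation_request(endpoint: str, queries: dict, language: str) -> urllib.request.Request:
    """Build a POST request whose Accept-Language header is set by the client."""
    # Assumption: the service expects the queries as a form-encoded JSON string.
    body = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept-Language": language},
        method="POST",
    )


# Hypothetical endpoint; the language comes from the user's choice in the client UI.
req = build_reconciliation_request(
    "https://example.org/reconcile",
    {"q1": {"query": "Hans-Eberhard Urbaniak"}},
    "bg, en;q=0.5",
)
```

The header value is independent of any system or browser default, which is the point made above.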
@wetneb The Accept-Language page lists two considerations that align with @thadguidry's questions above:
So I think we should allow an explicit language parameter, and use Accept-Language as default.
One of the requirements needs to be modified:
- The language param can take several lang tags separated with commas, in which case they are interpreted as preference order.
- In Accept-Language, each lang tag can have an optional ;q= quality value (also called "q-factor" or "weight"). Quality values are relative and the default is ;q=1.0
- Accept-Language lang tags are sorted by quality value using a stable sort
- The resulting list (given in language, or sorted as per Accept-Language) is used as preference order

Example: assume this header:
Accept-Language: en;q=0.1, en-US, en-GB
and assume an entity has one of the following sets of labels. Then the service should return the selected label for display:
- en-GB: return en-GB
- en-GB, en-US: return en-US
- en, en-GB, en-US: return en-US
- en-NZ, fr: return en-NZ because it matches en
- fr: return fr as last fallback (as if *;q=0.01 was specified last)

So I think we should allow an explicit language parameter, and use Accept-Language as default.
As soon as you introduce an explicit language parameter, you are then expecting that the reconciliation client makes use of it to let the user select a language.
But if the reconciliation client does let the user pick a language, then why can't it just pass on this language to the server with a header instead of a GET/POST parameter?
Reconciliation clients, even web-based, will be able to set such a header independently of the browser's defaults.
So I would rather stick with a single, standard way to define the language.
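Whichever channel carries the preference, the selection rule from the Accept-Language example above (en;q=0.1, en-US, en-GB) could be sketched roughly as follows. The function names are illustrative only, the q-value parsing is deliberately simplified, and the "first available label" fallback is an assumption taken from the last case of that example.

```python
from typing import List, Optional


def parse_accept_language(header: str) -> List[str]:
    """Return language ranges ordered by descending quality value (ties keep header order)."""
    weighted = []
    for part in header.split(","):
        tag, _, param = part.strip().partition(";")
        tag, param = tag.strip(), param.strip()
        if not tag:
            continue
        q = 1.0  # default quality value
        if param.lower().startswith("q="):
            try:
                q = float(param[2:])
            except ValueError:
                continue  # assumption: skip entries with a malformed q-value
        if q > 0:  # q=0 marks a language as not acceptable
            weighted.append((tag, q))
    weighted.sort(key=lambda item: -item[1])  # list.sort is stable
    return [tag for tag, _ in weighted]


def _matches(tag: str, lang_range: str) -> bool:
    tag, lang_range = tag.lower(), lang_range.lower()
    return lang_range == "*" or tag == lang_range or tag.startswith(lang_range + "-")


def select_label(available: List[str], header: str) -> Optional[str]:
    """Pick the lang tag of the label to display, given the tags an entity has labels in."""
    for lang_range in parse_accept_language(header):
        for tag in available:
            if _matches(tag, lang_range):
                return tag
    # Last fallback: any label is better than none.
    return available[0] if available else None


HEADER = "en;q=0.1, en-US, en-GB"
```

With this sketch, `select_label(["en-NZ", "fr"], HEADER)` yields en-NZ because it matches en, and `select_label(["fr"], HEADER)` falls back to fr. When several labels match the same range, the sketch simply takes the first one; a real service might prefer a more specific rule.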
I'm starting to implement the reconciliation API and this is one of the key issues I'm struggling a bit with (without really being blocked). In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. Perhaps this should be a separate query but since the title of this one is "Multilingual support", I thought this could be relevant. Thanks for this wonderful API BTW!
@eroux thanks for chiming in! If that parameter was supplied in a header, would that make it any more difficult for you to rely on it?
The parameter for the expected language of the results can be in a header, yes (in fact Accept-Language seems very standard for that), no problem.
The parameter for the queries should be with the queries I think, perhaps something like:
{
  "q1": {
    "query": "Hans-Eberhard Urbaniak",
    "query_lang": "en"
  }
}
In our case another param would be helpful: the language associated with the query. For instance in our database we have a lot of transliterated names (originally in Tibetan, Sanskrit, Burmese, Khmer, etc.) and the search sometimes can't really guess in which language a query should be made, and it makes a lot of difference in the results. [...] the parameter for the queries should be with the queries I think
This fits well with the internationalization guidelines from W3C, which recommend that specifications provide separate methods for expressing (1) the language of the intended audience vs. (2) the text-processing language for a specific text range (see https://www.w3.org/International/questions/qa-text-processing-vs-metadata). So for (1) we could take the HTTP header approach from https://github.com/reconciliation-api/specs/pull/108, and for (2) we could add optional language fields to the JSON.
This would probably make sense for all objects (queries, properties, property values, candidates, candidate types, features). The language could apply to all fields of that object (e.g. name and description of candidates), and to all contained objects (like all properties of queries), unless they override the container setting, e.g.:
{
  "queries": [
    {
      "query": "Deng Shuping",
      "lang": "en",
      "properties": [
        {
          "pid": "professionOrOccupation",
          "v": "art historian"
        },
        {
          "pid": "variantName",
          "v": "鄧淑蘋",
          "lang": "zh-Hant"
        }
      ]
    }
  ]
}
Here, the query (explicit) and the first properties.v (inherited) have text-processing language en, the second properties.v overrides it with zh-Hant.
(The default / override logic is also part of the W3C guidelines, see https://www.w3.org/TR/international-specs/#lang_inherit.)
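As a rough illustration of the default / override logic, a server could resolve the effective text-processing language of each string as in the sketch below. The helper name `effective_langs` is made up for this example, and it only covers the query and its direct property values, not the other object types listed above.

```python
from typing import List, Optional, Tuple


def effective_langs(query: dict) -> List[Tuple[str, Optional[str]]]:
    """Return (text, lang) pairs, letting property values inherit the query's lang."""
    query_lang = query.get("lang")  # may be absent, in which case nothing is inherited
    resolved = [(query["query"], query_lang)]
    for prop in query.get("properties", []):
        # A property's own "lang" overrides the one set on the containing query.
        resolved.append((prop["v"], prop.get("lang", query_lang)))
    return resolved


example_query = {
    "query": "Deng Shuping",
    "lang": "en",
    "properties": [
        {"pid": "professionOrOccupation", "v": "art historian"},
        {"pid": "variantName", "v": "鄧淑蘋", "lang": "zh-Hant"},
    ],
}
```

Calling `effective_langs(example_query)` pairs "Deng Shuping" and "art historian" with en, and "鄧淑蘋" with zh-Hant.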
I totally agree! Another way of encoding it would be the JSON-LD way:
{
  "pid": "variantName",
  "v": {
    "@value": "鄧淑蘋",
    "@language": "zh-Hant"
  }
}
In today's call @fsteeg mentioned that we should not have to worry too much about JSON-LD to determine the format of our JSON: it should be possible to add the right JSON-LD context to map our JSON structure to RDF appropriately. So we are inclined to go for @fsteeg's JSON structure above.
We might also need to support passing along the language of an entity used as property value:
{
  "pid": "foo",
  "v": [
    {
      "id": "Q344",
      "name": "some entity",
      "lang": "en"
    }
  ]
}
Hi @wetneb! Regarding your last commit (only reading its title, not the code):
- Accept-Language, and we discussed it above
- Content-Language is suitable for a doc written in one language. I don't think Recon results fit that description: even the different matches of one query may carry different languages.
Imagine this situation:
- Accept-Language: bg, en;q=0.5
- The server has two person items Q12 John Philips (en) and Q23 Иван Филипов (bg) = John Philips (en)
matches: [
  {id: Q12, name: John Philips, lang: en},
  {id: Q23, name: Иван Филипов, lang: bg}
]
What's the Content-Language of this document? It's neither bg nor en, because it's mixed.
Agreed, it's redundant with the inclusion of the language in the JSON payloads, which we want to do in another change. So I would just remove the Content-Language header.
The server has two person items Q12 John Philips (en) and Q23 Иван Филипов (bg) = John Philips (en) [...] What's the Content-Language of this document? It's neither bg nor en, because it's mixed
The Content-Language can actually contain both languages, so here it could be bg, en. This is what the W3C refers to as the metadata language or language of the intended audience, which can be multiple languages (see Types of language declaration). To express which string is in which (single) language, we added a section on setting the text processing language (in #129), which seems to basically work like your example.
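Under that reading, a service could derive the header value from the languages that actually occur in the response. The sketch below assumes each match carries a `lang` field as in the example above; the function name `content_language` is hypothetical, and the alphabetical ordering is just a choice for deterministic output.

```python
from typing import List


def content_language(matches: List[dict]) -> str:
    """Build a Content-Language value listing every language present in the matches."""
    langs = {match["lang"] for match in matches if match.get("lang")}
    return ", ".join(sorted(langs))


matches = [
    {"id": "Q12", "name": "John Philips", "lang": "en"},
    {"id": "Q23", "name": "Иван Филипов", "lang": "bg"},
]
header_value = content_language(matches)
```

For the two matches above, `header_value` is "bg, en"; the per-string language still has to come from the `lang` fields in the payload.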
@VladimirAlexiev what do you think about @fsteeg's understanding of Content-Language above? If that sounds good to you we'd merge the PR #108.
At the moment, the names and descriptions represented in the reconciliation queries and responses do not come with any language information.
For natively multilingual data sources (such as Wikidata) it would be convenient if this restriction could be lifted. At the moment, this is handled by offering one reconciliation endpoint per language supported by Wikidata (such as https://wikidata.reconci.link/en/api, https://wikidata.reconci.link/it/api, and so on), but it would be nicer to have a single endpoint which would support all languages directly. (It is complicated to teach users to insert the language code in the URL).
The lack of multilingual support was identified by a reviewer from the Ontology Matching 2020 workshop and earlier by @tfmorris in https://github.com/reconciliation-api/specs/pull/48#issuecomment-665802855.
There are other endpoints which also encode some "constants" in the base URL of the reconciliation service. For instance, the OpenCorporates endpoint does this not for languages but for jurisdictions (letting users match against companies from a single country). https://api.opencorporates.com/documentation/Open-Refine-Reconciliation-API
So perhaps the right way to address this is to provide a better way for services to receive global configuration options, which would encompass the use cases of both the Wikidata and OpenCorporates endpoints?