Resolving labels with language negotiation

pietercolpaert commented 5 years ago

Wikidata describes a clear use case for resolving labels for entities and falling back when a language would not be available. See https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries#US_presidents_and_their_spouses

Can we come up with a proposal that standardizes this for all SPARQL query processors instead of having to rely on service-specific "connectors"?

cygri commented 5 years ago

Related prior work: In TopBraid products, there is an extension function ui:label(?resource) that retrieves a label for the resource in a way that we have found practical. It looks up a number of properties, including rdfs:label, skos:prefLabel and their sub-properties, and falls back to using the local name of the URI. It returns the label best matching the user's language preferences, which are communicated outside of the query, in HTTP's Accept-Language header.

There is also a labelInGraph version that looks up the label in a different graph; it is useful to get labels for classes and properties, which usually live in a separate graph in our way of doing things.

It has been argued that using a SPARQL function to look up stuff elsewhere in the graph/dataset is not quite kosher, as functions should always return the same results for the same arguments. It works for us.
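
A minimal Python sketch of the lookup order described above, for readers who want something concrete. It is not TopBraid's implementation: the in-memory `TRIPLES` list, the `LABEL_PROPERTIES` order, and the "any label before the local name" fallback are all assumptions made for illustration, and sub-property inference is left out.

```python
# Hypothetical in-memory data: (subject, predicate, (text, language_tag)).
TRIPLES = [
    ("http://example.org/cat", "skos:prefLabel", ("chat", "fr")),
    ("http://example.org/cat", "rdfs:label", ("cat", "en")),
]
LABEL_PROPERTIES = ["skos:prefLabel", "rdfs:label"]  # assumed lookup set


def lang_matches(tag, lang_range):
    """Prefix match in the style of SPARQL langMatches (basic filtering)."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        return tag != ""
    return tag == lang_range or tag.startswith(lang_range + "-")


def local_name(iri):
    """Part of the IRI after the last '#' or '/'."""
    return iri[max(iri.rfind("#"), iri.rfind("/")) + 1:]


def label(resource, preferred_ranges):
    """Best label for the user's language ranges, else any label, else local name."""
    candidates = [lit for s, p, lit in TRIPLES
                  if s == resource and p in LABEL_PROPERTIES]
    for lang_range in preferred_ranges:
        for text, tag in candidates:
            if lang_matches(tag, lang_range):
                return text
    if candidates:
        return candidates[0][0]  # assumption: any label beats the local name
    return local_name(resource)
```

The language ranges are passed in from outside the query, which mirrors the Accept-Language idea; it is also why such a function is not pure in the strict sense.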

VladimirAlexiev commented 5 years ago

Next someone would want a query to fetch appropriate labels depending on resource type. Ah, that's already happened:

examples: http://vocab.getty.edu/doc/#Resource_Titles
query: http://vocab.getty.edu/doc/queries/#Smart_Resource_Title

joernhees commented 5 years ago

related: rdflib's preferredLabel

i'm not sure though if we aren't actually talking about 2 (or maybe more) issues here (or if they should be solved together):

label predicate preference (e.g., give me a skos:prefLabel if one exists, otherwise a rdfs:label)
literal language (and potential datatype) preference (e.g., give me an @en label, otherwise a @de, otherwise a plain literal (no langtag, no datatype), otherwise any language tagged literal, otherwise a ^^xsd:int, otherwise any datatype)
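
To make the interaction between the two preferences concrete, here is a small Python sketch. It is only an illustration: `PREDICATE_ORDER`, `LANGUAGE_ORDER` and the `predicate_first` switch are hypothetical names, and languages are compared by exact tag rather than by prefix.

```python
PREDICATE_ORDER = ["skos:prefLabel", "rdfs:label"]  # assumed predicate preference
LANGUAGE_ORDER = ["en", "de", ""]                   # "" stands for a plain literal


def rank(order, value):
    """Position in the preference list; unknown values sort last."""
    return order.index(value) if value in order else len(order)


def best_label(candidates, predicate_first=True):
    """Pick one (predicate, text, lang) tuple by a two-part sort key."""
    def key(candidate):
        predicate, _text, lang = candidate
        p = rank(PREDICATE_ORDER, predicate)
        l = rank(LANGUAGE_ORDER, lang)
        return (p, l) if predicate_first else (l, p)
    return min(candidates, key=key) if candidates else None
```

With a German skos:prefLabel and an English rdfs:label, the two orderings give different answers, which is one reason the two issues may need to be specified together even if they are discussed separately.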

kasei commented 5 years ago

The Wikidata approach seems to be doing two things at once:

Accessing label-like properties of a resource
Restricting that label access to just data matching a specific language (or ranked list of preferred languages)

Both of these seem interesting, and might be worthy of discussion independently as well as in combination.

In Kineo, I've worked on a similar issue to (2) above by implementing a solution based on content-negotiation Accept-Language headers. It solves many of the same issues as Wikidata's wikibase:language approach, but operates at the level of an entire request, not just on specific labels.

RickMoynihan commented 5 years ago

I 100% agree that this is awkward in pure SPARQL, though I can't think of a clean way to solve it that retains purity of functions; I agree it's a frustrating and common problem.

I often find we want a label from several resources in a query; each label of which may require a fallback. Typically I handle this by using a construct to construct all types of label, and then filter them / prioritise them in code elsewhere... but then you're adding OPTIONALs/VALUES or over selecting somehow.

ktk commented 5 years ago

Wikidata describes a clear use case for resolving labels for entities and falling back when a language would not be available.

We can probably discuss the way Wikidata implements it in SPARQL, but having a priority/fallback option is absolutely great. In Switzerland four languages are mandatory and having language-support in RDF and SPARQL by default is a fantastic selling point. But I end up doing stuff like this:

    FILTER(LANGMATCHES(LANG(?userLangLabel), "%%LANG%%"))
...
    FILTER(LANGMATCHES(LANG(?defaultLangLabel), "en"))
...
    FILTER(LANGMATCHES(LANG(?emptyLangLabel), ""))
    BIND(COALESCE(?userLangLabel, ?defaultLangLabel, ?emptyLangLabel) AS ?label)

Where I replace "%%LANG%%" before executing the query by either the preferred language of the browser or whatever the webapp offered as language selection. So we definitely need a nicer way to write this kind of query.
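
For illustration, a hedged Python sketch of that substitution step. The query template is an assumption (the parts elided above are not reconstructed, only a plausible two-level fallback is shown), and `fill_language` and `LANG_TAG` are hypothetical names. The language value is validated before substitution because it ends up inside a string literal in the query.

```python
import re

# Simplified language-tag shape: letters first, then alphanumeric subtags.
LANG_TAG = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")

# Assumed template; only the %%LANG%% placeholder comes from the comment above.
QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE {
  ?s a <http://example.org/Thing> .
  OPTIONAL { ?s rdfs:label ?userLangLabel .
             FILTER(LANGMATCHES(LANG(?userLangLabel), "%%LANG%%")) }
  OPTIONAL { ?s rdfs:label ?defaultLangLabel .
             FILTER(LANGMATCHES(LANG(?defaultLangLabel), "en")) }
  BIND(COALESCE(?userLangLabel, ?defaultLangLabel) AS ?label)
}
"""


def fill_language(template, lang):
    """Replace the placeholder, refusing anything that is not tag-shaped."""
    if not LANG_TAG.fullmatch(lang):
        raise ValueError(f"not a language tag: {lang!r}")
    return template.replace("%%LANG%%", lang)
```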

ktk commented 5 years ago

In Kineo, I've worked on a similar issue to (2) above by implementing a solution based on content-negotiation Accept-Language headers.

We implemented something like this as well in a webapp. It would be great if SPARQL could take that into consideration but it should still be possible to override this, as many times the language in the HTTP header is not necessarily the one that should be displayed.

kasei commented 5 years ago

We implemented something like this as well in a webapp. It would be great if SPARQL could take that into consideration but it should still be possible to override this, as many times the language in the HTTP header is not necessarily the one that should be displayed.

Agreed. That seems to align with common implementation approaches to content-negotiation in general, where there's often a way to override the Accept header with a query parameter or filename extension in the request URL. The key point here is that this is all expressed at the protocol level and not in the query string as lots of FILTER and LANGMATCHES.
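
A rough Python sketch of that override pattern, assuming a hypothetical `lang` query parameter (no SPARQL Protocol document defines one): the parameter wins when present, otherwise the Accept-Language header is used, otherwise anything is accepted.

```python
from urllib.parse import urlsplit, parse_qs


def effective_language_preferences(request_url, headers, param="lang"):
    """Return the raw preference string that should drive label selection."""
    params = parse_qs(urlsplit(request_url).query)
    if params.get(param):
        return params[param][0]  # explicit override in the request URL
    for name, value in headers.items():
        if name.lower() == "accept-language":  # header names are case-insensitive
            return value
    return "*"  # assumption: no stated preference means any language
```

The query text stays the same in all three cases; only the request changes.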

VladimirAlexiev commented 4 years ago

For a future iteration of the Ontotext Platform I specified something that borrows features from Accept-Language, langMatches, SHACL sh:inLanguage, and SHEX @ features (including exact, prefix matching, exclusions). By default it gets one label (but see the last example). eg

en,en~: pure English, then any English dialect
en~,-en-US: any English, as long as it's not Trump's language (Orange Man in a White House-ish)
NONE,en: first no lang, then English
BROWSER,en~,~,ANY: first user's Accept-Language prefs, then any English, then any non-empty lang, then any lang whatsoever
ALL:en~,fr~: all English or French labels, or their dialects

I hope I haven't overcooked it.
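
One possible reading of these patterns, as a Python sketch. This is not the Ontotext Platform implementation; `select_labels` and `term_matches` are hypothetical names, exclusions are assumed to apply to the whole pattern, and `BROWSER` is assumed to expand to a list of terms supplied by the caller.

```python
def lang_matches(tag, lang_range):
    """Prefix match in the style of SPARQL langMatches."""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def term_matches(term, tag):
    if term == "ANY":
        return True                        # any label, with or without a language
    if term == "NONE":
        return tag == ""                   # no language tag
    if term == "~":
        return tag != ""                   # any non-empty language tag
    if term.endswith("~"):
        return lang_matches(tag, term[:-1])  # the language or one of its dialects
    return tag.lower() == term.lower()     # exact match


def select_labels(pattern, labels, browser=()):
    """labels: list of (text, tag). Returns one label, or all matches for 'ALL:'."""
    want_all = pattern.startswith("ALL:")
    if want_all:
        pattern = pattern[len("ALL:"):]
    terms = []
    for term in pattern.split(","):
        term = term.strip()
        terms.extend(browser if term == "BROWSER" else [term])
    exclusions = [t[1:] for t in terms if t.startswith("-")]
    preferences = [t for t in terms if not t.startswith("-")]
    allowed = [(text, tag) for text, tag in labels
               if not any(lang_matches(tag, ex) for ex in exclusions)]
    result = []
    for term in preferences:
        for candidate in allowed:
            if term_matches(term, candidate[1]) and candidate not in result:
                result.append(candidate)
                if not want_all:
                    return result
    return result
```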

ericprud commented 4 years ago

Some links for those ShEx features:

LanguageStem spec: http://shex.io/shex-semantics/index.html#values
primer example: http://shex.io/shex-primer/#sec-iri-range
semantic tests: https://github.com/shexSpec/shexTest/blob/master/validation/manifest.ttl#L10348-L11490

[Edited because my comment didn't make sense following @VladimirAlexiev's]

kasei commented 4 years ago

I've been thinking more about some of this recently. I would really like to see alignment around being able to bind a label for a graph node 1) relatively concisely and 2) without worrying about affecting the query cardinality. This seems like a big part of the Wikidata feature, and I think it's important to try to find a solution to this that doesn't require Wikidata's use of magic variables and SERVICE blocks. A strawman solution I've been thinking about would be a new built-in function (say, LABEL) that would take a node as an argument, and return a label for that node, if available. The function could use an implementation-defined label predicate, but allow the predicate to be overridden (similar to rdflib's preferredLabel). It could also take an optional language tag pattern parameter (which could likewise be implementation-defined).

SELECT (LABEL(?x) AS ?xLabel)
SELECT (LABEL(?x, foaf:name) AS ?xLabel)
SELECT (LABEL(?x, LANG="en,fr") AS ?xLabel)

Some notes about these examples:

There's potential interaction with #64 here if we wanted the predicate and language tag pattern to be named arguments
This still requires the AS syntax which is more verbose than the wikidata magic-variable approach (but solving that should probably be left to an orthogonal grammar issue)
Leaving the preferred language to be implementation-defined would allow flexibility including:
- Continuing to use extensions like the Wikidata magic SERVICE blocks
- Using Accept-Language headers
- Being dataset-dependent
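
To illustrate the cardinality point, a small Python sketch comparing a label function with a plain join. Everything here is hypothetical (the `LABELS` table, the default predicate, the `"en,fr"` default); it only shows that a function returning at most one value keeps one output row per input row, whereas joining against the label triples multiplies or drops rows.

```python
# Hypothetical label data keyed by (node, predicate).
LABELS = {
    ("ex:cat", "rdfs:label"): [("cat", "en"), ("chat", "fr")],
}


def label(node, predicate="rdfs:label", lang="en,fr"):
    """Return at most one label text, or None when nothing matches."""
    candidates = LABELS.get((node, predicate), [])
    for wanted in lang.split(","):
        for text, tag in candidates:
            if tag == wanted.strip():  # exact tag match, for brevity
                return text
    return None


def project_with_label(rows):
    """Like SELECT (LABEL(?x) AS ?xLabel): one output row per input row."""
    return [dict(row, xLabel=label(row["x"])) for row in rows]


def join_with_labels(rows):
    """Like adding '?x rdfs:label ?xLabel' to the pattern: rows fan out or vanish."""
    return [dict(row, xLabel=text)
            for row in rows
            for text, _tag in LABELS.get((row["x"], "rdfs:label"), [])]
```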

ktk commented 4 years ago

@kasei I like the syntax & idea but:

we should be very careful with defaults. Adding more "implementation-defined" magic is IMO something we should not do in SPARQL 1.2; that already bites us too often in 1.1. I work with many different stores on a regular basis and this is a great nuisance.
IMO the only label everyone agrees on is rdfs:label; things like skos:prefLabel or schema:name are explicitly sub-properties of it. One might be tempted to add "smart" default decisions but that depends so much on the dataset that it might bite the user badly. I've seen mixes of all those labels in the wild already. Also skos:altLabel & skos:hiddenLabel are sub-properties of rdfs:label, and a label like skos:hiddenLabel should explicitly not be used according to its semantics. So simply saying "choose rdfs:label or any sub-property of it" would be a bad idea.
We should IMO either say default is rdfs:label and for everything else it needs to be specified or even not do a default at all.

JervenBolleman commented 4 years ago

I think the syntax should be available in the WHERE part. Yet the logic requirements are more in line with @VladimirAlexiev's suggestion.

FILTER-like constructs don't work on the execution side, as we end up needing to do the equivalent of:

SELECT ?label
WHERE
{
  OPTIONAL {
   [] rdfs:label ?label .
   FILTER langMatches( lang(?label), "FR" )
  }
  OPTIONAL {
   FILTER(! BOUND(?label))
   [] rdfs:label ?label .
   FILTER langMatches( lang(?label), "EN" )
  }
  OPTIONAL {
   FILTER(! BOUND(?label))
   [] rdfs:label ?label .
   FILTER langMatches( lang(?label), "NL" )
  }
  FILTER(BOUND(?label))
}

Which in a generalizable syntax would be something like:

SELECT ?label 
WHERE
{
  [] rdfs:label __PREFER__(?label FROM langMatches(lang(_), "FR"),
                                       langMatches(lang(_), "EN"), 
                                       langMatches(lang(_), "NL")) .
}

The idea being that __PREFER__ would also work for other functions, not just languages. As __PREFER__ should accept anything that gives an Effective Boolean Value, we should be able to do things like

SELECT ?label 
WHERE
{
  [] rdfs:label __PREFER__(?label FROM (langMatches(lang(_), "FR") && strlen(_) >10),
                                       langMatches(lang(_), "EN"),
                                       langMatches(lang(_), "NL"), 
                                       datatype(_) = xsd:int) .
}

The _ placeholder and the use of FROM instead of AS need serious thought, as does the general impact on the engines. A shorter form of langMatches(lang) would of course also be nice in these cases.

__PREFER__ is somewhat similar to COALESCE. Yet, I don't think that COALESCE can provide the required logic, as in many cases more than one 'label' returns.
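
A small Python sketch of the difference, under one reading of the idea: the conditions are tried in order, and all candidates satisfying the first condition that matches anything are returned, so the result can contain more than one label. `prefer` and `lang_is` are hypothetical helper names, not part of any proposal text.

```python
def lang_matches(tag, lang_range):
    """Prefix match in the style of SPARQL langMatches."""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def lang_is(lang_range):
    """Build a condition on (text, tag) labels, like langMatches(lang(_), range)."""
    return lambda label: lang_matches(label[1], lang_range)


def prefer(candidates, *conditions):
    """Return every candidate matching the first condition that matches at all."""
    for condition in conditions:
        matched = [c for c in candidates if condition(c)]
        if matched:
            return matched
    return []
```

For example, with two French labels and one English label, `prefer(labels, lang_is("fr"), lang_is("en"), lang_is("nl"))` returns both French labels, whereas COALESCE over pre-bound variables would yield a single value.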

edit: PREFER is a suggestion and that was made bold by markdown. The '__' is to reiterate that this is an idea that is more about execution logic in the engines than a specific suggestion for a keyword.

kasei commented 4 years ago

@ktk @JervenBolleman I think having a general purpose operator such as the __PREFER__ one you show could be valuable.

However, I think there's also a competing desire to have a concise syntax to do the specific job of selecting labels. In proposing the select-expression-based LABEL function above, I was trying to think about solutions that might bring implementors/users such as wikidata back into the fold of being standards compliant. While I don't think any standards-based approach is going to be able to match the conciseness of the wikidata ?varLabel approach (because it uses magic variables that just magically get instantiated), I think it's valuable to consider what might be close enough to entice a transition towards a standards-compliant solution.

I agree there's risk involved in any feature that is entirely or in part implementation or installation defined. This is another case where I'm trying to see things from the perspective of endpoint/dataset maintainers who might feel too constrained by an rdfs:label-only solution. I think trying to explore possible solutions, even controversial ones, is important here; a decision between those solutions would probably be something for a future WG to decide.

JervenBolleman commented 4 years ago

@kasei I think if we have a feature like __PREFER__ then we can have a shorthand syntax for it that is like what you propose. My intuition is that it is easier to go from a general solution to a specific one than the other way around.

e.g. something in the shape of the example below.

SELECT ?label 
WHERE
{
  [] rdfs:label __PREFLANG__(?label, "FR", "EN", "NL", <urn:sparql:spec:accept-lang>).
}

We can also write up both proposals and see what people think is worth implementing.

ktk commented 4 years ago

We can also write up both proposals and see what people think is worth implementing.

That would be a good idea IMO. We might try to see if we could implement it in https://comunica.dev/ as well to play with it.

VladimirAlexiev commented 4 years ago

@JervenBolleman is that what you call concise? Splitting the args into a list, using a urn instead of a keyword for Accept-Language, and what of langMatches? these are not very concise

VladimirAlexiev commented 4 years ago

I want to point out to everyone that the semantics of Accept-Language is a bit peculiar:

it always uses langMatches (I.e. "substring" match), there is no way to specify exact match
preference is specified by q values... I'm not sure the order of lang tags in the header has any impact
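
For reference, a simplified Python sketch of turning an Accept-Language value into an ordered list of language ranges. It sorts by q value (default 1.0), drops ranges with q=0, and keeps header order only as a tie-breaker; that tie-breaking rule is an assumption of this sketch, and `parse_accept_language` is a hypothetical name.

```python
def parse_accept_language(header):
    """Return language ranges ordered from most to least preferred."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        lang_range = parts[0]
        if not lang_range:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        if q > 0:
            prefs.append((-q, position, lang_range))
    return [lang_range for _q, _pos, lang_range in sorted(prefs)]
```

Each resulting range would then be applied with prefix matching, in the manner of langMatches.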

JervenBolleman commented 4 years ago

I started on a draft SEP for a PREFER/PREFERLANG option. https://github.com/JervenBolleman/sparql-12/blob/SEP_PREFER/SEP/SEP_PREFER/sep_prefer.md, comments and pr's welcome.

lisp commented 4 years ago

why would it not be sufficient to use an accept-like syntax to express the intent?

JervenBolleman commented 4 years ago

why would it not be sufficient to use an accept-like syntax to express the intent?

Would you mind giving an example of what you are thinking of? I think I understand what you suggest but would rather be sure.

lisp commented 4 years ago

the "Accept-Language" header has been mentioned in several entries, above, either as the actual source of a selection specification or as a pattern to be followed by or incorporated into some other means to express those criteria.

the question is, which of the described use cases require some result beyond that which would be possible by adding an isolated filter operator which interpreted such a pattern and, of those cases, do they justify introducing additional language forms and/or changing the execution model?

noted, from the discussion, above are

dependency on statement predicates
dependency on subject node type (which extends to a term joined through any predicate)
a need to interpret an Accept-Language header in some way other than the standard logic
result cardinality variants (single or multiple satisfying solutions)

of these, the first two point to deficiencies in sparql's ability to abstract. to solve that issue needs more than a special case operator addressed to this specific issue. it should be possible to cover the last two requirements in the filter predicate definition and/or the rules for its application.

kasei commented 4 years ago

@lisp i think there are several things being discussed here. But from my point of view, your list misses the point of a desire to see a concise syntax for some of these use cases. Mostly we can solve the issue with large queries using filters, aggregation, and select expressions, but that’s exactly what’s driven implementations like wikidata away from the spec.

a need to interpret an Accept-Language header in some way other than the standard logic

As for this, there is currently no standard logic relating to language conneg and SPARQL. I worry that existing suggestions to have a syntax for this run counter to existing options at the Protocol level.

namedgraph commented 4 years ago

@JervenBolleman I don't like ACCEPTLANG. I think it should be a URI rather than keyword, in a similar fashion as collations (e.g. http://www.w3.org/2005/xpath-functions/collation/codepoint) and the union graph #59 (e.g. urn:x-arq:UnionGraph), allowing future extensibility.

Speaking of collations, should we consider them here as well? XPath and XQuery Functions and Operators 3.1 has already defined a lot of stuff like functions, I think would make sense to reuse as much as possible.

A collation is a specification of the manner in which ·strings· are compared and, by extension, ordered.

lisp commented 4 years ago

@lisp i think there are several things being discussed here. But from my point of view, your list misses the point of a desire to see a concise syntax for some of these use cases. Mostly we can solve the issue with large queries using filters, aggregation, and select expressions, but that’s exactly what’s driven implementations like wikidata away from the spec.

the list was intended to cover those "expressiveness" requirements which i could distill from the discussion. if it is not complete, please extend it. this discussion would do with a very concise statement of the issue. while it was not intended to overlook the requirement, my reference to its elements calls into question whether the evolution of sparql is best served by trying to meet those requirements in a single operator intended to address this issue.

with respect to which question i take as a red flag bolleman's allusion to additional arguments to the function which have nothing to do with language tags.

a need to interpret an Accept-Language header in some way other than the standard logic

... existing options at the Protocol level.

those options are what "standard logic" was meant to name.

semanticfire commented 4 years ago

I started on a draft SEP for a PREFER/PREFERLANG option. https://github.com/JervenBolleman/sparql-12/blob/SEP_PREFER/SEP/SEP_PREFER/sep_prefer.md, comments and pr's welcome.

Will the actual result variable still contain the language selected in the end?

JervenBolleman commented 4 years ago

@semanticfire in my draft the selected variable would be a lang string and would therefore still have its language attached. In other words, my PREFER suggestion does not transform the variable.

Edit: added a section regarding this to my draft.

JervenBolleman commented 4 years ago

@namedgraph @VladimirAlexiev There are pros and cons to using an IRI over a keyword. I instinctively grabbed for an IRI, but @VladimirAlexiev does have a point regarding the ease of a keyword.

@namedgraph Collations and other issues of that kind are, I think, best added to issue #88.

JervenBolleman commented 4 years ago

the question is, which of the described use cases require some result beyond that which would be possible by adding an isolated filter operator which interpreted such a pattern and, of those cases, do they justify introducing additional language forms and/or changing the execution model?

@lisp, I think the isolated "select" operator as suggested by @kasei (LABEL) is the closest to that minimal suggestion. It only adds a second pass over the execution with relatively simple queries. I think this is worth writing up in more detail to see what a LABEL solution would look like. Yet, as it does add a second pass over the execution, it actually does change the way that query engines need to operate, while PREFER is (I think so, and PREFERLANG I am sure about) rewritable to SPARQL 1.1.

My experience tells me that only looking at the HTTP Accept-Language header is not sufficient, because then there is no way to select a preference or override such a header. From the Swiss reality it would be something like: select from the values the first official federal language available, followed by English (with English acceptable only for convenience, as it has no legal standing). Very often, even in Switzerland, such translations are incomplete and not all strings will be available in 5 languages. How to best proceed in such cases depends on local "tradition" and expediency, best encoded in the query layer.

I think PREFER has a clearer more obvious complexity, yet it can also be used outside of the language selection problem. E.g. for stock picking, prefer whole milk over skimmed milk but either will do if needed or prefer a 2x4 over 3x6 but prefer from the shed instead of going to the hardware store.

lisp commented 4 years ago

My experience tells me that only looking at the http Accept-Language header is not sufficient

the question concerned whether an expression of a form analogous to an accept-language header had sufficient capacity to capture constraints sufficient to resolve this issue. matters of the runtime environment, such as whether the expression originated from a header and whether the expression would contribute to a more complex logical combination, are orthogonal.

lisp commented 4 years ago

the ... "select" operator as suggested by @kasei LABEL ... does add a second pass

while it is clear that any operator which follows a "negotiation" paradigm constitutes an aggregation pattern, it was not clear from the description that the "SELECT ... LABEL" proposal required two passes. so long as the expression appears in the select clause, rather than as a filter component, there should be no problem. one need just clarify several issues:

(how) does it combine with other aggregation operators
is the solution cardinality single or multiple
is the predicate set drawn from the environment or passed as an argument

Tpt commented 4 years ago

I have one concern with the current PR #120: on the syntax level it adds operators inside of the basic graph patterns, breaking their simplicity.

Instead of having the PREFER and PREFERLANG operators inside of the BGPs, what about making them aggregate functions instead?

SELECT ?label 
WHERE
{
  [] rdfs:label PREFER(?label FROM langMatches(lang(?label), "FR"),
                         langMatches(lang(?label), "EN"), 
                         langMatches(lang(?label), "NL")) .
}

would be written instead

SELECT (PREFER(?l, langMatches(lang(?l), "FR"),
                   langMatches(lang(?l), "EN"),
                   langMatches(lang(?l), "NL")) AS ?label) WHERE
{
  ?s rdfs:label ?l .
} GROUP BY ?s

PREFER would work just like SAMPLE but instead of returning any value, it would return a "best" one. This syntax has also the advantage of allowing to apply the PREFER logic on data generated from expressions or coming from different triple patterns.
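
A minimal Python sketch of PREFER as an aggregate, assuming language ranges as the only kind of condition and assuming the aggregate is unbound (`None`) when nothing matches; the proposal itself does not settle that case. `group_prefer` and `prefer_aggregate` are hypothetical names.

```python
from collections import defaultdict


def lang_matches(tag, lang_range):
    """Prefix match in the style of SPARQL langMatches."""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def prefer_aggregate(values, ranges):
    """Like SAMPLE, but returns the value matching the earliest range."""
    for lang_range in ranges:
        for text, tag in values:
            if lang_matches(tag, lang_range):
                return (text, tag)
    return None  # assumption: unbound when no range matches


def group_prefer(rows, ranges):
    """rows: (subject, (text, tag)) pairs; emulates GROUP BY ?s with PREFER."""
    groups = defaultdict(list)
    for subject, label in rows:
        groups[subject].append(label)
    return {s: prefer_aggregate(labels, ranges) for s, labels in groups.items()}
```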

namedgraph commented 4 years ago

@Tpt this makes much more sense!

kasei commented 4 years ago

I think it would be fine to define the semantics of this operator in terms of aggregation (though that wouldn't be necessary), but this solution is moving farther and farther away from a concise syntax, and towards solutions you can already write with a complex combination of existing operators (I think the query above essentially reduces to several BIND(LANGMATCHES(·))s, and a projection of COALESCE with internal IF conditionals). While simple examples (like the one above) are relatively straightforward, ensuring you can express more complex queries in terms of aggregation will result in much more complex syntax.

namedgraph commented 4 years ago

IMO consistency should be a priority over conciseness.

Tpt commented 4 years ago

@kasei I definitely agree that my proposal is slightly less concise (you need to add a GROUP BY ?s at the end of the query and to do a projection) but, however, I am not convinced that it will make complex queries much more complex. Do you have a practical example in mind? My impression (I have not checked it's formally true) is that, except negation cases, a query without aggregate:

SELECT ... WHERE { ... ?s ?p PREFER(?label, ...) ... }

could be rewritten if we don't care about duplicated result rows

SELECT ... (PREFER(?o, ...) AS ?label) WHERE { ... ?s ?p ?o ... } GROUP BY ?s ?p

There might be some need for subqueries with my proposal in case of other aggregates but I don't know how much it will matter in practice.

semanticfire commented 4 years ago

If you would go the aggregate function route isn't that increasing the amount of data in the intermediate resultsets? if you have a lot of languages like wikidata and a large resultset, your intermediate set will be huge. ( this could be optimized out on a single endpoint, but might be harder in federated situations )

Tpt commented 4 years ago

If you would go the aggregate function route isn't that increasing the amount of data in the intermediate resultsets? if you have a lot of languages like wikidata and a large resultset, your intermediate set will be huge. ( this could be optimized out on a single endpoint, but might be harder in federated situations )

It's a great point indeed. I believe you could push also PREFER into most federated queries because (with a bad pseudo-notation) PREFER(a, b) = PREFER(PREFER(a), PREFER(b)). But it requires good optimizers indeed.
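
The identity can be checked on a toy example. In this Python sketch (hypothetical names, language ranges as the only conditions), each endpoint reduces its own candidates to one best label and the coordinator applies the same function to the partial results.

```python
def lang_matches(tag, lang_range):
    """Prefix match in the style of SPARQL langMatches."""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def rank(label, ranges):
    """Index of the first matching range; len(ranges) when none matches."""
    for index, lang_range in enumerate(ranges):
        if lang_matches(label[1], lang_range):
            return index
    return len(ranges)


def prefer(labels, ranges):
    """Best matching label, ignoring missing (None) partial results."""
    matching = [l for l in labels
                if l is not None and rank(l, ranges) < len(ranges)]
    return min(matching, key=lambda l: rank(l, ranges), default=None)


def federated_prefer(endpoints, ranges):
    """Push PREFER down to each endpoint, then combine the partial results."""
    return prefer([prefer(labels, ranges) for labels in endpoints], ranges)
```

Only one label per endpoint crosses the wire in `federated_prefer`, which is the saving being discussed.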

kasei commented 4 years ago

If you would go the aggregate function route isn't that increasing the amount of data in the intermediate resultsets? if you have a lot of languages like wikidata and a large resultset, your intermediate set will be huge. ( this could be optimized out on a single endpoint, but might be harder in federated situations )

Yes, the intermediate result size would be larger. If you have many languages and many different strings you'd like to apply PREFER to, it could be much larger, as you would be producing the Cartesian product across the various label variables within each group.

There might be some need for subqueries with my proposal in case of other aggregates but I don't know how much it will matter in practice.

I think this would probably impact lots of use cases. Any time there's an aggregation at the same projection level where you want label(s), you'd have to be concerned with affecting the results. Here's an example from wikidata:

SELECT ?author ?authorLabel (COUNT(?paper) AS ?count)
WHERE
{
    ?article  schema:about ?author ;
        schema:isPartOf <https://species.wikimedia.org/> .
    ?author wdt:P31 wd:Q5.
    ?paper wdt:P50 ?author.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
GROUP BY ?author ?authorLabel
ORDER BY DESC(?count)
LIMIT 200

In this case you only have one label variable, but the count would be wrong if you applied only a single aggregation step. To make this equivalent, you'd have to apply PREFER aggregation first, and then apply the COUNT aggregation second.
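
A toy Python illustration of the count problem, using made-up data (one author, two papers, three labels). It only demonstrates the row fan-out; exact tag matching is used for brevity and all names are hypothetical.

```python
PAPERS = {"ex:alice": ["ex:p1", "ex:p2"]}
LABELS = {"ex:alice": [("Alice", "en"), ("Alicia", "es"), ("Alix", "fr")]}


def joined_rows():
    """Rows produced when papers and raw labels are matched in one pattern."""
    return [(author, paper, label)
            for author, papers in PAPERS.items()
            for paper in papers
            for label in LABELS[author]]


def single_step_count():
    """COUNT(?paper) over the joined rows: inflated by the label fan-out."""
    counts = {}
    for author, _paper, _label in joined_rows():
        counts[author] = counts.get(author, 0) + 1
    return counts


def two_step(ranges):
    """Pick one label per author first, then count that author's papers."""
    result = {}
    for author, papers in PAPERS.items():
        best = None
        for wanted in ranges:
            best = next((l for l in LABELS[author] if l[1] == wanted), None)
            if best is not None:
                break
        result[author] = (best, len(papers))
    return result
```

Here the single step reports 6 papers where the two-step version reports 2.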

ericprud commented 4 years ago

I think typically, it would blow up by the average number of language candidates.

Like most aggregates, you can optimize this one by evaluating the PREFER each time you would have entered another row with the same GROUP BY key. (Some, like AVG, are trickier 'cause you have to index more state with the GROUP BY key.)

kasei commented 4 years ago

@ericprud wouldn't it be increased by each new ?fooLabel variable in the projection? Looking at a subset of organic wikidata queries (which motivates my thinking on this issue) from one of the 2018 log dumps, ~20–30% of queries use more than one such label variable (most of these use 2, but some use 10, 20, even up to 40!).

ericprud commented 4 years ago

@kasei, Good point, I was only imagining one ?fooLabel. Without optimization, your intermediate table would have rows for the cartesian product of the label candidates, which is as explosive as applying unifying filters at the end of a BGP. That said, I think the optimization works fine regardless of the number of labels.

Query:

SELECT ?predator
  (PREFER(langMatches(lang(?predLabel), "FR"),
          langMatches(lang(?predLabel), "EN"), 
          langMatches(lang(?predLabel), "NL")) AS ?chases)
  ?prey
  (PREFER(langMatches(lang(?preyLabel), "FR"),
          langMatches(lang(?preyLabel), "EN"), 
          langMatches(lang(?preyLabel), "NL")) AS ?flees)
 WHERE
{
  ?predator rdfs:label ?predLabel .
  ?predator <chases> ?prey .
  ?prey rdfs:label ?preyLabel .
}
GROUP BY ?predator ?prey

Data:

<dog> <chases> <cat> .
<cat> <chases> <mouse> .
<dog> rdfs:label "dog"@en, "hond"@nl, "perro"@es.
<cat> rdfs:label "chat"@fr, "cat"@en, "gato"@es.
<mouse> rdfs:label "muis"@nl, "raton"@es.

Results:

┌───────────┬────────────┬─────────┬────────────┐
│ ?predator │ ?chases    │ ?prey   │ ?flees     │
│     <cat> │  "chat"@fr │ <mouse> │  "muis"@nl │
│     <dog> │   "dog"@en │   <cat> │  "chat"@fr │
└───────────┴────────────┴─────────┴────────────┘

Without optimization, we would get an intermediate table with each of the cross products of labels (9 for dog chases cat, and 6 for cat chases mouse) (here in a random order):

   ┌───────────┬────────────┬─────────┬────────────┐
   │ ?predator │ ?predLabel │ ?prey   │ ?preyLabel │
 1 │     <dog> │ "perro"@es │   <cat> │  "chat"@fr │
 2 │     <dog> │ "perro"@es │   <cat> │   "cat"@en │
 3 │     <dog> │ "perro"@es │   <cat> │  "gato"@es │
 4 │     <cat> │  "chat"@fr │ <mouse> │  "muis"@nl │
 5 │     <cat> │  "chat"@fr │ <mouse> │ "raton"@es │
 6 │     <cat> │  "gato"@es │ <mouse> │  "muis"@nl │
 7 │     <cat> │  "gato"@es │ <mouse> │ "raton"@es │
 8 │     <dog> │   "dog"@en │   <cat> │  "chat"@fr │
 9 │     <dog> │   "dog"@en │   <cat> │   "cat"@en │
10 │     <dog> │   "dog"@en │   <cat> │  "gato"@es │
11 │     <dog> │  "hond"@nl │   <cat> │  "chat"@fr │
12 │     <dog> │  "hond"@nl │   <cat> │   "cat"@en │
13 │     <dog> │  "hond"@nl │   <cat> │  "gato"@es │
14 │     <cat> │   "cat"@en │ <mouse> │  "muis"@nl │
15 │     <cat> │   "cat"@en │ <mouse> │ "raton"@es │
   └───────────┴────────────┴─────────┴────────────┘

but we can keep it down to two rows by knowing that PREFER is an aggregate function which will produce one row for any combo of ?predator, ?prey. Procedurally, we'd add row 1. Our intermediate table has one row, and an auxiliary index says that <dog>/<cat> is row 1. When adding row 2 ("perro"/"cat"), we'd evaluate the prefs and see that FR is preferred over EN, so we'd keep the existing row 1 ("perro"/"chat") rather than replace it. Etc.

The proposal doesn't tell us how to balance sub-optimal preferences between e.g. "dog"/"cat" vs "hond"/"chat". I would default to giving precedence in the order mentioned in the SELECT, so ?predLabel wins over ?preyLabel ("dog"/"cat").

lisp commented 4 years ago

is this

The proposal doesn't tell us how to balance sub-optimal preferences between e.g. "dog"/"cat" vs "hond"/"chat". I would default to giving precedence in the order mentioned in the SELECT, so ?predLabel wins over ?preyLabel ("dog"/"cat").

meant to describe that constraints are to be propagated from one prefer form to another? by analogy to successive select bindings, the effect should be readily achievable.

even in that case, it is not clear why a concern about space requirements arises. what is it about the proposed operation which requires other than the standard approaches to implementing aggregates in constant space? there is nothing which requires it to operate over a set domain, nothing which depends on order, and nothing which depends on any relation between individual solutions.

Tpt commented 4 years ago

@ericprud @kasei Thank you for raising this performance concern. Indeed PREFER used as an aggregate function does only make sense for SPARQL evaluators able to optimize. But I hope that the SPARQL engines used for very large datasets like Wikidata are able to implement such optimizations so I am not sure if it is a huge problem in practice.

The proposal doesn't tell us how to balance sub-optimal preferences between e.g. "dog"/"cat" vs "hond"/"chat". I would default to giving precedence in the order mentioned in the SELECT, so ?predLabel wins over ?preyLabel ("dog"/"cat").

Sorry, I am not sure I understand the problem here. Aren't aggregate functions supposed to be evaluated independently and, so, return "dog"/"chat" from the two tuples "dog"/"cat" and "hond"/"chat"? The same problem exists in SPARQL 1.1 queries with two MIN/MAX.

ericprud commented 4 years ago

@lisp, @Tpt, oops, sorry, you're right. I was thinking that the aggregate evaluator (optimized or not) would have to pick the best row for any group-by key (here: <dog>/<cat>) in the cartesian product. If that were the case, it would at some point have to pick between:

9 │ <dog> │ "dog"@en │ <cat> │ "cat"@en │

and

11 │ <dog> │ "hond"@nl │ <cat> │ "chat"@fr │

where en is preferred over nl but fr is preferred over en.

But the semantics don't have to be that fussy. Instead it can just look at each column individually and pick row 9's "dog" and row 11's "chat". In short, mea culpa, not an issue.
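The per-column reading described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the semantics of any actual PREFER proposal: the function names, the tuple layout, and the single preference list fr > en > nl are assumptions made for the example. It keeps one slot per label column for each ?predator/?prey key, so the state per group stays constant no matter how many rows of the cartesian product stream past.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# A label is a (lexical form, language tag) pair; "" would mean "no language tag".
Label = Tuple[str, str]
# One row of the joined result: predator IRI, its label, prey IRI, its label.
Row = Tuple[str, Label, str, Label]


def rank(label: Label, prefs: List[str]) -> int:
    """Position of the label's language in the preference list (lower is better)."""
    lang = label[1].lower()
    return prefs.index(lang) if lang in prefs else len(prefs)


def prefer_per_column(
    rows: Iterable[Row], prefs: List[str]
) -> Dict[Tuple[str, str], List[Optional[Label]]]:
    """Hypothetical per-column PREFER: each label column is aggregated on its own."""
    best: Dict[Tuple[str, str], List[Optional[Label]]] = {}
    for predator, pred_label, prey, prey_label in rows:
        # Two slots per group key: best predator label, best prey label.
        slots = best.setdefault((predator, prey), [None, None])
        for i, cand in enumerate((pred_label, prey_label)):
            if slots[i] is None or rank(cand, prefs) < rank(slots[i], prefs):
                slots[i] = cand
    return best


rows = [
    ("<dog>", ("dog", "en"), "<cat>", ("cat", "en")),    # row 9 above
    ("<dog>", ("hond", "nl"), "<cat>", ("chat", "fr")),  # row 11 above
]
result = prefer_per_column(rows, ["fr", "en", "nl"])
print(result[("<dog>", "<cat>")])  # [('dog', 'en'), ('chat', 'fr')]
```

Because the two columns never look at each other, there is no need to choose between row 9 and row 11 as a whole, which is the same behaviour two independent MIN/MAX aggregates would show.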

VladimirAlexiev commented 2 years ago

@JervenBolleman I looked at #120 and IMHO it doesn't address the following:

is the idea for PREFER to operate on longhand (Boolean function invocations), or on some shorthand syntax?
ACCEPTLANG is not a lang tag but a q-ordered list of preferences. Will this be translated magically to longhand?
Give more examples of a variety of matching modes. Below, each example is written as shorthand: longhand.
- exact vs prefix match, eg en: lcase(lang(?label))="en" vs en~: langMatches(lang(?label),"en")
- prefix match but excluding a lang: en~,-en-US: langMatches(lang(?label),"en") && !langMatches(lang(?label),"en-US")
- no lang; any lang (here // indicates some pseudo-COALESCE): NONE,en~,ANY: lang(?label)="" // langMatches(lang(?label),"en") // lang(?label)!=""
Get one (first by preference) vs get many (but not all available)
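To make the matching modes above concrete, here is a rough Python sketch of how such a shorthand could be evaluated in "get one, first by preference" mode. The shorthand itself is only what is written in the list above; how commas and exclusions combine is an assumption of this sketch (terms starting with "-" are treated as filters applied to every alternative, all other terms as ordered fallbacks), and `lang_matches`, `term_matches` and `pick` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

Label = Tuple[str, str]  # (lexical form, language tag); "" means no tag


def lang_matches(tag: str, rng: str) -> bool:
    """Prefix match on subtag boundaries, in the spirit of SPARQL's langMatches."""
    tag, rng = tag.lower(), rng.lower()
    if rng == "*":
        return tag != ""
    return tag == rng or tag.startswith(rng + "-")


def term_matches(tag: str, term: str) -> bool:
    """Evaluate one positive shorthand term against a language tag."""
    if term == "NONE":              # no language tag at all
        return tag == ""
    if term == "ANY":               # any language tag
        return tag != ""
    if term.endswith("~"):          # prefix match, e.g. en~ matches en-GB
        return lang_matches(tag, term[:-1])
    return tag.lower() == term.lower()  # exact match


def pick(labels: List[Label], shorthand: str) -> Optional[Label]:
    """Return the first label matching the earliest alternative, or None."""
    terms = [t.strip() for t in shorthand.split(",")]
    # Assumption: "-xx" terms exclude a language from all alternatives.
    excluded = [t[1:] for t in terms if t.startswith("-")]
    allowed = [l for l in labels if not any(lang_matches(l[1], x) for x in excluded)]
    for term in terms:
        if term.startswith("-"):
            continue
        for label in allowed:
            if term_matches(label[1], term):
                return label
    return None


labels = [("Notepad", "en-US"), ("Notepad (UK)", "en-GB"), ("Notizblock", "de"), ("Bloc", "")]
print(pick(labels, "en"))            # None: no label is tagged exactly "en"
print(pick(labels, "en~,-en-US"))    # ('Notepad (UK)', 'en-GB')
print(pick(labels, "NONE,en~,ANY"))  # ('Bloc', '')
```

A "get many" mode would collect every label matching the first alternative that has any match instead of returning at the first hit.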

Aklakan commented 1 year ago

Here is a more sophisticated use case: selecting a label by scoring both the property IRI and the language tag. This is not (easily?) possible to achieve with PREFER, but it is with LATERAL. For reference, here is an example. It is not pretty, but it is tested to work with Jena 4.8.0. You can change the scores in the VALUES block to play around.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX : <http://www.example.org/>

:notepad rdfs:label
  "rdfs-Notepad"@en ,
  "rdfs-Notizblock"@de ,
  "rdfs-Bloc" .

:notepad doap:name
  "doap-Notepad"@en ,
  "doap-Notizblock"@de ,
  "doap-Bloc" .

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX doap: <http://usefulinc.com/ns/doap#>

SELECT ?s ?label {
  { SELECT DISTINCT ?s { ?s ?p ?o } }
  LATERAL {
    OPTIONAL {
      { SELECT ?s ?candLabel {
        VALUES (?iri ?iriScore ?lang ?langScore) {
          (rdfs:label 10 "en" 10)
          (rdfs:label 10 "de" 8)
          (rdfs:label 10 "" 5)
          (doap:name 5 "de" 10)
          (doap:name 5 "" 5)
        }
        ?s ?iri ?candLabel .
        FILTER(langMatches(lang(?candLabel), ?lang))
        BIND(?iriScore * ?langScore AS ?totalScore)
      } ORDER BY DESC(?totalScore) LIMIT 1 }
    }
    BIND(if(!bound(?candLabel), CONCAT('no candidate label found for ', STR(?s)), ?candLabel) AS ?label)
  }
}
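For readers without a LATERAL-capable engine at hand, the scoring logic of the query above can be mirrored in plain Python. This is a sketch of the idea only, not a replacement for the query: the tuple-based data layout and the `lang_ok` and `best_label` helpers are made up for the illustration, and treating the empty range "" as "untagged literals only" is an assumption of this sketch about how the "" rows are meant to behave.

```python
from typing import List, Tuple

# (property, property score, language range, language score), as in the VALUES block.
SCORES = [
    ("rdfs:label", 10, "en", 10),
    ("rdfs:label", 10, "de", 8),
    ("rdfs:label", 10, "", 5),
    ("doap:name", 5, "de", 10),
    ("doap:name", 5, "", 5),
]

# (subject, property, lexical form, language tag), as in the sample data.
TRIPLES = [
    (":notepad", "rdfs:label", "rdfs-Notepad", "en"),
    (":notepad", "rdfs:label", "rdfs-Notizblock", "de"),
    (":notepad", "rdfs:label", "rdfs-Bloc", ""),
    (":notepad", "doap:name", "doap-Notepad", "en"),
    (":notepad", "doap:name", "doap-Notizblock", "de"),
    (":notepad", "doap:name", "doap-Bloc", ""),
]


def lang_ok(tag: str, rng: str) -> bool:
    """Assumption: "" selects untagged literals; otherwise prefix match on subtags."""
    tag, rng = tag.lower(), rng.lower()
    if rng == "":
        return tag == ""
    return tag == rng or tag.startswith(rng + "-")


def best_label(subject: str, triples: List[Tuple[str, str, str, str]],
               scores: List[Tuple[str, int, str, int]]) -> str:
    """Pick the candidate with the highest propertyScore * languageScore."""
    best = None  # (total score, lexical form)
    for s, p, text, lang in triples:
        if s != subject:
            continue
        for prop, prop_score, rng, lang_score in scores:
            if p == prop and lang_ok(lang, rng):
                total = prop_score * lang_score
                if best is None or total > best[0]:
                    best = (total, text)
    return best[1] if best else "no candidate label found for " + subject


print(best_label(":notepad", TRIPLES, SCORES))  # rdfs-Notepad (score 10 * 10)
```

Raising the language score of the `doap:name`/"de" row above 20 makes "doap-Notizblock" win, which is the kind of experiment the VALUES block invites.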

JervenBolleman commented 1 year ago

@Aklakan yes, indeed. IMHO LATERAL is the more powerful feature, and better than PREFER as it currently stands in the SEP. I should sit down and rewrite PREFER in terms of a LATERAL shorthand. 20 hours of plane time coming up so we might get lucky, otherwise I wouldn't mind someone else trying their hand at that.

VladimirAlexiev commented 7 months ago

I used COALESCE for a long time, eg see this generated query: https://gist.github.com/VladimirAlexiev/cf2de89b692bbc2ae70917aae021ec07#file-wd-mapping-sparql-L106-L112.

Then I learned an effective trick that in hindsight is very obvious:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
select * {
    values ?x {wd:Q472 wd:Q407198 wd:Q45625 wd:Q129632 wd:Q287380 wd:Q47187}
    OPTIONAL { ?x rdfs:label ?xLabel . FILTER (lang(?xLabel) = 'bg') }
    OPTIONAL { ?x rdfs:label ?xLabel . FILTER (lang(?xLabel) = 'de') }
    OPTIONAL { ?x rdfs:label ?xLabel . FILTER (lang(?xLabel) = 'en') }
    OPTIONAL { ?x rdfs:label ?xLabel . FILTER (lang(?xLabel) = 'fr') }
    OPTIONAL { ?x rdfs:label ?xLabel . FILTER (lang(?xLabel) = 'ru') }
}

The first OPTIONAL that matches binds the variable, so the rest cannot succeed.
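Why the chain of OPTIONALs behaves this way can be seen from a toy left-join simulation in Python. It is a simplified model under stated assumptions, not how any engine is implemented: solutions are plain dicts, the `LABELS` data and the `ex:` names are invented for the example, and the language FILTER is folded into the pattern matcher.

```python
from typing import Callable, Dict, List, Tuple

Literal = Tuple[str, str]  # (lexical form, language tag)
Solution = Dict[str, object]

# Hypothetical toy data standing in for the rdfs:label triples.
LABELS: Dict[str, List[Literal]] = {
    "ex:a": [("Sofiya", "bg"), ("Sofia", "en")],
    "ex:b": [("Berlin", "de"), ("Berlin", "en")],
    "ex:c": [("Paris", "fr")],
    "ex:d": [],
}


def label_pattern(lang: str) -> Callable[[Solution], List[Solution]]:
    """Models { ?x rdfs:label ?xLabel . FILTER (lang(?xLabel) = lang) }."""
    def match(sol: Solution) -> List[Solution]:
        return [{"xLabel": lit} for lit in LABELS[sol["x"]] if lit[1] == lang]
    return match


def optional(solutions: List[Solution],
             pattern: Callable[[Solution], List[Solution]]) -> List[Solution]:
    """Left join: extend with compatible matches, else keep the solution as is."""
    out: List[Solution] = []
    for sol in solutions:
        # A match is compatible only if it agrees on every variable already bound.
        merged = [{**sol, **m} for m in pattern(sol)
                  if all(sol.get(k, v) == v for k, v in m.items())]
        out.extend(merged or [sol])
    return out


solutions: List[Solution] = [{"x": x} for x in LABELS]
for lang in ["bg", "de", "en", "fr", "ru"]:
    solutions = optional(solutions, label_pattern(lang))

for sol in solutions:
    print(sol["x"], sol.get("xLabel"))
```

Once ?xLabel is bound, a later OPTIONAL can only contribute a match that agrees with the existing binding, so a label in another language is incompatible and the earlier binding survives. Note that `("Berlin", "de")` and `("Berlin", "en")` count as different values here because the language tag is part of the literal.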

But for some reason, this trick doesn't work in https://github.com/rdf-community/discussions/discussions/7

TallTed commented 7 months ago

Interesting... I note that Virtuoso's output (query) includes the @lang tag on each ?xLabel value, while Blazegraph's output (query) omits it. The latter seems wrong.

w3c / sparql-dev

Resolving labels with language negotiation #13