lvaudor / glitter

an R package which writes SPARQL queries
https://lvaudor.github.io/glitter

Add messages for Wikidata query building (maybe for a Wikidata specific R package). #66

Closed maelle closed 1 year ago

maelle commented 2 years ago

#55 handles spq_select(), spq_mutate(), spq_summarise() but I'm not sure how to handle spq_add() yet.

maelle commented 2 years ago

@lvaudor do you have any idea/wish?

Maybe

lvaudor commented 2 years ago

Originally, glitter (or recitR, actually) worked with the 'subject', 'verb', 'object' arguments (there was no 'triplet' argument), but I like the triplet argument better: fewer quotes, and simpler to read and understand in the LOD queries conceptual framework because triplets are basically just sentences. I tried to make sure that there were no other arguments starting with t (same for subject, verb, object too, I think?) so that shortening the triplet argument to just 't=' is alright.

I'm not sure spq_add() can be modified in the same way you did for the other functions, in particular I have no idea how we could "unstring" it (and see, I was not expecting you to find a way hehe). Of course, if you see other ways to improve this function I'm all ears :-)

maelle commented 2 years ago

Ok, let's keep it as is for now at least. Thank you!

maelle commented 2 years ago

One thing that might be simplified is the label argument :thinking:

lvaudor commented 2 years ago

yes! and maybe we could have a label argument equivalent for queries to endpoints other than Wikidata? Typically that would mean that something like

add_triplet("?s v ?o", label=c("s"))

would add an implicit triplet

"?s rdfs:label ?sLabel"

lvaudor commented 2 years ago

... because I think adding these triplets manually is a bit cumbersome
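
A minimal sketch of how such an implicit labelling triplet could be derived from the label argument. The helper name label_triples() is hypothetical and is not part of glitter; it only illustrates the string manipulation, assuming the endpoint exposes labels through rdfs:label.

```r
# Hypothetical helper (not glitter's API): build the implicit labelling
# triple patterns for the variables named in `label`.
label_triples <- function(triple, label = character()) {
  # Variables present in the triple pattern, without the leading "?"
  vars <- regmatches(triple, gregexpr("\\?[A-Za-z_][A-Za-z0-9_]*", triple))[[1]]
  vars <- sub("^\\?", "", vars)
  # Accept both "s" and "?s" in the label argument
  label <- sub("^\\?", "", label)

  # Refuse to label a variable that the pattern does not mention
  unknown <- setdiff(label, vars)
  if (length(unknown) > 0) {
    stop("Not in the triple pattern: ", paste(unknown, collapse = ", "))
  }

  sprintf("?%s rdfs:label ?%sLabel", label, label)
}

label_triples("?s v ?o", label = c("s"))
```

With the example above, the call returns the triplet "?s rdfs:label ?sLabel", which could then be appended to the query alongside the original pattern.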

maelle commented 2 years ago

Another worry I have is the difference between spq_add() and spq_filter() i.e. how to know where to put each part of a query, what's the best strategy :thinking:

maelle commented 2 years ago

I guess that if you can express something in terms of properties you should use spq_add(), and then spq_filter() is for value comparisons etc.

maelle commented 2 years ago

Regarding label it might be good to be able to add all of them by default. I.e. a parameter label=TRUE.

maelle commented 2 years ago

A thing that irks me with a line like spq_add("?stations wdt:P31 wd:Q928830") is that it's not human-readable. I wonder whether we could have something like

spq_register("wdt:P31", as = `is_instance_of()`, service = "Wikidata")

We'd store the definition in an environment. We'd output a message with the label (so users could see "registered wdt:P31 (label "instance of") as glitter function is_instance_of()").
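
A rough sketch of the "store the definition in an environment" idea, in base R. Everything here is an assumption for illustration: the registry object, the helper spq_registered_triple(), and the fact that the label is passed by hand instead of being looked up on Wikidata (the service argument is left out). The alias is also given as a string.

```r
# Hypothetical registry (not glitter's API): maps R-friendly names to properties.
.spq_registry <- new.env(parent = emptyenv())

spq_register <- function(property, as, label = NA_character_) {
  # Store the definition under the alias name
  assign(as, list(property = property, label = label), envir = .spq_registry)
  message(sprintf(
    'Registered %s (label "%s") as glitter function %s()',
    property, label, as
  ))
  invisible(property)
}

# Translate a registered alias back into a triple pattern string
spq_registered_triple <- function(name, subject, object) {
  if (!exists(name, envir = .spq_registry, inherits = FALSE)) {
    stop("No property registered as ", name, "()")
  }
  entry <- get(name, envir = .spq_registry, inherits = FALSE)
  paste(subject, entry$property, object)
}

spq_register("wdt:P31", as = "is_instance_of", label = "instance of")
spq_registered_triple("is_instance_of", "?stations", "wd:Q928830")
```

Keeping the registry in its own environment means registered synonyms stay optional: a query written with raw triple patterns never touches it.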

Then later

stations_metro_Lyon=spq_init() %>% 
  spq_add("?stations wdt:P361 wd:Q1552", label="?stations") %>% 
  spq_add(is_instance_of(station, stations = "wd:Q928830")) %>% 
  spq_perform()

maelle commented 2 years ago

To me spq_add("?auteur foaf:birthday ?jour") reads as spq_mutate(jour = foaf::birthday(auteur)) :thinking:

maelle commented 2 years ago

Or spq_add(jour = foaf_birthday(auteur))

maelle commented 2 years ago

Current

tib=spq_init() %>% 
  spq_add("?auteur foaf:birthday ?jour") %>% 
  spq_add("?auteur bio:birth ?date1") %>% 
  spq_add("?auteur bio:death ?date2") %>% 
  spq_add("?auteur foaf:name ?nom", required=FALSE) %>% 
  spq_arrange(jour) %>% 
  spq_prefix() %>% 
  spq_head(n=10) %>%
  spq_perform(endpoint="dataBNF")

Maybe nicer (note the endpoint comes earlier as it's central):

tib=spq_init(auteur, endpoint = "dataBNF") %>% 
  spq_add(jour = foaf_birthday(auteur)) %>% 
  spq_add(date1 = bio_birth(auteur)) %>% 
  spq_add(date2 = bio_death(auteur)) %>% 
  spq_add(nom = foaf_name(auteur), required = FALSE) %>% 
  spq_arrange(jour) %>% 
  spq_prefix() %>% 
  spq_head(n=10) %>%
  spq_perform()

maelle commented 2 years ago

For spq_add("{fleurs_du_mal} foaf:focus ?Oeuvre") I think spq_filter(foaf_focus(Oeuvre) == fleurs_du_mal)

maelle commented 2 years ago

I keep coming back to my spq_register() idea to register synonyms. I really think it could help in some cases, but shouldn't be compulsory.

maelle commented 2 years ago

Current thoughts

So spq_add("?mayor wdt:P31 ?species") would be spq_filter(mayor = wdt::P31(species)) whereas spq_add("?auteur bio:birth ?date1") would be spq_add(date1 = bio::birth(auteur))

We can keep a function adding SPARQL filters directly with spq().

Instead of having a special rule for is / %in% I'd like to define two functions

The messages are still something I'd like to add.

maelle commented 2 years ago

I'm still not sure we want to have such different behavior for spq_filter() and spq_add(). FILTER is for filtering the data so it's different.

maelle commented 2 years ago
lvaudor commented 2 years ago

I guess that if you can express something in terms of properties you should use spq_add(), and then spq_filter() is for value comparisons etc.

Indeed. I guess that renaming spq_add() into spq_pattern() (following what I told you about triplets and triplet patterns) could make this difference clearer?

lvaudor commented 2 years ago

Regarding label it might be good to be able to add all of them by default. I.e. a parameter label=TRUE.

Yes and no, because not all unknowns can be labelled (a date or an image link can't, for instance), and for now the "strain" of knowing whether asking for a label makes sense is on the users themselves (who know better, presumably, than to ask for the label of a date). You could probably add the "labelling triplet pattern" with the required=FALSE option (so that you would get empty columns for some labels rather than dropping non-labelled individuals altogether), but that would return rather large tables with many empty columns, which I think is not ideal.

lvaudor commented 2 years ago
  • spq_triple() to add an actual triple (like spq_add() now)
  • spq_add() to add a thing to the results like birth date
  • spq_specify() to add a filter to the result like "item is an instance of cat or dog". :thinking:

OK, so for now I'd say:

maelle commented 2 years ago

Regarding spq_mutate() , spq_add("?auteur bio:birth ?date1") would then be spq_mutate(date1 = bio::birth(auteur))?

In which case the way we'd recognize it's not a mutate resulting in "blabla AS truc" is the presence of ::.
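
A minimal sketch of that detection in base R, assuming the expression has already been captured unevaluated. The helper mutate_to_triple() is hypothetical and is not how glitter does it; it only shows that a call whose function is itself a `::` call can be told apart from an ordinary function call.

```r
# Hypothetical helper: turn `date1 = bio::birth(auteur)` into a triple pattern,
# or return NULL when the expression is an ordinary mutate.
mutate_to_triple <- function(name, expr) {
  is_property_call <- is.call(expr) &&
    is.call(expr[[1]]) &&                        # the function is itself a call...
    identical(expr[[1]][[1]], as.name("::")) &&  # ...to the `::` operator
    length(expr) == 2 &&                         # exactly one argument
    is.name(expr[[2]])                           # which is a bare variable name
  if (!is_property_call) {
    return(NULL)
  }
  prefix <- as.character(expr[[1]][[2]])
  property <- as.character(expr[[1]][[3]])
  subject <- as.character(expr[[2]])
  sprintf("?%s %s:%s ?%s", subject, prefix, property, name)
}

mutate_to_triple("date1", quote(bio::birth(auteur)))  # a triple pattern
mutate_to_triple("n_items", quote(n(item)))           # NULL: a regular mutate
```

Because the expression is only inspected and never evaluated, no package called bio needs to exist for this to work.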

maelle commented 2 years ago

Just a note that by adding new behaviors to spq_mutate() and spq_filter() we're hiding some concepts from the users but it might be fine.

maelle commented 2 years ago

Also noting that for all functions using ... I'll add a dot in front of the other arguments for avoiding name clashes. E.g. ".triple".
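
To illustrate the name clash being avoided, here is a small sketch with a made-up function, spq_mutate_sketch(), and a made-up .label argument (neither is glitter's actual signature). With the dot prefix, a user can still create a variable called label through the dots.

```r
# Hypothetical signature: the dot prefix keeps `.label` from capturing
# a user variable that happens to be called `label`.
spq_mutate_sketch <- function(..., .label = NULL) {
  # Capture the expressions passed through the dots without evaluating them
  dots <- as.list(substitute(list(...)))[-1]
  list(new_variables = names(dots), label = .label)
}

res <- spq_mutate_sketch(label = toupper(name), .label = "?item")
res$new_variables
res$label
```

Here `label = toupper(name)` ends up in the dots as a new variable, while `.label = "?item"` is matched to the function's own argument.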

lvaudor commented 2 years ago

Regarding spq_mutate() , spq_add("?auteur bio:birth ?date1") would then be spq_mutate(date1 = bio::birth(auteur))?

In which case the way we'd recognize it's not a mutate resulting in "blabla AS truc" is the presence of ::.

In the same way that an R user can consider that "?thing is an instance of wd:xxxx" is a kind of filter (and hence might be tempted to pass it through a call to spq_filter()), he/she can consider that "?thing has property ?stuff" is a kind of mutate, since it adds a variable. I must admit that I hadn't thought of allowing for the syntax spq_mutate(date1 = bio::birth(auteur)).

I thought: either spq_mutate(triplet), and then it's a disguised call to spq_add()

or something like

spq_mutate(stuff=n(thing)) (with the SPARQL keywords translated as R functions)

Because right now you have not implemented these "bio::birth"-like functions right?

maelle commented 2 years ago

Because right now you have not implemented these "bio::birth"-like functions right?

I have started, actually, and it's not hard to support. Example https://github.com/lvaudor/glitter/pull/81/files#diff-762db8c96d7eced05483d186e208c0af7707b637e8a350af34d3165632fb7257R21

So I'd extend that to other functions + add a .triple argument to the functions. Does that sound good?

lvaudor commented 2 years ago

Because right now you have not implemented these "bio::birth"-like functions right?

I have started, actually, and it's not hard to support. Example https://github.com/lvaudor/glitter/pull/81/files#diff-762db8c96d7eced05483d186e208c0af7707b637e8a350af34d3165632fb7257R21

So I'd extend that to other functions + add a .triple argument to the functions. Does that sound good?

Great, I thought that might be a bit of a hassle. So, yes, sounds good!

maelle commented 2 years ago

In the PR a TODO would be to make spq_add() simple again.

lvaudor commented 2 years ago

just one thought:

Another argument against "forcing" all "s v o" triplet patterns into o=v(s) arguments is that sometimes you have "s ?v o" (which properties link subject and object) or "?s v o" (which subject is such that s v o) so how would you translate this into R logic?

(I'm just trying to justify my reluctance to drop triplet patterns entirely ;-) )

maelle commented 2 years ago

"s ?v o" (which properties link subject and object) or "?s v o" (which subject is such that s v o) so how would you translate this into R logic?

Oh yes, we definitely need a way to keep them. Now, for the sake of completeness, could you please give me two examples of those?

lvaudor commented 2 years ago

I think there can be quite a lot of examples of "?s v o" . One (well, two) in the Wikidata vignette:

stations_metro_Lyon=spq_init() %>%
  spq_add("?stations wdt:P361 wd:Q1552", label="?stations") %>%
  spq_add("?stations wdt:P31 wd:Q928830") %>%
  spq_add("?stations wdt:P625 ?coords") %>%
  spq_perform()

As for "s ?v o" I think we have no example for now but you'd get them when you're trying to explore a bit the contents of a database and which properties are available for an item, like in the following (SPARQL) query on DBpedia which looks for all the properties "to or from" the Apple company:

select distinct ?prop where {
  {?apple a <http://dbpedia.org/ontology/Company> .
   ?apple rdfs:label ?name .
   filter(regex(?name, "Apple Inc"))} .
  {{?x ?prop ?apple} union {?apple ?prop ?y}}
}

maelle commented 2 years ago

From this issue, we need to keep the idea to add messages for Wikidata query building (maybe for a Wikidata specific R package).

maelle commented 2 years ago

and the label argument could still use some simplifications eventually.

lvaudor commented 2 years ago

Yes, that's one of the things I can envision the most clearly (and excitedly) for future contracts actually ;-)

maelle commented 1 year ago

From this issue, we need to keep the idea to add messages for Wikidata query building (maybe for a Wikidata specific R package).

What would this be? Would this be in a separate package? Or should we drop this?

lvaudor commented 1 year ago

Ouch this issue and following conversation was very long indeed (and many aspects of it have been settled now: labelling is now greatly improved, instances of ?v are in your "exploring new SPARQL endpoints" vignette, etc.). The remaining considerations (should we build a Wikidata-specific package to handle Wikidata-tailored query-building messages) imply that we have the time to do so and we don't... So I think we can close it!