maxlath / wikibase-cli

read and edit a Wikibase instance from the command line
MIT License

Fetch a property for a list of QIDs in CSV format #148

Open tuukka opened 3 years ago

tuukka commented 3 years ago

Is there a good way to fetch one or multiple properties for a (potentially long) list of QIDs in CSV format?

Here's example code for what I have so far using wd convert, but would it make sense for it to support --format csv and fetching more than one property at a time?

$ echo Q3572332 Q98407233 Q10428420 | wd convert --subjects --property P6375 | jq -r '
  to_entries
    | .[]
    | .key as $qcode
    | .value[] as $address
    | [$qcode,$address]
    | @csv'
"Q3572332","Eläintarhantie 1"
"Q3572332","Siltasaarenkatu 18"
"Q98407233","Agricolankatu 1-3"
"Q10428420","Fleminginkatu 1"
"Q10428420","Porthaninkatu 12"
"Q10428420","Viides linja 11"

Or the same using wd data (but does it fetch all item data, and would it be more difficult to implement --format csv?):

$ echo Q3572332 Q98407233 Q10428420 | wd data --simplify --props claims.P6375 | jq -r '
  .id as $qcode 
    | .claims.P6375[] as $address
    | [$qcode,$address]
    | @csv'
"Q3572332","Eläintarhantie 1"
"Q3572332","Siltasaarenkatu 18"
"Q98407233","Agricolankatu 1-3"
"Q10428420","Viides linja 11"
"Q10428420","Fleminginkatu 1"
"Q10428420","Porthaninkatu 12"

maxlath commented 3 years ago

Is there a good way to fetch one or multiple properties for a (potentially long) list of QIDs

For one property, wd convert seems to do the job, but it currently does not work for multiple properties. You could write a SPARQL query extending what wd convert does, but you would need to handle the split into batches yourself (wd convert uses batches of 1000 at once).
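As an illustration of the batching mentioned above, here is a minimal Python sketch that splits a long list of QIDs into batches of 1000 and builds one SPARQL query per batch for several properties. The helper names batches and build_query are hypothetical, and the query shape is an assumption; it is not taken from wd convert's implementation. The wd: and wdt: prefixes are the ones predefined by the Wikidata Query Service.

```python
from itertools import islice


def batches(ids, size=1000):
    """Yield successive lists of at most `size` ids."""
    it = iter(ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_query(qids, pids):
    """Build one SPARQL query for a batch of items and a set of properties.

    Hypothetical query shape: one result row per (item, property, value).
    """
    items = " ".join(f"wd:{q}" for q in qids)
    props = " ".join(f"wdt:{p}" for p in pids)
    return (
        "SELECT ?item ?prop ?value WHERE { "
        f"VALUES ?item {{ {items} }} "
        f"VALUES ?prop {{ {props} }} "
        "?item ?prop ?value }"
    )


# One query per batch of at most 1000 items
qids = [f"Q{i}" for i in range(1, 2501)]
queries = [build_query(batch, ["P6375", "P4595"]) for batch in batches(qids)]
print(len(queries))
```

Each query would then be sent to the SPARQL endpoint separately and the results concatenated.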

in CSV format

It can get tricky to get from JSON with deeply nested objects to CSV, but could work for some basic cases.

but does it fetch all item data

No, but almost: when you specify --props claims.P6375, the smallest amount of data we can request from the API is basic info + all the claims, by setting props=claims.

would it be more difficult to implement --format csv?

I gave it a try in this branch. The proposed syntax would be:

echo Q3572332 Q98407233 Q10428420 | wd data --props claims.P6375 --format csv

and output

id,claims.P6375
Q3572332,"Eläintarhantie 1,Siltasaarenkatu 18"
Q98407233,Agricolankatu 1-3
Q10428420,"Viides linja 11,Fleminginkatu 1,Porthaninkatu 12"

Note that P6375 values are grouped per entity: we could generate several rows per entity as in your version, but I'm not sure how we could make it work for cases where there are several properties (generating all combinations seems unnecessarily verbose). Would that work for your use case?

tuukka commented 3 years ago

Thank you for the quick implementation!

I was thinking this would be useful in lots of use cases, but my current use case is trying to find matches between certain Wikidata items and another big dataset (OpenStreetMap) based on street addresses. In this case, I need separate rows for each address to see if any of them match, and if I matched on multiple properties, it would be preferable to get all the combinations to see if any of them match a combination present in the other dataset. Could it make sense to do that by default and have an option like --join-values , to get your current output?
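To make the "all combinations" idea concrete, here is a minimal Python sketch, assuming each entity's simplified claims are available as a mapping from property ID to a list of values. The function name rows_for_entity is hypothetical, and the sample values are the ones appearing in this thread.

```python
from itertools import product


def rows_for_entity(entity_id, values_by_prop, props):
    """Return one row per combination of values across the requested properties."""
    # A property with no values contributes a single empty cell,
    # so the entity still appears in the output.
    columns = [values_by_prop.get(p) or [""] for p in props]
    return [[entity_id, *combo] for combo in product(*columns)]


rows = rows_for_entity(
    "Q3572332",
    {"P6375": ["Eläintarhantie 1", "Siltasaarenkatu 18"], "P4595": ["Helsinki"]},
    ["P6375", "P4595"],
)
for row in rows:
    print(row)
```

The number of rows per entity is the product of the value counts of each property, which is where the verbosity concern comes from: three properties with four values each already yield 64 rows for a single entity.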

Multiple values are also the difficult part in the sense that, before today, I had no idea how to do the above in jq. I can manage now, but I would not want to suggest that anyone learn this. :sweat_smile: (This made it click in the manual: "Thus as functions as something of a foreach loop.")
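For readers who would rather avoid jq, the "foreach" behaviour quoted above corresponds to plain nested loops. Here is a minimal Python sketch of the first jq filter in this thread, assuming the input is a JSON object mapping each QID to a list of values (the shape implied by to_entries and .value[]; it has not been checked against wd convert's actual output). The function name subjects_to_csv is hypothetical.

```python
import csv
import io
import json


def subjects_to_csv(mapping):
    """Emit one fully quoted CSV row per (entity id, value) pair."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for qid, values in mapping.items():  # jq: to_entries | .[] | .key as $qcode
        for value in values:             # jq: .value[] as $address
            writer.writerow([qid, value])
    return out.getvalue()


sample = json.loads(
    '{"Q3572332": ["Eläintarhantie 1", "Siltasaarenkatu 18"],'
    ' "Q98407233": ["Agricolankatu 1-3"]}'
)
print(subjects_to_csv(sample), end="")
```

In a pipeline, the same function could read the JSON from standard input with json.load(sys.stdin) instead of the inline sample.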

maxlath commented 3 years ago

I'm very grateful that you posted those jq commands. I use jq a lot but had never encountered that "as" syntax before; it's quite powerful ^^

tuukka commented 3 years ago

I should add that, for now, the solution could simply be to include examples like these in wikibase-cli's documentation, so that people can use them as templates for what they need.

maxlath commented 3 years ago

I pushed more commits on that branch: now echo Q3572332 Q98407233 Q10428420 | wd data --props claims.P6375 --format csv outputs

id,claims.P6375
Q3572332,Eläintarhantie 1
Q3572332,Siltasaarenkatu 18
Q98407233,Agricolankatu 1-3
Q10428420,Viides linja 11
Q10428420,Fleminginkatu 1
Q10428420,Porthaninkatu 12

but the previous behaviour can, as suggested, be recovered with --join. Ready to merge, or do you see any missing feature?

tuukka commented 3 years ago

I tested the current version briefly, and I would have liked to be able to pass a custom separator as an argument to --join instead of the comma, since addresses, for example, often contain commas.
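The separator concern can be illustrated with a small Python sketch that post-processes per-entity value lists. The function joined_csv and its separator parameter are hypothetical and only mimic the requested behaviour; they are not wikibase-cli options.

```python
import csv
import io


def joined_csv(rows, header, separator=" | "):
    """Write one CSV row per entity, joining multiple values with `separator`."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for entity_id, values in rows:
        # The csv module quotes any field that contains a comma,
        # so values stay unambiguous as long as the separator is not a comma.
        writer.writerow([entity_id, separator.join(values)])
    return out.getvalue()


print(
    joined_csv(
        [("Q3572332", ["Eläintarhantie 1", "Siltasaarenkatu 18"])],
        ["id", "claims.P6375"],
    ),
    end="",
)
```

With a comma as the separator, a value such as "Main Street 1, Helsinki" could not be told apart from two joined values, which is the ambiguity a custom separator avoids.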

Also, I expected adding a claim to just result in an added column to the non-joined results, but of course, it turned on the joined mode. I understand this avoids combinatorial explosions but is it more important than consistency? echo Q3572332 Q98407233 Q10428420 | PATH=bin:$PATH wd data --props claims.P6375,claims.P4595 --format csv:

id,claims.P6375,claims.P4595
Q3572332,"Eläintarhantie 1,Siltasaarenkatu 18",Helsinki
Q98407233,Agricolankatu 1-3,Helsinki
Q10428420,"Viides linja 11,Fleminginkatu 1,Porthaninkatu 12",Helsinki

(By the way, I also noticed that the argument to format is not validated as I sometimes typed "CSV" instead of "csv".)
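As a sketch of what validating the format argument could look like, here is a minimal Python example using argparse that lowercases the value before checking it against a list of allowed formats. wikibase-cli is not written in Python, so this only illustrates the idea; the parser, the program name, and the list of accepted formats are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser whose --format value is normalized, then validated."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument(
        "--format",
        type=str.lower,           # "CSV" becomes "csv" before validation
        choices=["json", "csv"],  # hypothetical list of supported formats
        default="json",
    )
    return parser


args = build_parser().parse_args(["--format", "CSV"])
print(args.format)
```

An unsupported value such as --format xml makes argparse print an error and exit instead of silently falling back to the default output.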