nichtich / wikidata-taxonomy

command-line tool to extract taxonomies from Wikidata
https://www.npmjs.org/package/wikidata-taxonomy
MIT License

--sparql-endpoint / -e for http://localhost:8989/bigdata/sparql (custom Wikibase instance) returns data from Wikidata #45

Open dbs opened 5 years ago

dbs commented 5 years ago

Running Wikibase locally, I can generate results via curl:

curl http://localhost:8989/bigdata/sparql?SELECT%20DISTINCT%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D
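For reference, the SPARQL 1.1 Protocol passes the query text in a `query` parameter on GET requests. The following is a minimal JavaScript sketch of building such a URL for the local endpoint. The helper name `buildSparqlUrl` is hypothetical and is not part of wikidata-taxonomy.

```javascript
// Hypothetical helper (not part of wikidata-taxonomy): build a SPARQL GET URL.
function buildSparqlUrl (endpoint, query) {
  // encodeURIComponent escapes spaces, braces and question marks in the query
  return endpoint + '?query=' + encodeURIComponent(query)
}

const url = buildSparqlUrl(
  'http://localhost:8989/bigdata/sparql',
  'SELECT DISTINCT ?p WHERE { ?s ?p ?o }'
)
console.log(url)
```

The resulting string can be passed to curl (in quotes, so the shell does not interpret `?` or `&`).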

But trying to query the same endpoint with wikidata-taxonomy returns data from Wikidata instead:

node wdtaxonomy.js Q3 --sparql-endpoint http://localhost:8989/bigdata/sparql 
life (Q3) •188 ↑↑↑
├──extraterrestrial life (Q181508) •81 ×1 ↑
│  ├──life on Mars (Q601319) •34 ×1
│  ├──Martian (Q913850) •25 ×4
│  ├──Life on Titan (Q2591050) •15
│  └──extraterrestrial intelligence (Q15107669) •7
├──personal life (Q2867027) •20
└──human life (Q19771042) •3

I get the same result if I install wikidata-taxonomy globally with npm install -g.

It's late at night, so I'll toss out a theory: does it implicitly depend on properties such as P279 existing in the target endpoint, and fall back to Wikidata if the query to the specified endpoint doesn't return the expected data?

dbs commented 5 years ago

Answering my own question: it does indeed rely on properties, but we are given options for mapping those properties to our own instances, and these are required if you're using a Wikibase instance.

However, this still doesn't resolve my problem - I'm still getting results back from Wikidata instead of the Wikibase instance.

So on my Wikibase instance, where the WD property P279 maps to P297, and WD P31 maps to P28, and WD P1709 maps to P251, the following request:

wdtaxonomy --sparql-endpoint http://localhost:9292/bigdata/sparql -P P297,P28 -m P251 -s Q46

generates the following query:

  SELECT ?item ?broader ?itemLabel ?instances ?sites ?mapping ?mappingProperty WITH {
    SELECT DISTINCT ?item { ?item wdt:P297* wd:Q46 }
  } AS %items WHERE { 
    INCLUDE %items .
    OPTIONAL { ?item wdt:P297 ?broader } .
    {
      SELECT ?item (count(distinct ?element) as ?instances) {
        INCLUDE %items.
        OPTIONAL { ?element wdt:P28 ?item }
      } GROUP BY ?item
    }
    {
      SELECT ?item (count(distinct ?site) as ?sites) {
        INCLUDE %items.
        OPTIONAL { ?site schema:about ?item }
      } GROUP BY ?item
    }
    {
      SELECT ?item ?mapping ?mappingProperty {
        INCLUDE %items .
        OPTIONAL {
          { ?item wdt:P251 ?mapping . BIND('P251' AS ?mappingProperty) }
        }
      }
    }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en"
    }
  }
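The property IDs given with -P and -m show up in only a handful of triple patterns in the query above. As a rough illustration of that substitution, here is a sketch in JavaScript. It is not the tool's actual code: the function name `propertyPatterns`, its parameter names, and the use of UNION to join several mapping properties are all assumptions.

```javascript
// Hypothetical sketch: build the property-dependent triple patterns
// from user-supplied IDs rather than from Wikidata defaults.
function propertyPatterns ({ root, subclassOf, instanceOf, mappings }) {
  return {
    // transitive closure of the hierarchy property below the root item
    items: `?item wdt:${subclassOf}* wd:${root}`,
    broader: `?item wdt:${subclassOf} ?broader`,
    instances: `?element wdt:${instanceOf} ?item`,
    // one block per mapping property; UNION is an assumption for several IDs
    mappings: mappings
      .map(p => `{ ?item wdt:${p} ?mapping . BIND('${p}' AS ?mappingProperty) }`)
      .join(' UNION ')
  }
}

const patterns = propertyPatterns({
  root: 'Q46',
  subclassOf: 'P297',
  instanceOf: 'P28',
  mappings: ['P251']
})
console.log(patterns.items)
```

With the values from the command above, each pattern matches the corresponding line of the generated query.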

And if I run that query against my instance, I get the expected results for the partial import of human/person on my instance (just the top few lines included for space):

item | broader | itemLabel | instances | sites | mapping | mappingProperty
-- | -- | -- | -- | -- | -- | --
http://wikibase.svc/entity/Q46 | http://wikibase.svc/entity/Q421 | human | 1 | 0 | http://schema.org/Person | P251
http://wikibase.svc/entity/Q46 | http://wikibase.svc/entity/Q421 | human | 1 | 0 | http://dbpedia.org/ontology/Person | P251
http://wikibase.svc/entity/Q421 | http://wikibase.svc/entity/Q425 | person | 0 | 0 | http://xmlns.com/foaf/0.1/Person | P251
http://wikibase.svc/entity/Q421 | http://wikibase.svc/entity/Q425 | person | 0 | 0 | http://id.loc.gov/ontologies/bibframe/Person | P251
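Each item appears once per mapping in these rows, so a consumer has to collapse them before building a tree. A minimal sketch of that grouping in JavaScript follows. The function name `groupByItem` and the plain-object row shape are assumptions for illustration, and the sketch keeps only one broader item per entry for simplicity.

```javascript
// Sample rows copied from the table above (only the columns used here).
const rows = [
  { item: 'http://wikibase.svc/entity/Q46', broader: 'http://wikibase.svc/entity/Q421', itemLabel: 'human', mapping: 'http://schema.org/Person' },
  { item: 'http://wikibase.svc/entity/Q46', broader: 'http://wikibase.svc/entity/Q421', itemLabel: 'human', mapping: 'http://dbpedia.org/ontology/Person' },
  { item: 'http://wikibase.svc/entity/Q421', broader: 'http://wikibase.svc/entity/Q425', itemLabel: 'person', mapping: 'http://xmlns.com/foaf/0.1/Person' },
  { item: 'http://wikibase.svc/entity/Q421', broader: 'http://wikibase.svc/entity/Q425', itemLabel: 'person', mapping: 'http://id.loc.gov/ontologies/bibframe/Person' }
]

// Hypothetical helper: collapse rows that differ only in their mapping
// into one entry per item URI.
function groupByItem (rows) {
  const items = new Map()
  for (const row of rows) {
    if (!items.has(row.item)) {
      items.set(row.item, { label: row.itemLabel, broader: row.broader, mappings: [] })
    }
    items.get(row.item).mappings.push(row.mapping)
  }
  return items
}

const grouped = groupByItem(rows)
console.log(grouped.get('http://wikibase.svc/entity/Q46'))
```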

However, if I drop the -s parameter to run that query against my Wikibase instance:

wdtaxonomy --sparql-endpoint http://localhost:9292/bigdata/sparql -P P297,P28 -m P251 Q46

Instead of seeing the hierarchy for Q46 ('human') from my instance, I am shown the taxonomy for Q46 ('Europe') drawn from Wikidata, clearly using the property mappings that break the hierarchy:

Europe (Q46) •350

So, I'm still trying to figure out why the --sparql-endpoint parameter appears to be ignored. Maybe a louder warning or error message would help identify whatever I'm still doing wrong?

nichtich commented 5 years ago

Thanks for the notification. I had never actually tested the --sparql-endpoint option, so it was broken, sorry for that. Can you please check out the latest version from source and try again?

dbs commented 5 years ago

Thanks, that appears to connect properly to the SPARQL endpoint, so this specific issue can be closed.

However, it never returns any results, which should probably be the subject of a new issue. My suspicion is that this is because the default namespace for Wikibase is http://wikibase.svc/ but lib/query.js hardcodes http://www.wikidata.org/. Similarly, lib/query.js has a hardcoded reference to P31, lib/serialize.js refers to http://www.wikidata.org/entity/P31 for occurrence counts, and lib/mappings.js also has a number of hardcoded relational P-ids.
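To illustrate why a hardcoded namespace yields empty results, here is a small JavaScript sketch of extracting an entity ID with a configurable namespace. The function `entityId` and its `namespace` parameter are hypothetical and do not reflect the actual contents of lib/query.js or lib/serialize.js.

```javascript
// Wikidata's entity namespace, used as the default here.
const WIKIDATA_ENTITY = 'http://www.wikidata.org/entity/'

// Hypothetical helper: strip a configurable entity namespace from a URI.
// Returns null when the URI does not belong to the namespace.
function entityId (uri, namespace = WIKIDATA_ENTITY) {
  return uri.startsWith(namespace) ? uri.slice(namespace.length) : null
}

// With the Wikidata default, a local Wikibase URI is not recognized...
console.log(entityId('http://wikibase.svc/entity/Q46'))
// ...but it resolves once the instance's own namespace is passed in.
console.log(entityId('http://wikibase.svc/entity/Q46', 'http://wikibase.svc/entity/'))
```

If the code that post-processes results only recognizes the Wikidata prefix, every row from a local instance would be dropped in the same way, which would be consistent with the empty output described above.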

It seems like adapting this wonderful tool to truly support Wikibase would require a fair bit of change to support all of the required mappings dynamically (either at the command line or by passing in a file of the mappings). Perhaps that's not feasible in the short term.

nichtich commented 5 years ago

Thanks again, I had better try with a Wikibase instance of my own. Your analysis is correct: fixing this requires making the hardcoded namespace and property IDs you listed configurable.

Options would also be better read from a config file (#46).
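One possible shape for config-file support is to layer built-in defaults, file settings, and command-line options in that order of precedence. The JavaScript sketch below is only an illustration under stated assumptions: the option names, the `resolveOptions` function, and the default property list are hypothetical and do not describe what #46 actually implements.

```javascript
// Assumed defaults; the key names are hypothetical.
const DEFAULTS = {
  sparqlEndpoint: 'https://query.wikidata.org/sparql',
  property: ['P279', 'P31'],
  mappings: []
}

// Hypothetical helper: command-line values override the config file,
// which in turn overrides the defaults. Undefined CLI values are ignored.
function resolveOptions (fileConfig, cliOptions) {
  const given = Object.fromEntries(
    Object.entries(cliOptions).filter(([, value]) => value !== undefined)
  )
  return { ...DEFAULTS, ...fileConfig, ...given }
}

const options = resolveOptions(
  { sparqlEndpoint: 'http://localhost:9292/bigdata/sparql', property: ['P297', 'P28'] },
  { sparqlEndpoint: undefined, mappings: ['P251'] }
)
console.log(options)
```

Keeping the instance-specific property mappings in a file would avoid repeating -P and -m on every invocation.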