usc-isi-i2 / kgtk

Knowledge Graph Toolkit
https://kgtk.readthedocs.io/en/latest/
MIT License

SPARQL queries #450

Open GiorgioBarnabo opened 3 years ago

GiorgioBarnabo commented 3 years ago

Dear kgtk team,

I recently read your seminal paper on the KGTK system, and I found it of great interest. You really are tackling an urgent problem in the community. Nonetheless, after looking at the example folder on the GitHub repository, I was not able to fully understand whether your system can be easily used to perform SPARQL queries on the KG I need to work with. For example, it would be cool if the user could express his or her queries in SPARQL without specifying all the operations that a relational database would need to perform. If I understood correctly, I should manually translate the SPARQL queries into Cypher, right? If this is the case, could you please point me to a good learning resource?

It seems that KGTK is more of a pre-processing tool than a system for performing queries faster. Am I missing something?

Finally, I wanted to ask you how I can download and index the whole of Wikidata into KGTK.

Sorry for the basic questions and thanks again for this very cool contribution.

Best,

Giorgio

GiorgioBarnabo commented 3 years ago

p.s. It would also be cool to create a Colab notebook to show some basic functionalities of the system ;)

szeke commented 3 years ago

Hi Giorgio, thanks for your interest in KGTK. KGTK does not directly support SPARQL. KGTK provides a command to output a KGTK knowledge graph as triples (https://kgtk.readthedocs.io/en/latest/export/generate_wikidata_triples/), so you can load the graph into a triple store and run SPARQL queries there. However, the Cypher-based query language in KGTK (called Kypher) supports more efficient execution of analytic queries. The table below shows a comparison of Kypher with SPARQL on several query types (the paper is currently under review).

[image: table comparing Kypher and SPARQL query times on several query types]

The documentation for Kypher is https://kgtk.readthedocs.io/en/latest/transform/query/
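As a rough illustration (the file name here is just a placeholder, and Q42/P31 are the usual Wikidata identifiers for Douglas Adams and "instance of"), a minimal Kypher query could look something like this:

kgtk query -i claims.tsv.gz \
        --match '(:Q42)-[:P31]->(class)' \
        --return 'class'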

You can find the Wikidata files in KGTK format at https://drive.google.com/drive/folders/16afEfKJGJDVXwUhnTVrgdceQC7ddNOAC?usp=sharing

A Colab notebook is a good idea. We will work on providing one.

GiorgioBarnabo commented 3 years ago

Dear Pedro,

thank you very much for your answer. I will look into the documentation that you pointed out to me.

As for the Wikidata files in KGTK format, I was not able to understand which one I should download! Are they all different Wikidata-related tables?

Moreover, if I wanted to directly download and index a wikidata dump, which one should I choose?

Thanks again and let me know if you need any help with the implementation.

Best,

Giorgio

dgarijo commented 3 years ago

Hi Giorgio, in the link @szeke shared (https://drive.google.com/drive/u/1/folders/16afEfKJGJDVXwUhnTVrgdceQC7ddNOAC) you can find the KGTK files corresponding to the February 2021 Wikidata JSON dump. The files are different slices of the Wikidata dump. For example, if you want to work only with quantities, you can just use claims.quantity.tsv.gz. Depending on your use case, this is usually more efficient.

If you want to use them all, you can load all.tsv.gz, which is around 65GB.
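For instance, a quick sanity check on the quantity slice could look roughly like this (P1082 is the Wikidata "population" property; the exact file name should match the slice you downloaded):

kgtk query -i claims.quantity.tsv.gz \
        --match '(place)-[:P1082]->(population)' \
        --return 'place, population' \
        --limit 10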

If you want to do the conversion yourself, just choose a Wikidata JSON dump and run the kgtk import-wikidata command, which depending on the dump may take a few hours (on a decent server, it took 3-4 hours). This is the command I usually run:

kgtk import-wikidata \
        -i $line \
        --node "$folder_new_name"/nodefile.tsv \
        --edge "$folder_new_name"/edgefile.tsv \
        --qual "$folder_new_name"/qualfile.tsv \
        --use-mgzip-for-input True \
        --use-mgzip-for-output True \
        --use-shm True \
        --procs 12 \
        --mapper-batch-size 5 \
        --max-size-per-mapper-queue 3 \
        --single-mapper-queue True \
        --collect-results True \
        --collect-seperately True \
        --collector-batch-size 10 \
        --collector-queue-per-proc-size 3 \
        --progress-interval 500000

where $line is the input JSON file (gzipped), and $folder_new_name is the folder where I want to store the results. I usually separate the dump into 3 files: the node file (with basic info about all entities), the edge file (the graph) and the qualifier file (only qualifiers). Hope that helps.
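For concreteness, the two shell variables in the command above could be set like this before running it (the values are just an example; use your own dump file and output folder):

line=latest-all.json.gz            # gzipped Wikidata JSON dump
folder_new_name=wikidata-kgtk      # folder for nodefile/edgefile/qualfile
mkdir -p "$folder_new_name"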

GullyBurns commented 2 years ago

Hi Guys,

To follow on from this, is there any query mechanism for running recursive queries within a knowledge graph in KGTK?

Gully

szeke commented 2 years ago

KGTK does not support sub-queries. However, KGTK supports chaining of queries, so you can run one query, produce a TSV file, and use that output TSV file as input to another query.
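A rough sketch of that chaining pattern (the file names and Q/P identifiers are only illustrative; P31 is "instance of" and Q5 is "human"):

# first query: extract all P31 edges into an intermediate KGTK edge file
kgtk query -i claims.tsv.gz \
        --match '(item)-[r:P31]->(class)' \
        --return 'r as id, item as node1, r.label as label, class as node2' \
        -o instance_of.tsv

# second query: run over the intermediate file produced above
kgtk query -i instance_of.tsv \
        --match '(person)-[:P31]->(:Q5)' \
        --return 'person' \
        -o humans.tsv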

GullyBurns commented 2 years ago

I'm thinking of the SPARQL construct where you can follow a chain of relations.

This query finds any descendent subclass node of the ?mondo_id class through the use of the + property path operator.

SELECT DISTINCT ?mondo_id ?name ?descendent_id ?descendent_name
WHERE {
    ?mondo_id rdfs:label ?name .
    ?descendent_id rdfs:subClassOf+ ?mondo_id .
    ?descendent_id rdfs:label ?descendent_name .
}

Is this possible in KGTK?