predict-idlab / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

Apply pyRDF2Vec on a knowledge graph with wikidata data #185

Closed STALINFIGUEROAALAVA closed 1 year ago

STALINFIGUEROAALAVA commented 1 year ago

❓ Question

knowledge_graph = KG("https://query.wikidata.org/sparql", is_remote=True)

transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
    verbose=1
)

# Get our embeddings.
embeddings = transformer.fit_transform(knowledge_graph, entities)
print(embeddings)

I am passing the correct entities, but when I try to compute the embeddings on the KG I get the error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)

GillesVandewiele commented 1 year ago

Can you paste a complete error stacktrace?

STALINFIGUEROAALAVA commented 1 year ago

JSONDecodeError Traceback (most recent call last)

<ipython-input-...> in <module>
      5 )
      6 # Get our embeddings.
----> 7 embeddings = transformer.fit_transform(knowledge_graph, entities)
      8 print(embeddings)

~\anaconda38\lib\site-packages\pyrdf2vec\rdf2vec.py in fit_transform(self, kg, entities, is_update)
    141         """
    142         self._is_extract_walks_literals = True
--> 143         self.fit(self.get_walks(kg, entities), is_update)
    144         return self.transform(kg, entities)
    145

~\anaconda38\lib\site-packages\pyrdf2vec\rdf2vec.py in get_walks(self, kg, entities)
    161
    162         """
--> 163         if kg.skip_verify is False and not kg.is_exist(entities):
    164             if kg.mul_req:
    165                 asyncio.run(kg.connector.close())

~\anaconda38\lib\site-packages\pyrdf2vec\graphs\kg.py in is_exist(self, entities)
    372             ]
    373         else:
--> 374             responses = [self.connector.fetch(query) for query in queries]
    375         responses = [res["boolean"] for res in responses]
    376         return False not in responses

~\anaconda38\lib\site-packages\pyrdf2vec\graphs\kg.py in <listcomp>(.0)
    372             ]
    373         else:
--> 374             responses = [self.connector.fetch(query) for query in queries]
    375         responses = [res["boolean"] for res in responses]
    376         return False not in responses

~\anaconda38\lib\site-packages\cachetools\__init__.py in wrapper(self, *args, **kwargs)
    565         except KeyError:
    566             pass  # key not found
--> 567         v = method(self, *args, **kwargs)
    568         try:
    569             c[k] = v

~\anaconda38\lib\site-packages\pyrdf2vec\connectors.py in fetch(self, query)
    134         url = f"{self.endpoint}/query?query={parse.quote(query)}"
    135         with requests.get(url, headers=self._headers) as res:
--> 136             return res.json()
    137
    138     def get_query(self, entity: str, preds: Optional[List[str]] = None) -> str:

~\anaconda38\lib\site-packages\requests\models.py in json(self, **kwargs)
    896             # used.
    897             pass
--> 898         return complexjson.loads(self.text, **kwargs)
    899
    900     @property

~\anaconda38\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    355             parse_int is None and parse_float is None and
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:
    359         cls = JSONDecoder

~\anaconda38\lib\json\decoder.py in decode(self, s, _w)
    335
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

~\anaconda38\lib\json\decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
GillesVandewiele commented 1 year ago

It seems (at first sight) that the SPARQL endpoint is returning an empty result. Either something is wrong with the endpoint, or with the entities you pass to it (do they exist?).
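For example, a hypothetical diagnostic along these lines shows what the endpoint actually returns before JSON decoding. It rebuilds the same URL shape the default connector uses (note the extra /query path segment visible in the traceback); the entity QID is one that appears later in this thread, and the Accept header is an assumption:

# Hypothetical diagnostic: reproduce the request the default SPARQLConnector
# builds and inspect the raw response instead of calling res.json() directly.
from urllib import parse
import requests

endpoint = "https://query.wikidata.org/sparql"
query = "ASK WHERE { <http://www.wikidata.org/entity/Q156622> ?p ?o . }"

url = f"{endpoint}/query?query={parse.quote(query)}"  # same shape as connectors.py line 134
res = requests.get(url, headers={"Accept": "application/sparql-results+json"})
print(res.status_code)   # not the 200 + JSON result you would expect
print(res.text[:200])    # an empty or HTML body explains the JSONDecodeError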

STALINFIGUEROAALAVA commented 1 year ago

The entities do exist; I have them in a local file. I sent you the code and the data: codigotexto.txt, pintores_italianosGML.csv
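For reference, a hedged sketch of loading the entity list from the attached CSV with pandas; the column name "entity" is an assumption, since the file contents are not shown in this thread:

# Hypothetical loader for the attached CSV; adjust the column name to match the file.
import pandas as pd

df = pd.read_csv("pintores_italianosGML.csv")
entities = df["entity"].tolist()  # "entity" is an assumed column name
print(len(entities), entities[:3])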

bsteenwi commented 1 year ago

Our SPARQL connector builds query URLs with this concatenation: url = f"{self.endpoint}/query?query={parse.quote(query)}"

but the Wikidata endpoint expects: https://query.wikidata.org/sparql?query={SPARQL}. The best solution for now is to create your own connector based on our SPARQLConnector:

# Imports needed for this snippet (also shown in the full example further below):
import operator
import requests
from urllib import parse
from cachetools import cachedmethod
from pyrdf2vec.connectors import SPARQLConnector
from pyrdf2vec.typings import Response

class WikiDataConnector(SPARQLConnector):
    async def _fetch(self, query) -> Response:
        # Wikidata expects ?query= right after /sparql, without the extra /query segment.
        url = f"{self.endpoint}?query={parse.quote(query)}"
        async with self._asession.get(url, headers=self._headers) as res:
            return await res.json()

    @cachedmethod(operator.attrgetter("cache"))
    def fetch(self, query: str) -> Response:
        url = f"{self.endpoint}?query={parse.quote(query)}"
        with requests.get(url, headers=self._headers) as res:
            return res.json()

By default the SPARQLConnector is used within our KG. To change this, I think you will have to create the KG object and change the connector.

knowledge_graph = KG("https://query.wikidata.org/sparql", is_remote=True)
knowledge_graph.connector = WikiDataConnector(endpoint = "https://query.wikidata.org/sparql")

I haven't had time to test this yet, but I hope it already helps you.
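A quick smoke test (my own sketch, assuming the WikiDataConnector above is defined) would be to run a single ASK query, the same kind of query kg.is_exist issues, through the new connector and check that a JSON dict comes back:

# Hypothetical smoke test for the custom connector defined above.
connector = WikiDataConnector(endpoint="https://query.wikidata.org/sparql")
result = connector.fetch(
    "ASK WHERE { <http://www.wikidata.org/entity/Q156622> ?p ?o . }"
)
print(result)  # expect a SPARQL JSON result such as {'head': {}, 'boolean': True}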

STALINFIGUEROAALAVA commented 1 year ago

Thanks, Bram. SPARQLConnector does not work for me. Do you have an example?

bsteenwi commented 1 year ago

This works for me:

import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec.connectors import SPARQLConnector
from urllib import parse
from pyrdf2vec.typings import Literal, Response
import operator
from cachetools import Cache, TTLCache, cachedmethod
import requests
import time

class WikiDataConnector(SPARQLConnector):
    async def _fetch(self, query) -> Response:
        url = f"{self.endpoint}?query={parse.quote(query)}"
        print(url)
        async with self._asession.get(url, headers=self._headers) as res:
            return await res.json()

    @cachedmethod(operator.attrgetter("cache"))
    def fetch(self, query: str) -> Response:
        url = f"{self.endpoint}?query={parse.quote(query)}"

        with requests.get(url, headers=self._headers) as res:
            print(res)
            time.sleep(1)
            return res.json()

if __name__ == "__main__":

    entities = [
        "http://www.wikidata.org/entity/Q156622",
        "http://www.wikidata.org/entity/Q368254",
        "http://www.wikidata.org/entity/Q1117749",
    ]
    print(entities)

    # Define our knowledge graph (here: the Wikidata SPARQL endpoint).
    knowledge_graph = KG("https://query.wikidata.org/sparql")
    knowledge_graph.connector = WikiDataConnector(endpoint="https://query.wikidata.org/sparql")

    # Create our transformer, setting the embedding & walking strategy.
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=10),
        walkers=[RandomWalker(2, 10, with_reverse=False)],
        # verbose=1
    )
    # Get our embeddings.
    embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
    print(embeddings)

Keep in mind that I didn't use n_jobs and had to include a time.sleep(1) after every query to avoid overloading the Wikidata endpoint.

STALINFIGUEROAALAVA commented 1 year ago

Thank you very much Bram

STALINFIGUEROAALAVA commented 1 year ago

Bram, how do I see the embeddings produced by the code below? It only prints <Response [200]> <Response [200]> <Response [200]> and seems to loop forever.

# Create our transformer, setting the embedding & walking strategy.
transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(2, 10, with_reverse=False)],
    # verbose=1
)
# Get our embeddings.
embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
print(embeddings)
GillesVandewiele commented 1 year ago

You are extracting 10 random walks per entity (which probably isn't sufficient) for a list of 3800 entities. Given the time.sleep(1), this means a minimum of 38,000 seconds.
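As a rough back-of-the-envelope check of that lower bound (the entity count is the figure quoted above, and at least one SPARQL query per walk is assumed):

# Rough lower bound on runtime, following the reasoning above.
n_entities = 3800        # entities in the CSV (figure quoted above)
walks_per_entity = 10    # max_walks passed to RandomWalker
sleep_per_query = 1      # the time.sleep(1) added in WikiDataConnector.fetch

min_seconds = n_entities * walks_per_entity * sleep_per_query  # assumes >= 1 query per walk
print(min_seconds, "s =", round(min_seconds / 3600, 1), "h")   # 38000 s ≈ 10.6 h

Re-enabling verbose=1 on the transformer (it is commented out in the snippet above) should also give more feedback than just the printed <Response [200]> lines.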

STALINFIGUEROAALAVA commented 1 year ago

Thank you, Gilles. I will run the tests on IBM Watson.

bsteenwi commented 1 year ago

You can update the Wikidata connector to make requests faster. Take into account that the Wikidata endpoint has a hard query deadline configured, which is set to 60 seconds. There are also the following limits:

- One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds.
- One client is allowed 30 error queries per minute.

Clients exceeding the limits above are throttled with HTTP code 429. Use the Retry-After header to see when the request can be repeated. If the client ignores 429 responses and continues to produce requests over the limits, it can be temporarily banned from the service.

Clients who don’t comply with the User-Agent policy may be blocked completely – make sure to send a good User-Agent header.

Every query will timeout when it takes more time to execute than this configured deadline. You may want to optimize the query or report a problematic query here.

Also note that currently access to the service is limited to 5 parallel queries per IP. The above limits are subject to change depending on resources and usage patterns.
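Building on those limits, here is a minimal, untested sketch of a connector variant that sends a descriptive User-Agent and backs off on HTTP 429 via Retry-After. The class name, the placeholder User-Agent string, and the assumption that _headers is a plain dict (as it is used in the fetch methods above) are mine, not part of pyRDF2Vec:

import operator
import time
from urllib import parse

import requests
from cachetools import cachedmethod

from pyrdf2vec.connectors import SPARQLConnector
from pyrdf2vec.typings import Response


class PoliteWikiDataConnector(SPARQLConnector):
    """Hypothetical variant of the WikiDataConnector above that follows the
    Wikidata User-Agent policy and honours 429/Retry-After throttling."""

    USER_AGENT = "pyRDF2Vec-example/0.1 (mailto:your-address@example.org)"  # placeholder contact

    @cachedmethod(operator.attrgetter("cache"))
    def fetch(self, query: str) -> Response:
        url = f"{self.endpoint}?query={parse.quote(query)}"
        headers = {**self._headers, "User-Agent": self.USER_AGENT}  # assumes _headers is a dict
        while True:
            res = requests.get(url, headers=headers)
            if res.status_code == 429:
                # Back off for as long as Wikidata suggests (default to 5 s if the header is missing).
                time.sleep(int(res.headers.get("Retry-After", "5")))
                continue
            return res.json()

Note that this only covers the synchronous fetch; the async _fetch used when mul_req=True would need the same treatment.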

STALINFIGUEROAALAVA commented 1 year ago

Thank you very much, Gilles and Bram. Everything is working very well. I would like to resolve one last case privately. To which email address can I send my Google Colab file?