Can you paste a complete error stacktrace?
JSONDecodeError Traceback (most recent call last)
It seems (at first sight) that the SPARQL endpoint is returning an empty result. Either something is wrong with the endpoint, or with the entities you pass to it (do they exist?).
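For example, you could sanity-check an entity with a quick ASK query against the endpoint (a minimal sketch using requests, outside of pyRDF2Vec; Q42 is just an example, replace it with one of your own IRIs):

import requests
from urllib import parse

endpoint = "https://query.wikidata.org/sparql"
query = "ASK { <http://www.wikidata.org/entity/Q42> ?p ?o . }"
url = f"{endpoint}?query={parse.quote(query)}"
res = requests.get(url, headers={"Accept": "application/sparql-results+json"})
print(res.json()["boolean"])  # True if the entity has at least one triple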
The entities do exist; I have them in a local file. I've sent you the code and the data: codigotexto.txt and pintores_italianosGML.csv.
Our SPARQL connector builds its requests with this URL concatenation:
url = f"{self.endpoint}/query?query={parse.quote(query)}"
but the Wikidata endpoint expects https://query.wikidata.org/sparql?query={SPARQL}. The best solution for now is to create your own connector based on our SPARQLConnector:
class WikiDataConnector(SPARQLConnector):
    async def _fetch(self, query) -> Response:
        # Wikidata expects ?query=... directly on /sparql, without the /query path.
        url = f"{self.endpoint}?query={parse.quote(query)}"
        async with self._asession.get(url, headers=self._headers) as res:
            return await res.json()

    @cachedmethod(operator.attrgetter("cache"))
    def fetch(self, query: str) -> Response:
        # Synchronous variant with the same URL scheme.
        url = f"{self.endpoint}?query={parse.quote(query)}"
        with requests.get(url, headers=self._headers) as res:
            return res.json()
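To see concretely why the default URL fails against Wikidata, you can compare the two URLs built for the same query (a small sketch; the query and entity are only examples):

from urllib import parse

endpoint = "https://query.wikidata.org/sparql"
query = "SELECT ?p ?o WHERE { <http://www.wikidata.org/entity/Q156622> ?p ?o . }"

# Default connector: appends /query to the endpoint. Wikidata does not serve
# SPARQL there, so the response body is likely not JSON and res.json() raises
# a JSONDecodeError.
print(f"{endpoint}/query?query={parse.quote(query)}")

# What Wikidata expects: the query parameter directly on /sparql.
print(f"{endpoint}?query={parse.quote(query)}")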
By default the SPARQLConnector is used within our KG. To change this, I think you will have to create the KG object and change the connector.
knowledge_graph = KG("https://query.wikidata.org/sparql", is_remote=True)
knowledge_graph.connector = WikiDataConnector(endpoint="https://query.wikidata.org/sparql")
I haven't had the time to test this yet, but I hope it already helps you.
Thanks, Bram. The SPARQLConnector does not work for me. Do you have an example?
This works for me:
import pandas as pd
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec.connectors import SPARQLConnector
from urllib import parse
from pyrdf2vec.typings import Literal, Response
import operator
from cachetools import Cache, TTLCache, cachedmethod
import requests
import time


class WikiDataConnector(SPARQLConnector):
    async def _fetch(self, query) -> Response:
        url = f"{self.endpoint}?query={parse.quote(query)}"
        print(url)
        async with self._asession.get(url, headers=self._headers) as res:
            return await res.json()

    @cachedmethod(operator.attrgetter("cache"))
    def fetch(self, query: str) -> Response:
        url = f"{self.endpoint}?query={parse.quote(query)}"
        with requests.get(url, headers=self._headers) as res:
            print(res)
            time.sleep(1)
            return res.json()


if __name__ == "__main__":
    entities = [
        "http://www.wikidata.org/entity/Q156622",
        "http://www.wikidata.org/entity/Q368254",
        "http://www.wikidata.org/entity/Q1117749",
    ]
    print(entities)

    # Define our knowledge graph (here: the Wikidata SPARQL endpoint).
    knowledge_graph = KG("https://query.wikidata.org/sparql")
    knowledge_graph.connector = WikiDataConnector(endpoint="https://query.wikidata.org/sparql")

    # Create our transformer, setting the embedding & walking strategy.
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=10),
        walkers=[RandomWalker(2, 10, with_reverse=False)],
        # verbose=1
    )

    # Get our embeddings.
    embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
    print(embeddings)
Keep in mind that I didn't use n_jobs and had to include a time.sleep(1) after every query to avoid overloading the Wikidata endpoint.
Thank you very much, Bram.
Bram, how do I see the embeddings produced by the code below? It only prints <Response [200]> <Response [200]> <Response [200]> and keeps looping.
# Create our transformer, setting the embedding & walking strategy.
transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(2, 10, with_reverse=False)],
    # verbose=1
)

# Get our embeddings.
embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
print(embeddings)
You are extracting 10 random walks (which probably isn't sufficient) per entity for a list of 3,800 entities. Given the time.sleep(1), this means a minimum of 38,000 seconds.
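Roughly, as a back-of-the-envelope estimate (a sketch assuming one request, and therefore one sleep, per walk):

n_entities = 3800
walks_per_entity = 10
seconds_per_query = 1  # the time.sleep(1) in the connector above
total_seconds = n_entities * walks_per_entity * seconds_per_query
print(total_seconds, "seconds, i.e. about", round(total_seconds / 3600, 1), "hours")  # 38000 s ≈ 10.6 h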
Thank you, Gilles. I will run the tests on IBM Watson.
You can update the Wikidata connector to make requests faster. Take into account that the Wikidata endpoint has a hard query deadline configured, which is set to 60 seconds. There are also the following limits:
One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds. One client is allowed 30 error queries per minute.
Clients exceeding the limits above are throttled with HTTP code 429. Use the Retry-After header to see when the request can be repeated. If the client ignores 429 responses and continues to produce requests over the limits, it can be temporarily banned from the service.
Clients who don't comply with the User-Agent policy may be blocked completely, so make sure to send a good User-Agent header.
Every query will time out when it takes more time to execute than this configured deadline. You may want to optimize the query or report a problematic query.
Also note that access to the service is currently limited to 5 parallel queries per IP. The above limits are subject to change depending on resources and usage patterns.
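If you do hit HTTP 429, a small helper along these lines can respect those limits (a sketch, not part of pyRDF2Vec; the function name and User-Agent string are placeholders):

import time
import requests
from urllib import parse

def polite_fetch(endpoint, query, max_retries=3):
    url = f"{endpoint}?query={parse.quote(query)}"
    headers = {
        "Accept": "application/sparql-results+json",
        # Identify your client, as required by the Wikidata User-Agent policy.
        "User-Agent": "my-rdf2vec-experiment/0.1 (contact: you@example.org)",
    }
    for _ in range(max_retries):
        res = requests.get(url, headers=headers)
        if res.status_code == 429:
            # Throttled: wait as long as the server asks, then retry.
            time.sleep(int(res.headers.get("Retry-After", 60)))
            continue
        res.raise_for_status()
        return res.json()
    raise RuntimeError("Still throttled after several retries")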
Thank you very much, Gilles and Bram. Everything is going very well. I would like to solve one last case internally. To which email can I share my Google Colab file?
Question
knowledge_graph = KG("https://query.wikidata.org/sparql", is_remote=True)

transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
    verbose=1
)

# Get our embeddings.
embeddings = transformer.fit_transform(knowledge_graph, entities)
print(embeddings)
I pass the correct entities, but when I try to compute the embeddings on the KG I get the error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)