scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Query timeout limit reached while updating German Nouns #124

Closed shashank-iitbhu closed 5 months ago

shashank-iitbhu commented 7 months ago

Behavior

Description

(scribedev) shashankmittal@ShashanksLaptop Scribe-Data % python3 src/scribe_data/extract_transform/wikidata/update_data.py '["German"]' '["nouns", "verbs"]' 
Data updated:   0%|                                                                                                                   | 0/2 [00:00<?, ?dirs/s]Querying and formatting German nouns
Data updated:   0%|                                                                                                                   | 0/2 [01:00<?, ?dirs/s]
Traceback (most recent call last):
  File "/Users/shashankmittal/Documents/Developer/scribe/Scribe-Data/src/scribe_data/extract_transform/wikidata/update_data.py", line 141, in <module>
    results = sparql.query().convert()
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py", line 1196, in convert
    return self._convertJSON()
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/site-packages/SPARQLWrapper/Wrapper.py", line 1059, in _convertJSON
    json_str = json.loads(self.response.read().decode("utf-8"))
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/envs/scribedev/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 320797 column 115 (char 6713171)

Query builder Link

The query time limit is reached, which is why results = sparql.query().convert() in update_data.py throws a json.decoder.JSONDecodeError (Invalid control character at: line 320797 column 115 (char 6713171)): the response returned by sparql.query() contains the endpoint's timeout error text rather than valid JSON.
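
To illustrate the diagnosis, here is a minimal, hypothetical sketch (not the code in update_data.py) that reads the raw response before decoding, so the endpoint's timeout message is surfaced instead of an opaque JSONDecodeError:

import json

from SPARQLWrapper import JSON, SPARQLWrapper

# Hypothetical sketch: the query below is a trivial placeholder standing in
# for the German nouns query.
nouns_query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(nouns_query)

raw = sparql.query().response.read().decode("utf-8")
try:
    results = json.loads(raw)
except json.JSONDecodeError:
    # The body isn't JSON, so it's most likely the endpoint's timeout page.
    raise RuntimeError(f"Wikidata query failed or timed out:\n{raw[:500]}")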

Suggested Changes

shashank-iitbhu commented 7 months ago

This can be reproduced by running:

python3 src/scribe_data/extract_transform/wikidata/update_data.py '["German"]' '["nouns", "verbs"]'

@andrewtavis Are you able to reproduce this issue? If so, I can open a PR with the proposed changes.

andrewtavis commented 7 months ago

I can confirm on my end, @shashank-iitbhu:

json.decoder.JSONDecodeError: Invalid control character at: line 320797 column 115 (char 6713171)

Two questions to decide on this:

All in all, it's great that you figured this out and suggested solutions! As you can see from the verbs queries, this is not the first time this has happened 😅

andrewtavis commented 7 months ago

Confirmed with the Wikidata team that splitting the query between nouns and proper nouns would be the initial path forward, but OFFSET could work if it continues to be problematic :)
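
For context, a rough sketch of what that split could look like, written against the Wikidata lexeme RDF model rather than copied from Scribe-Data's actual queries (the QIDs noted in the comments and the query body are assumptions for illustration only):

from SPARQLWrapper import JSON, SPARQLWrapper

# One query per lexical category instead of a single combined query.
# Assumed QIDs: Q1084 (noun), Q147276 (proper noun), Q188 (German); the query
# body is a placeholder, not Scribe-Data's German nouns query.
category_queries = {
    "nouns": "wd:Q1084",
    "proper_nouns": "wd:Q147276",
}

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

results_by_category = {}
for name, category in category_queries.items():
    sparql.setQuery(f"""
        SELECT ?lexeme ?lemma WHERE {{
          ?lexeme dct:language wd:Q188 ;
                  wikibase:lexicalCategory {category} ;
                  wikibase:lemma ?lemma .
        }}
    """)
    results_by_category[name] = sparql.query().convert()["results"]["bindings"]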

andrewtavis commented 7 months ago

Note that I just tried querying only the singulars of the nouns alone, not the proper nouns, and it's still failing. At this point it might make sense to use LIMIT and OFFSET.
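
If it comes to that, a paged version could look roughly like this hypothetical sketch (the page size and query body are placeholders, not Scribe-Data's actual query):

from SPARQLWrapper import JSON, SPARQLWrapper

PAGE_SIZE = 10000  # placeholder page size
base_query = """
SELECT ?lexeme ?lemma WHERE {
  # Placeholder body standing in for the German nouns query.
  ?lexeme dct:language wd:Q188 ;
          wikibase:lemma ?lemma .
}
ORDER BY ?lexeme
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

all_bindings = []
offset = 0
while True:
    # Append LIMIT/OFFSET so each request stays under the timeout;
    # ORDER BY keeps the paging deterministic across requests.
    sparql.setQuery(f"{base_query}\nLIMIT {PAGE_SIZE}\nOFFSET {offset}")
    page = sparql.query().convert()["results"]["bindings"]
    all_bindings.extend(page)
    if len(page) < PAGE_SIZE:
        break
    offset += PAGE_SIZE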

andrewtavis commented 6 months ago

CC @daveads, do you want to write in here so I can assign this issue to you?

daveads commented 6 months ago

yup @andrewtavis

andrewtavis commented 5 months ago

Lots of commits above, and after a discussion with a Wikidata admin today I was able to get it working with changes to the query itself. This issue has been great for Scribe, as it revealed a lot of parts of the queries that weren't necessary and were slowing things down. One more note on this: if a query stalls, it may be worth removing the labeling service if it's being used, as there's a lot of overhead in running it over the results of a large query.
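
As a purely hypothetical illustration of that last note (not code from the repo), a retry path could strip the label service before rerunning a query that timed out:

def strip_label_service(query: str) -> str:
    # Hypothetical helper: drop the wikibase:label SERVICE clause before
    # retrying, since the label service adds overhead on large result sets.
    # Assumes the SERVICE clause sits on a single line of the query.
    return "\n".join(
        line
        for line in query.splitlines()
        if "SERVICE wikibase:label" not in line
    )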

With the above being said, there have been lots of improvements, and I'm super grateful to @shashank-iitbhu for opening this and to @daveads for all the conversations that led us to these solutions! 😊 Thanks so much!