neelguha / simple-wikidata-db

A set of Python scripts for preprocessing the Wikidata JSON dump and running simple queries in an efficient manner.

KeyError 'datatype' when preprocessing the latest wikidata dump (as of April 16) #7

Open phucdoitoan opened 5 months ago

phucdoitoan commented 5 months ago

Hi,

Thank you for the useful GitHub code.

When I run preprocess_dump.py on the latest Wikidata dump (as of April 16) with 28 processes, I get the following error in process 28. However, the code still seems to be running and producing processed tables.

Do you know whether this error is something I should worry about, or can I just ignore it?

Thank you a lot!

Process Process-28:
Traceback (most recent call last):
  File "**/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "**/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "**/simple-wikidata-db/simple_wikidata_db/preprocess_utils/worker_process.py", line 151, in process_data
    out_queue.put(process_json(ujson.loads(json_obj), language_id))
  File "**/simple-wikidata-db/simple_wikidata_db/preprocess_utils/worker_process.py", line 91, in process_json
    datatype = claim['mainsnak']['datatype']
KeyError: 'datatype'

neelguha commented 5 months ago

I haven't had a chance to try to reproduce the error, but it looks like at least one of the claim objects doesn't have a datatype key. I haven't seen this error before, so I wonder if it's something specific to the most recent dump?

One small fix would be to disregard all claims which don't have a datatype key, and then count how many you drop (or write them to some error log file)?
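A minimal sketch of that suggestion, for illustration only. It is not the repository's code: the helper name `iter_claims_with_datatype`, the `skipped` counter, and the sample entity are all hypothetical. It assumes the usual Wikidata JSON layout, where an entity's `claims` field maps each property ID to a list of claims, each with a `mainsnak` dict.

```python
from collections import Counter


def iter_claims_with_datatype(entity, skipped):
    """Yield (property_id, claim) pairs whose mainsnak has a 'datatype' key.

    Claims without the key are not yielded; they are tallied per property
    in `skipped` so they can be reported or logged afterwards.
    (Hypothetical helper, not part of simple-wikidata-db.)
    """
    for property_id, claims in entity.get("claims", {}).items():
        for claim in claims:
            mainsnak = claim.get("mainsnak", {})
            if "datatype" not in mainsnak:
                skipped[property_id] += 1
                continue
            yield property_id, claim


# Made-up entity: one well-formed claim and one missing 'datatype'.
entity = {
    "id": "Q1",
    "claims": {
        "P31": [{"mainsnak": {"snaktype": "value", "property": "P31",
                              "datatype": "wikibase-item"}}],
        "P999": [{"mainsnak": {"snaktype": "value", "property": "P999"}}],
    },
}

skipped = Counter()
kept = list(iter_claims_with_datatype(entity, skipped))
print(f"kept {len(kept)} claim(s), skipped {sum(skipped.values())}: {dict(skipped)}")
```

With multiple worker processes, each process would keep its own counter, so the per-process counts would need to be written out or merged at the end to get a total.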

phucdoitoan commented 5 months ago

Hi there,

Thanks a lot for your reply. I don't know much about Wikidata, so I'm not sure whether the missing datatype key is something recent. I'll try your suggestion. However, even with the error reported, the code seems to work fine and all the output tables look okay.