Extract Wikidata loading and data exporting to utils.py

scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction

GNU General Public License v3.0

Extract Wikidata loading and data exporting to utils.py #80

Closed andrewtavis closed 6 months ago

andrewtavis commented 7 months ago

Terms

[X] I have searched open and closed feature requests
[X] I agree to follow Scribe-Data's Code of Conduct

Description

While updating the data formatting process, the steps that load data and export it were standardized so that they take a LANGUAGE variable as well as a QUERIED_DATA_TYPE variable. This can be seen, for example, in German/nouns/format_nouns.py. It would be great if the lines that import the Wikidata data, as well as those that export the final output to the formatted_data directories, could be extracted into common functions that are then imported and run from each of the formatting files 😊
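As a rough illustration of the request, the shared helpers could look something like the sketch below. The function names, signatures and directory layout are assumptions made for this example and are not taken from the repository; only the `{QUERIED_DATA_TYPE}_queried.json` file name and the LANGUAGE / QUERIED_DATA_TYPE inputs come from the issue itself.

```python
import json
from pathlib import Path


def load_queried_data(base_dir, language, data_type):
    """Load <base_dir>/<language>/<data_type>/<data_type>_queried.json.

    Hypothetical helper: the directory layout is assumed for this sketch.
    Returns the parsed JSON together with the path it was read from.
    """
    data_path = Path(base_dir) / language / data_type / f"{data_type}_queried.json"
    with open(data_path, encoding="utf-8") as f:
        return json.load(f), data_path


def export_formatted_data(formatted_data, export_dir, language, data_type):
    """Write formatted data to <export_dir>/<language>/formatted_data/<data_type>.json.

    Hypothetical helper: the export location is assumed for this sketch.
    Returns the path that was written.
    """
    export_path = Path(export_dir) / language / "formatted_data" / f"{data_type}.json"
    export_path.parent.mkdir(parents=True, exist_ok=True)
    with open(export_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as umlauts readable.
        json.dump(formatted_data, f, ensure_ascii=False, indent=2)
    return export_path
```

Each formatting file would then only need to set LANGUAGE and QUERIED_DATA_TYPE, call the loader, do its language-specific formatting, and call the exporter.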

Contribution

Happy to support someone on this or get to it myself eventually! This is a great good first issue for someone wanting to get into Scribe a bit 😊

andrewtavis commented 7 months ago

Hey @shashank-iitbhu 👋 Can you write in here so I can assign :)

shashank-iitbhu commented 7 months ago

QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json"

I can't seem to find QUERIED_DATA_FILE. Do I need to run another operation before trying to run format_nouns.py?

andrewtavis commented 7 months ago

Hey @shashank-iitbhu 👋 This was part of the demonstration on Saturday, but it wasn't quite as visible. If you look at the end of the formatting files you'll see that this file is deleted: we query this JSON from Wikidata and then delete it after the formatting step. Hence the files aren't in the repo :)
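To make the lifecycle of the queried file concrete, here is a minimal sketch of the query, format and delete sequence. Both function names are hypothetical and the query step is a stand-in that just writes rows to disk; the only behavior taken from the thread is that the formatting step removes the `*_queried.json` file when it finishes.

```python
import json
import os


def query_step(data_path, rows):
    """Stand-in for the Wikidata query: save the results as *_queried.json."""
    with open(data_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False)


def format_step(data_path):
    """Load the queried JSON, format it, then delete the intermediate file."""
    with open(data_path, encoding="utf-8") as f:
        rows = json.load(f)
    # Placeholder formatting: index each row by its "noun" value.
    formatted = {row["noun"]: row for row in rows}
    # The queried file is removed at the end, so it never lands in the repo.
    os.remove(data_path)
    return formatted
```

Running `format_step` a second time without a fresh `query_step` raises FileNotFoundError, because the file it needs has already been deleted.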

andrewtavis commented 7 months ago

Is the output from a run of update_data.py? If it's just the formatting step being run, then no stress! You won't have the file in that case.

shashank-iitbhu commented 7 months ago

Oh, got it! The output was from the formatting step only. I am able to run update_data.py successfully.

ikeadeoyin commented 7 months ago

@shashank-iitbhu how did you resolve this issue? I am having the same error when I try to run format_nouns.py.

shashank-iitbhu commented 7 months ago

> @shashank-iitbhu how did you resolve this issue? I am having the same error when I try to run format_nouns.py.

Are you getting the "scribe_data module not found" error or the "***_queried.json file not found" error? Adding the lines mentioned in the Element chat should resolve the scribe_data module not found error. As for the second one: update_data.py deletes the ***_queried.json files after the formatting process, so they are not present in the codebase. That's why format_nouns.py can't be run independently.
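One way a formatting script could make this situation easier to diagnose is to check for the file up front and say where it comes from. The helper below is purely hypothetical (its name and message are assumptions for this sketch, not code from the repository):

```python
from pathlib import Path


def require_queried_file(data_path):
    """Return data_path as a Path, or raise with a hint if the file is missing."""
    path = Path(data_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path.name} was not found. The queried JSON files are created "
            "and deleted during a run of update_data.py, so run that script "
            "instead of calling the formatting file directly."
        )
    return path
```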

ikeadeoyin commented 7 months ago

I am getting the ***_queried.json file not found error


python3 src/scribe_data/extract_transform/languages/German/nouns/format_nouns.py
Traceback (most recent call last):
  File "/Users/ikeadeoyin/Documents/WikimediaGSoC2024/Scribe-Data/src/scribe_data/extract_transform/languages/German/nouns/format_nouns.py", line 40, in <module>
    with open(data_path, encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/ikeadeoyin/Documents/WikimediaGSoC2024//Scribe-Data/src/scribe_data/extract_transform/languages/German/nouns/nouns_queried.json'

ikeadeoyin commented 7 months ago

So I need to run update_data.py before running format_nouns.py?

shashank-iitbhu commented 7 months ago

> So I need to run update_data.py before running format_nouns.py?

update_data.py is the main data process which triggers SPARQL queries to query language data from Wikidata and runs the formatting operation by running all the format_***.py files. You just need to run update_data.py, no need to run format_nouns.py after that.

ikeadeoyin commented 7 months ago

> So I need to run update_data.py before running format_nouns.py?

> update_data.py is the main data process which triggers SPARQL queries to query language data from Wikidata and runs the formatting operation by running all the format_***.py files. You just need to run update_data.py, no need to run format_nouns.py after that.

Thank you so much! I was able to run update_data.py successfully.

andrewtavis commented 7 months ago

Thanks to you both for working this through!