scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

[Deleted] Explore formatting data with SQLite rather than Python directly #47

Closed andrewtavis closed 7 months ago

andrewtavis commented 1 year ago

Terms

Description

One of the major issues with Scribe-Data at the time of writing is that the formatting for all the language data lives in relatively large and complex format_WORD_TYPE.py scripts. A general thought within the team is that this could be simplified by moving these processes over to SQLite via sqlite3. Rather than loading in JSON files and formatting them using conditionals in a dictionary structure, the raw JSONs could be loaded as a table, with the final output being a conditional selection from this table.
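To make the idea concrete, here is a minimal sketch of what such a proof of concept might look like using only the standard library. The record fields (`lexemeID`, `singular`, `plural`, `gender`), the `nouns` table, the `noPlural` placeholder and the single-letter gender codes are all assumptions for illustration; they are not taken from the actual format_WORD_TYPE.py scripts or the real query output.

```python
import json
import sqlite3

# Hypothetical raw records standing in for a JSON file produced by a query.
# The field names here are illustrative assumptions, not Scribe-Data's schema.
raw_json = """
[
  {"lexemeID": "L1", "singular": "Kind", "plural": "Kinder", "gender": "neuter"},
  {"lexemeID": "L2", "singular": "Tisch", "plural": "Tische", "gender": "masculine"},
  {"lexemeID": "L3", "singular": "Milch", "plural": null, "gender": "feminine"}
]
"""


def load_nouns(conn, records):
    """Load the raw JSON records into a table, one row per record."""
    conn.execute(
        "CREATE TABLE nouns ("
        "lexeme_id TEXT PRIMARY KEY, singular TEXT, plural TEXT, gender TEXT)"
    )
    # Named placeholders are filled from each record's keys; JSON null becomes NULL.
    conn.executemany(
        "INSERT INTO nouns VALUES (:lexemeID, :singular, :plural, :gender)",
        records,
    )


def format_nouns(conn):
    """Express the formatting conditionals as a single SELECT."""
    rows = conn.execute(
        """
        SELECT singular,
               COALESCE(plural, 'noPlural') AS plural,
               CASE gender
                   WHEN 'masculine' THEN 'M'
                   WHEN 'feminine' THEN 'F'
                   WHEN 'neuter' THEN 'N'
                   ELSE ''
               END AS form
        FROM nouns
        ORDER BY singular
        """
    )
    return {singular: {"plural": plural, "form": form} for singular, plural, form in rows}


conn = sqlite3.connect(":memory:")
load_nouns(conn, json.loads(raw_json))
formatted = format_nouns(conn)
print(json.dumps(formatted, ensure_ascii=False, indent=2))
```

The point of the sketch is that the branching that would otherwise sit in Python conditionals moves into `COALESCE` and `CASE` expressions, so each word type could in principle be described by one query rather than one script.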

This issue could just be the creation of a proof of concept that this can work, and from there we expand to converting the formatting processes over 🚀

There's also the potential to do this with SPARQL on the Wikidata end, but we already need to break up the files because the rate limits are hit, which would only get worse with more complex selections. I'd say that this would be the ideal way of doing this :)

Contribution

Happy to work on this myself or support someone who'd like to contribute! 😊

andrewtavis commented 1 year ago

CC @lillian-mo and @wkyoshida for the discussion here :)

andrewtavis commented 7 months ago

Closing this issue, as the goal is now for #59 to cover this, along with the decision that our data exports should directly match Wikidata data structures. Rather than Scribe creating combined data based on strings, we'll instead stick to the given lexeme-based entries and change how the iOS app and others reference the provided data packs. The work for this can thus be handled in #59 😊