scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Investigate and implement `LIMIT` and `OFFSET` within queries #156

Open andrewtavis opened 3 months ago

andrewtavis commented 3 months ago

Description

This issue is a new version of the deleted #130, which came from #124, and is also related to #68. At some point Scribe will likely need LIMIT and OFFSET within its queries so that they can finish. A solution was found for the issue in #124, but there could come a time when the queries no longer finish. Figuring this out would give us confidence that the query process for Scribe-Data is robust, regardless of the size of the Wikidata Query Service response.

Contribution

Would be very happy to investigate this going forward and help implement it. The general idea is that we would first query the total number of results for a language and word type pair, and then break the query down by iterating LIMIT and OFFSET over that total. Keeping the number of results returned per query to ~50,000 should be fine, but we can also test this with different queries.
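As a rough sketch of the idea above (not existing Scribe-Data code): given a total that has already been fetched with a separate count query, we could generate one query per page. The function name `paginate_query`, the `PAGE_SIZE` constant and the example query are all hypothetical, and the 50,000 default just mirrors the figure mentioned above.

```python
from typing import Iterator

# Assumption: page size taken from the ~50,000 figure discussed in this issue.
PAGE_SIZE = 50_000


def paginate_query(base_query: str, total: int, page_size: int = PAGE_SIZE) -> Iterator[str]:
    """Yield copies of base_query with LIMIT/OFFSET appended so the pages cover `total` rows.

    Hypothetical helper: `total` is assumed to come from a prior COUNT query
    for the given language and word type pair.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")

    # One page per step of page_size; the last page may return fewer rows.
    for offset in range(0, total, page_size):
        yield f"{base_query.rstrip()}\nLIMIT {page_size}\nOFFSET {offset}"


# Example usage with an illustrative query and a made-up total of 120,000 results.
example_query = "SELECT ?lexeme WHERE { ?lexeme a ontolex:LexicalEntry } ORDER BY ?lexeme"
pages = list(paginate_query(example_query, total=120_000))
```

Note that the base query would likely need an `ORDER BY` for this to work reliably, since SPARQL does not guarantee a stable result order across separate `LIMIT`/`OFFSET` requests otherwise.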

Note that this issue is not of high priority, but could be something that we look at later :)

andrewtavis commented 3 months ago

CC @wkyoshida 😊

henrikth93 commented 2 months ago

I am interested in this!

andrewtavis commented 2 months ago

Hey @henrikth93 👋 Let's maybe hold off on this one until GSoC's all done, as there's no real need for it now :) We can discuss in the sync or a call between the two of us what might be the best next thing to work on!