scribe-org / Scribe-Data

Wikidata and Wikipedia language data extraction
GNU General Public License v3.0

Implement CLI `--total` functionality #147

Open andrewtavis opened 2 weeks ago

andrewtavis commented 2 weeks ago

Description

This issue would implement the --total (-t) functionality of the Scribe-Data CLI. This functionality would query Wikidata for the total number of lexemes for a given language, a given word type, or a combination of the two. Usage would be:

scribe-data total -l German -wt nouns  # number of German noun lexemes
scribe-data total -l German  # number of German lexemes
scribe-data total -wt nouns  # number of noun lexemes
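
As a rough illustration of how the three invocations above could be parsed, here is a minimal argparse sketch. The parser layout and the long option names (--language, --word-type) are assumptions made for this example; the actual command structure lives in cli/main.py and may differ.

```python
import argparse


def build_parser():
    # Hypothetical layout, not the project's actual parser.
    parser = argparse.ArgumentParser(prog="scribe-data")
    subparsers = parser.add_subparsers(dest="command")

    total = subparsers.add_parser("total", help="Count lexemes on Wikidata.")
    # Both filters are optional so that either one can be passed on its own.
    total.add_argument("-l", "--language", help="Language name, e.g. German.")
    total.add_argument("-wt", "--word-type", help="Word type, e.g. nouns.")
    return parser


args = build_parser().parse_args(["total", "-l", "German", "-wt", "nouns"])
print(args.command, args.language, args.word_type)
```

Leaving out -l or -wt simply leaves the corresponding attribute as None, which the command can then use to decide which filters to apply.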

The following Python code could be adapted for most of the functionality that we need, and the word_type argument should work as well :) From there, the result of this function would be returned to the user in a message that includes the given language and/or word types.

from SPARQLWrapper import JSON, SPARQLWrapper


def get_total_lexemes(language, word_type):
    # Both arguments are expected to be Wikidata QIDs (e.g. "Q188" for German),
    # since they are inserted after the "wd:" prefix below.
    endpoint_url = "https://query.wikidata.org/sparql"
    sparql = SPARQLWrapper(endpoint_url)

    # SPARQL query template. Literal braces are doubled because the
    # template is passed through str.format().
    query_language_template = """
    SELECT
        (COUNT(DISTINCT ?lexeme) as ?total)

    WHERE {{
      VALUES ?language {{ wd:{} }}
      ?lexeme dct:language ?language ;
              wikibase:lexicalCategory ?category .
    """

    filter_word_template = """
      FILTER(?category IN (wd:{}))
    """

    end_of_query = """
    }}
    """

    # Only restrict by lexical category when a word type is given.
    if word_type:
        query_language_template += filter_word_template

    query_language_template += end_of_query

    # Replace {} in the query template with the language and word type values.
    # The extra word_type argument is ignored when the filter was not added.
    query = query_language_template.format(language, word_type)

    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    return int(results["results"]["bindings"][0]["total"]["value"])
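
The usage examples also include a call with only a word type, which the function above does not cover because it always expects a language. One possible way to handle all three cases is to build the query from optional parts, sketched below. The function name build_total_query and its parameters are hypothetical, both arguments are assumed to be Wikidata QIDs, and the query has not been run against the endpoint here.

```python
def build_total_query(language_qid=None, word_type_qid=None):
    """Build a SPARQL count query from optional language and word type QIDs."""
    if not language_qid and not word_type_qid:
        raise ValueError("Pass a language QID, a word type QID, or both.")

    lines = [
        "SELECT (COUNT(DISTINCT ?lexeme) AS ?total)",
        "WHERE {",
    ]
    # Each filter is only added when the matching QID was provided.
    if language_qid:
        lines.append(f"  ?lexeme dct:language wd:{language_qid} .")
    if word_type_qid:
        lines.append(f"  ?lexeme wikibase:lexicalCategory wd:{word_type_qid} .")
    lines.append("}")
    return "\n".join(lines)


# Q188 is German and Q1084 is noun on Wikidata.
print(build_total_query("Q188", "Q1084"))
```

The resulting string could be passed to sparql.setQuery() in place of the formatted template, with the rest of the function unchanged.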

Contribution

@mhmohona will be working on this as a part of GSoC 2024! ☀️ Please write in here so I can assign, and let us know if there's anything we can do to support!

mhmohona commented 2 weeks ago

Thank you for the detailed explanation. 😄

andrewtavis commented 2 weeks ago

Very welcome! 🥳🥳

andrewtavis commented 2 weeks ago

One thing to note here, we should likely allow the user to pass either noun or nouns, etc. Just so it's easier :) Adding this to the issue 😊

andrewtavis commented 4 days ago

> One thing to note here, we should likely allow the user to pass either noun or nouns, etc. Just so it's easier :) Adding this to the issue 😊

We can use the following for this, @mhmohona: https://github.com/scribe-org/Scribe-Data/blob/11f4f94d28efa167a3ce61fa6229d597e800a833/src/scribe_data/cli/cli_utils.py#L46
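
Independently of what the linked helper does, the idea of accepting either noun or nouns can be sketched as a small normalization step. The mapping and the function name below are hypothetical and only illustrate the approach; the word types listed are examples, not the project's full list.

```python
# Hypothetical alias table: singular form -> canonical plural form.
WORD_TYPE_ALIASES = {
    "noun": "nouns",
    "verb": "verbs",
    "adjective": "adjectives",
}


def normalize_word_type(word_type):
    """Map inputs such as 'noun', 'Nouns' or ' nouns ' to one plural form."""
    cleaned = word_type.strip().lower()
    if cleaned in WORD_TYPE_ALIASES:
        return WORD_TYPE_ALIASES[cleaned]
    if cleaned in WORD_TYPE_ALIASES.values():
        return cleaned
    raise ValueError(f"Unknown word type: {word_type!r}")


print(normalize_word_type("noun"), normalize_word_type("Nouns"))
```

Normalizing once at the CLI boundary means the query code only ever has to deal with one spelling per word type.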

I think that working on this one would be a great next step, @mhmohona! This would give you a bit of Wikidata experience as well :) I'll add in the files and the section for the CLI now!

andrewtavis commented 4 days ago

3736222 adds in the basics for this, @mhmohona :) The work for this command can be in cli/total.py, and the command structure has already been added into cli/main.py. I think we should be able to work with the Python code in the issue text and SPARQLWrapper to make this work. Happy to discuss further!
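
For the user-facing part described in the issue text (a message that includes the given language and/or word type), one possible shape is sketched below. The function name and the message wording are assumptions for illustration, and the number in the example call is a placeholder rather than a real count.

```python
def format_total_message(total, language=None, word_type=None):
    """Describe a lexeme total using whichever filters were given."""
    filters = []
    if language:
        filters.append(f"language={language}")
    if word_type:
        filters.append(f"word type={word_type}")
    scope = ", ".join(filters) if filters else "no filters"
    return f"Total lexemes ({scope}): {total:,}"


# 1234 is a placeholder value, not an actual Wikidata result.
print(format_total_message(1234, language="German", word_type="nouns"))
```

Keeping the formatting separate from the query function would let cli/total.py reuse the count elsewhere without parsing it back out of a string.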