Implement CLI `--total` functionality

andrewtavis commented 2 weeks ago

Terms

[X] I have searched open and closed feature requests
[X] I agree to follow Scribe-Data's Code of Conduct

Description

This issue would implement the --total (-t) functionality of the Scribe-Data CLI. This functionality would check Wikidata for the total of certain groupings of languages and word types. Usage of this would be:

scribe-data total -l German -wt nouns  # number of German noun lexemes
scribe-data total -l German  # number of German lexemes
scribe-data total -wt nouns  # number of noun lexemes
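
A minimal sketch of how the usage above could be parsed with `argparse`, in case it helps as a starting point. The `-l` and `-wt` flags come from the examples; the long option names (`--language`, `--word-type`) and the `build_parser` helper are assumptions for illustration and may not match what is in cli/main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical `total` subcommand."""
    parser = argparse.ArgumentParser(prog="scribe-data")
    subparsers = parser.add_subparsers(dest="command")

    total = subparsers.add_parser("total", help="Count Wikidata lexemes.")
    # Both arguments are optional so that either one can be passed on its own.
    total.add_argument("-l", "--language", help="Language to count lexemes for.")
    total.add_argument("-wt", "--word-type", help="Word type to count lexemes for.")
    return parser


args = build_parser().parse_args(["total", "-l", "German", "-wt", "nouns"])
print(args.command, args.language, args.word_type)  # total German nouns
```

Leaving both flags optional keeps all three usage examples valid with a single subcommand.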

Note: it would be good to allow the user to pass nouns or noun, etc, in order to avoid unneeded errors :)
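
One way the noun/nouns handling might look, as a rough sketch. The `WORD_TYPES` set and the `normalize_word_type` name are hypothetical and only here for illustration; the real list of supported word types would come from the project.

```python
# Hypothetical set of supported word types, stored in plural form.
WORD_TYPES = {"nouns", "verbs", "adjectives", "adverbs", "prepositions"}


def normalize_word_type(word_type: str) -> str:
    """Return the plural form of a word type, accepting singular or plural input."""
    candidate = word_type.strip().lower()
    if candidate in WORD_TYPES:
        return candidate
    if candidate + "s" in WORD_TYPES:
        return candidate + "s"
    raise ValueError(f"Unsupported word type: {word_type!r}")


print(normalize_word_type("noun"))   # nouns
print(normalize_word_type("Nouns"))  # nouns
```

Checking against a known set, rather than blindly appending an "s", means unsupported input still gives the user a clear error.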

The following Python code could be adapted for most of the functionality that we need for this, including making sure that the word_type argument also works :) From there, the result of this function would be returned to the user with a message that includes the given language and/or word types.

from SPARQLWrapper import SPARQLWrapper, JSON


def get_total_lexemes(language, word_type):
    # language and word_type are Wikidata QIDs, as they are placed after the wd: prefix.
    endpoint_url = "https://query.wikidata.org/sparql"
    sparql = SPARQLWrapper(endpoint_url)

    # SPARQL query template. Doubled braces are literal braces for str.format.
    query_language_template = """
    SELECT
        (COUNT(DISTINCT ?lexeme) as ?total)

    WHERE {{
      VALUES ?language {{ wd:{language} }}
      ?lexeme dct:language ?language ;
              wikibase:lexicalCategory ?category .
    """

    filter_word_template = """
      FILTER(?category IN ( wd:{word_type} ))
    """

    end_of_query = """
    }}
    """

    # Only filter by lexical category if a word type was given.
    if word_type:
        query_language_template += filter_word_template

    query_language_template += end_of_query

    # Fill in the language and, if the filter was added, the word type.
    query = query_language_template.format(language=language, word_type=word_type)

    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    return int(results["results"]["bindings"][0]["total"]["value"])

Note: the above function also needs to be able to accept lists, so it should be languages and word_types :)
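
For the list case, a rough sketch of how the query text could be built from several QIDs at once. The `build_total_query` name and its parameters are assumptions for illustration, and it swaps the single-value template for `VALUES` clauses, which is one possible approach rather than a decided design. It only builds the query string, so it can be checked without calling Wikidata.

```python
def build_total_query(languages=None, word_types=None):
    """Build a SPARQL count query for lists of language and lexical category QIDs."""
    lines = [
        "SELECT (COUNT(DISTINCT ?lexeme) AS ?total)",
        "WHERE {",
        "  ?lexeme dct:language ?language ;",
        "          wikibase:lexicalCategory ?category .",
    ]
    # Each VALUES clause restricts a variable to the given QIDs.
    if languages:
        qids = " ".join(f"wd:{qid}" for qid in languages)
        lines.append(f"  VALUES ?language {{ {qids} }}")
    if word_types:
        qids = " ".join(f"wd:{qid}" for qid in word_types)
        lines.append(f"  VALUES ?category {{ {qids} }}")
    lines.append("}")
    return "\n".join(lines)


# Q188 is German, Q1860 is English, and Q1084 is noun on Wikidata.
print(build_total_query(languages=["Q188", "Q1860"], word_types=["Q1084"]))
```

Note that a query built this way returns one combined total across everything passed in. If per-language or per-word-type totals are wanted, the query would need a `GROUP BY` or one call per combination.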

Contribution

@mhmohona will be working on this as a part of GSoC 2024! ☀️ Please write in here so I can assign, and let us know if there's anything we can do to support!

mhmohona commented 2 weeks ago

Thank you for the detailed explanation. 😄

andrewtavis commented 2 weeks ago

Very welcome! 🥳🥳

andrewtavis commented 2 weeks ago

One thing to note here, we should likely allow the user to pass either noun or nouns, etc. Just so it's easier :) Adding this to the issue 😊

andrewtavis commented 4 days ago

> One thing to note here, we should likely allow the user to pass either noun or nouns, etc. Just so it's easier :) Adding this to the issue 😊

We can use the following for this, @mhmohona: https://github.com/scribe-org/Scribe-Data/blob/11f4f94d28efa167a3ce61fa6229d597e800a833/src/scribe_data/cli/cli_utils.py#L46

I think that working on this one would be a great next step, @mhmohona! This would give you a bit of Wikidata experience as well :) I'll add in the files and the section for the CLI now!

andrewtavis commented 4 days ago

3736222 adds in the basics for this, @mhmohona :) The work for this command can be in cli/total.py, and the command structure has already been added into cli/main.py. I think we should be able to work with the Python code in the issue text and SPARQLWrapper to make this work. Happy to discuss further!
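
As a small sketch of the last step from the issue text, this is how the count might be reported back to the user with the given language and/or word type. The `format_total_message` name and the wording of the message are assumptions, and the number in the example is a placeholder rather than a real count.

```python
def format_total_message(total, language=None, word_type=None):
    """Format a lexeme total together with whichever filters were given."""
    lines = []
    if language:
        lines.append(f"Language: {language}")
    if word_type:
        lines.append(f"Word type: {word_type}")
    lines.append(f"Total number of lexemes: {total:,}")
    return "\n".join(lines)


# 1234 is a placeholder value, not an actual Wikidata result.
print(format_total_message(1234, language="German", word_type="nouns"))
```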
