scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Implement CLI `--total` functionality #162

Closed mhmohona closed 3 months ago

mhmohona commented 4 months ago

Contributor checklist


Description

Implemented the --total (-t) functionality, which checks Wikidata for the total number of lexemes for given groupings of languages and word types.
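A minimal sketch of how such a total could be retrieved, assuming SPARQLWrapper and the public Wikidata Query Service endpoint (the exact query and QID mapping used in this PR may differ):

```python
# Hedged sketch: count lexemes for a given language and lexical category.
# The QIDs below (Q188 = German, Q24905 = verb) are illustrative examples,
# not necessarily the mapping used by Scribe-Data.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

query = """
SELECT (COUNT(?lexeme) AS ?total) WHERE {
  ?lexeme dct:language wd:Q188 ;
          wikibase:lexicalCategory wd:Q24905 .
}
"""

sparql.setQuery(query)
results = sparql.query().convert()
total = results["results"]["bindings"][0]["total"]["value"]
print(f"Total number of lexemes: {total}")
```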


Related issue

Fixes - #147

github-actions[bot] commented 4 months ago

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

mhmohona commented 4 months ago

It still has a problem: it is recognizing the data type as a language, as shown in the attached screenshot.

andrewtavis commented 4 months ago

Any thoughts on what's causing it, @mhmohona? :)

andrewtavis commented 4 months ago

Minor comments so far:

  • Let's include the language and the data type in the output
  • From @wkyoshida, something like:

Language: German
Data type: Verbs
Total number of lexemes: 999

  • Let's move data_type_to_qid to the data_type_metadata.json file
  • Let's reference the language_metadata.json file for the language_to_qid information

We'll be good to go after all this! 🥳
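A minimal sketch of the suggested output format, assuming the total comes back from a count query like the one sketched above (the function and variable names here are illustrative, not the PR's actual identifiers):

```python
# Hedged sketch of the suggested output lines; `language`, `data_type` and
# `total_lexemes` are illustrative names, not necessarily those used in the PR.
def print_total(language: str, data_type: str, total_lexemes: int) -> None:
    print(f"Language: {language}")
    print(f"Data type: {data_type}")
    print(f"Total number of lexemes: {total_lexemes}")

print_total("German", "Verbs", 999)
```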

andrewtavis commented 4 months ago

Quick check here @mhmohona, are you planning on getting to the changes we mentioned above?

wkyoshida commented 4 months ago

It still has a problem: it is recognizing the data type as a language, as shown in the attached screenshot.

Doesn't this happen because -l German is passed first? It could just be that the first argument is recognized and used, while the second is passed but disregarded.

Looks like there is a choices parameter that we can pass to add_argument() to specify the allowable options for an argument. Maybe we should look into whether specifying this could make sense as the mechanism for controlling valid/invalid inputs?
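A minimal sketch of that choices mechanism, assuming argparse is used for the CLI (the option names mirror the PR discussion; the value lists and parser structure are assumptions, not Scribe-Data's actual parser):

```python
import argparse

# Hedged sketch: restrict -l/--language and -dt/--data-type to known values so
# argparse itself rejects invalid input. The allowed values are illustrative.
parser = argparse.ArgumentParser(description="Get totals of Wikidata lexemes.")
parser.add_argument(
    "-l", "--language", choices=["english", "german", "spanish"],
    help="The language to check totals for.",
)
parser.add_argument(
    "-dt", "--data-type", choices=["nouns", "verbs", "prepositions"],
    help="The word type to check totals for.",
)

args = parser.parse_args(["-l", "german", "-dt", "verbs"])
print(args.language, args.data_type)
```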

mhmohona commented 4 months ago

Currently stuck on this comment 😢

Let's move data_type_to_qid to the data_type_metadata.json file

andrewtavis commented 4 months ago

By this I mean let's move the functionality of data_type_to_qid to the data_type_metadata.json file such that we just import the data at the top of the file rather than using a custom function :)
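A minimal sketch of that approach, assuming a data_type_metadata.json file that maps word types to Wikidata lexical-category QIDs (the file path, key names, and layout are assumptions about the repository; the QIDs mentioned in the comments, e.g. Q1084 for noun and Q24905 for verb, are the standard Wikidata items but may not be the exact keys Scribe-Data uses):

```python
import json
from pathlib import Path

# Hedged sketch: load the data type -> QID mapping once at module import time
# instead of computing it in a custom helper function. The path is assumed.
_metadata_path = Path(__file__).parent / "data_type_metadata.json"
with _metadata_path.open(encoding="utf-8") as f:
    data_type_metadata = json.load(f)

# Example contents of data_type_metadata.json (illustrative):
# {
#   "nouns": "Q1084",
#   "verbs": "Q24905"
# }

def get_qid(data_type: str) -> str:
    """Look up the Wikidata QID for a given data type."""
    return data_type_metadata[data_type]
```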

mhmohona commented 4 months ago

It still has a problem: it is recognizing the data type as a language, as shown in the attached screenshot.

Doesn't this happen because -l German is passed first? It could just be that the first argument is recognized and used, while the second is passed but disregarded.

Looks like there is a choices parameter that we can pass to add_argument() to specify the allowable options for an argument. Maybe we should look into whether specifying this could make sense as the mechanism for controlling valid/invalid inputs?

@wkyoshida, thank you for looking into it. I have solved this problem.

mhmohona commented 4 months ago

By this I mean let's move the functionality of data_type_to_qid to the data_type_metadata.json file such that we just import the data at the top of the file rather than using a custom function :)

I need help with the QIDs :( I'm unable to find the correct ones. It would be super helpful if you could update the data_type_metadata.json file with the QIDs, @andrewtavis.

mhmohona commented 4 months ago


Minor comments so far:

  • Let's include the language and the data type in the output
  • From @wkyoshida, something like:
Language: German
Data type: Verbs
Total number of lexemes: 999
  • Let's move data_type_to_qid to the data_type_metadata.json file
  • Let's reference the language_metadata.json file for the language_to_qid information

We'll be good to go after all this! 🥳

I have addressed the 1st and 3rd points of feedback from here.
