scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0
30 stars, 69 forks

Refine CLI User Experience by Validating Input Languages and Data Types #328

Closed DeleMike closed 1 month ago

DeleMike commented 1 month ago

Terms

Behavior

Summary

When users provide a non-existent language or data type to the total command, the system incorrectly returns a number of lexemes. This leads to confusion and undermines the user experience.

Steps to Reproduce

  1. Run the total command with an invalid language or data type.
  2. Observe that the command returns a number rather than indicating that the language or data type does not exist.

Example

Initially, when we ran scribe-data t -lang Latin, we got:

lang filter =  ?lexeme dct:language ?language . # added this while trying to debug
Language: Latin # and this...
Total number of lexemes: 1344820

After adding some print statements, we saw that language_filter was not being updated with the language parameter: ?language stays an unbound variable, so the query is not restricted to Latin and the total is wrong.

Compare this with French. Running scribe-data t -lang French gave:

Lang filter =  ?lexeme dct:language wd:Q150 .
Language: French
Total number of lexemes: 19746

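This shows inconsistent behaviour: French is resolved to a QID (wd:Q150) in the filter, while Latin is left as the unbound ?language variable and still produces a total.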
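To make the difference between the two outputs concrete, here is a minimal sketch of how a filter builder could end up in this state. This is a hypothetical reconstruction, not the actual Scribe-Data code: the function name build_language_filter and the LANGUAGE_QIDS mapping are assumptions for illustration, and only the French QID (Q150) comes from the output above.

```python
# Hypothetical reconstruction of the behaviour seen in the debug output.
# LANGUAGE_QIDS and build_language_filter are illustrative names, not
# identifiers from the Scribe-Data codebase.
LANGUAGE_QIDS = {"french": "Q150"}  # Q150 is taken from the French output


def build_language_filter(language: str) -> str:
    """Return the SPARQL triple pattern used to restrict lexemes by language."""
    qid = LANGUAGE_QIDS.get(language.lower())
    if qid is None:
        # No QID was found, so the variable is never replaced. An unbound
        # ?language matches lexemes in every language.
        return "?lexeme dct:language ?language ."
    return f"?lexeme dct:language wd:{qid} ."


print(build_language_filter("French"))
print(build_language_filter("Latin"))
```

If the real code follows a similar pattern, the unknown language silently falls through to the unfiltered query, which would explain why Latin reports a much larger total than French.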

Expected Behavior

The command should validate the provided language and data type. If either does not exist, the system should exit gracefully without executing the query and tell the user what they can do to resolve it.

Root Cause

The current implementation lacks validation checks for the existence of input languages and data types in the metadata files. Specifically, the language_metadata.json file plays a crucial role in this issue. It serves as the authoritative source for valid languages and their corresponding QIDs. When a user inputs a language or data type that is not present in this file, the CLI does not recognize it as invalid and proceeds with the query. This oversight results in misleading output and a poor user experience.
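As a rough sketch of the kind of check that is missing, the snippet below validates a language against the metadata before any query is built. Everything here is an assumption for illustration: LANGUAGE_METADATA stands in for the parsed contents of language_metadata.json (whose real layout may differ), and validate_language is a hypothetical helper, not an existing Scribe-Data function.

```python
import difflib

# Assumed stand-in for the parsed language_metadata.json; the real file's
# structure may differ. Only the French QID comes from the output above.
LANGUAGE_METADATA = {"french": {"qid": "Q150"}}


def validate_language(language: str, metadata: dict = LANGUAGE_METADATA) -> str:
    """Return the QID for a known language, or raise ValueError with a hint."""
    key = language.strip().lower()
    if key in metadata:
        return metadata[key]["qid"]

    # Offer the closest known language name, if there is a reasonable match.
    suggestions = difflib.get_close_matches(key, list(metadata), n=1)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(
        f"Language '{language}' is not in the language metadata.{hint}"
    )
```

The CLI could call a check like this before running the total query, catch the ValueError, and print the message instead of a lexeme count. The same idea would apply to data types, using whichever metadata file lists them.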

Proposed Solution

Related Issues

This issue is closely related to #295, as both concern the CLI.

Contribution

I would love to work on implementing this improvement and collaborate with others on it.

DeleMike commented 1 month ago

Hi @andrewtavis, I found a bug while trying to get total lexemes and I saw that it was connected to the language_metadata.json file. I have worked on an initial fix (a PR), which I will soon drop so that you can see my reasoning on how I propose we fix it.

Can you assign this issue to me?

DeleMike commented 1 month ago

@catreedle, after our long talk about this issue yesterday, I have created the issue and raised an initial PR here