Closed DeleMike closed 1 month ago
The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!
[x] The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution (they can check with `git config user.email` in their local Scribe-Data repo)
[x] The linting and formatting workflows within the PR checks do not indicate new errors in the files changed
[x] The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)
Hi @DeleMike, Great job on this! It's impressive how you've come up with this solution.
Just a couple of things to consider:
- We might need to revisit total.py later, as there could be some changes related to this PR.
- Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?
Let me know your thoughts when you get a chance. :)
I was browsing through the issues and found one that might be related to this. https://github.com/scribe-org/Scribe-Data/issues/293
> Hi @DeleMike, Great job on this! It's impressive how you've come up with this solution.
> Just a couple of things to consider:
> - We might need to revisit total.py later, as there could be some changes related to this PR.
> - Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?
> Let me know your thoughts when you get a chance. :)
Yes, we might need to revisit it after it has been merged. Thanks @catreedle!
Concerning the ability to allow users (i.e., the developers) to update the language metadata: it's a good option IMHO. In Flutter and other ecosystems, there's often a need to update your dependencies or refetch them (`flutter pub get`) just to make sure they are cached properly and available for use. However, I agree with your concern: what if the user updates, say, "English" with the wrong QID? How do we tackle that?
I will think of ways soon and drop them in this discussion.
> I was browsing through the issues and found one that might be related to this. #293
Hmm... I did not see that. It is closely related, right? @andrewtavis, what can we do about this?
One thing: this issue originated when Scribe-Data was returning some kind of default values for languages it could not find in the `language_metadata.json` file. Hence, I came up with this initial fix, which led to the procedure of updating the metadata:
```python
# When we want to run `scribe-data t -l Latin` and Latin doesn't exist in the
# metadata, we cannot provide a QID for it.
language_qid = get_qid_by_input(language)
data_type_qid = get_qid_by_input(data_type)

if not language_qid:
    print(
        "The specified language does not exist. Please update your language_metadata.json file by using:\n"
        "`scribe-data update --metadata`\n"
        "Alternatively, you can manually set it with:\n"
        "`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.\n\n"
        "This will ensure that you can fetch the correct data."
    )
    return
```
We will basically inform the user that the specified language does not exist and they should update the metadata file. We also terminate the whole process early with the return keyword so that we exit from the procedure gracefully after telling the user what to do.
Output of running `scribe-data t -l Latin` if the language does not exist in the metadata:
```
The specified language does not exist. Please update your language_metadata.json file by using:
`scribe-data update --metadata`
Alternatively, you can manually set it with:
`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.

This will ensure that you can fetch the correct data.
```
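For reference, a minimal sketch of how a lookup like `get_qid_by_input` might work against a metadata mapping. The nested JSON layout here is an illustrative assumption, not the actual Scribe-Data schema:

```python
import json

# Hypothetical metadata snippet; the real language_metadata.json schema may differ.
METADATA = json.loads(
    """
    {
        "english": {"qid": "Q1860"},
        "french": {"qid": "Q150"}
    }
    """
)


def get_qid_by_input(input_str):
    """Return the Wikidata QID for a known language, or None if it's unknown."""
    if not input_str:
        return None
    entry = METADATA.get(input_str.strip().lower())
    return entry["qid"] if entry else None
```

With this, `get_qid_by_input("Latin")` returns `None`, which is what triggers the early-return message above.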
Ok @DeleMike, I've had a bit of time to look this over. The big thing is that this functionality is useful now, but it shouldn't be in the future, as we shouldn't be releasing Scribe-Data or merging in language querying files if the language isn't in the metadata file. @OmarAI2003 will be expanding the language metadata file in #293, so then this should be fine. I agree with @catreedle that we don't necessarily want the user to have control over these internal files (edit: I'm unsure on this now).
There is definitely some value here though! Specifically, maybe we can change this so that it does the following:
- Create `/src/scribe_data/check_language_metadata.py`, which would include some of the functionality you have here to check the `language_data_extraction` directory to see if the languages in the directory are included in the metadata file
- Add a GitHub Actions workflow that runs `check_language_metadata.py` and throws an error if the metadata file hasn't been updated for the `language_data_extraction` directory

How does this sound?
Edit: there's even a more expansive suggestion for how to use this code here. Really interesting possibilities!
Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on with @DeleMike taking the lead: We could repurpose this code as well to have another check of the actual queries. This idea came from `extract_qid_from_sparql`. There obviously is a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

- Is the language QID the one that it's supposed to be given the directory
- Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing `?lexeme dct:language wd:Q12345` (we'd need to continue to write like this, but it's fine). From there, for the data type check we can just check to see if any of the QIDs for the other data types are in the query, i.e. "Hey, this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work!
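To illustrate those two query checks, here's a sketch of what the parsing could look like. The regexes follow the `?lexeme dct:language wd:Q12345` form quoted above, and these functions are illustrative, not the existing `extract_qid_from_sparql`:

```python
import re


def extract_language_qid(query):
    """Pull the language QID out of a query's `dct:language` triple."""
    match = re.search(r"dct:language\s+wd:(Q\d+)", query)
    return match.group(1) if match else None


def find_foreign_qids(query, other_data_type_qids):
    """Return any QIDs for other data types that appear in the query, e.g. a
    noun QID inside an adjectives query, which suggests a copy/paste error."""
    qids_in_query = set(re.findall(r"wd:(Q\d+)", query))
    return sorted(qids_in_query & set(other_data_type_qids))
```

A workflow could run both checks over every query file, using the file's directory to decide which language and data type QIDs are expected.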
> Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on and maybe @DeleMike can take the lead: We could repurpose this code as well to have another check of the actual queries. This idea came from `extract_qid_from_sparql`. There obviously is a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:
>
> - Is the language QID the one that it's supposed to be given the directory
> - Is the data type the one it's supposed to be given the directory
>
> We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing `?lexeme dct:language wd:Q12345` (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query, i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".
>
> Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work!
This is very thoughtful. I'm willing to work/collaborate on it. :)
Hey @andrewtavis and @catreedle, thanks folks for the review. So helpful! I am going through everything as I type.
I just wanted to drop my initial thoughts. From all I have read, we can generate a set of new issues from the work in this PR. I will be dropping my comments!
I think we should definitely get this down to a point where we can merge and use the code from it as example code within the issues we make. Let me know if I should make the issues and tag you both!
> Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on and maybe @DeleMike can take the lead: We could repurpose this code as well to have another check of the actual queries. This idea came from `extract_qid_from_sparql`. There obviously is a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:
>
> - Is the language QID the one that it's supposed to be given the directory
> - Is the data type the one it's supposed to be given the directory
>
> We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing `?lexeme dct:language wd:Q12345` (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query, i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".
>
> Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work!
Lovely! I would love to improve this code.
From the quote, I understand that we can repurpose this code, following the idea from `extract_qid_from_sparql`, to check the accuracy of queries.

> I think we should definitely get this down to a point where we can merge and use the code from it as example codes within the issues we make. Let me know if I should make the issues and tag you both!
Yes, please! We would love to work on the issues created! :)
Ok, let me know how you want to proceed here, @DeleMike :) I'm happy to clean this PR up with the code we want for now, and then also make the issues at the same time using the current functions as snippets. Let me know if you'd like to make any changes before, so I'll wait for a confirmation here and then finalize this one.
Yeah, true! For now, I'd rather you help point us in the direction we should go. That is, any extra addition I'd make at this moment would stem from your suggestions, so I'd be happy if you could adjust the PR now and then make issues from it. The main reason is so that we don't create future complications. How about this?
Or is there something I could quickly do so that it does not affect our codebase? The docstring format issue?
I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!
I'll do a quick fix of this one soon, then make issues for the workflows!
> I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!
Oh okay! Noted.
> I'll do a quick fix of this one soon, then make issues for the workflows!
I'll be waiting for them :)
Thanks @andrewtavis for the support!
And thank you @catreedle. Let's get ready!!
alright, checking...
I decided against doing set-metadata as well, @DeleMike, as I think we need to focus on making sure that the CLI options are for the end users and not for the development team. Thanks so much for these suggestions though! I think that the workflows are going to improve the project so much, and all of this was the catalyst!
That's okay :-) And you are welcome!
Contributor checklist
Description
This PR enhances the Scribe-Data CLI by validating user input for languages and data types. If a non-existent language is specified, the user will receive a clear message informing them that the language does not exist, along with instructions on how to update their `language_metadata.json` file using `scribe-data update --metadata`.
Alternatively, users can manually set the metadata with `scribe-data set-metadata -lang [your_language] -qid [your_qid]`.
These enhancements aim to improve user experience and ensure accurate data retrieval in the Scribe-Data CLI.
Related issue
This issue is closely related to #295