Closed DeleMike closed 1 month ago
The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!
[x] The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution (they can check with `git config user.email` in their local Scribe-Data repo)
[x] The linting and formatting workflows within the PR checks do not indicate new errors in the files changed
[x] The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)
Hi @DeleMike, Great job on this! It's impressive how you've come up with this solution.
Just a couple of things to consider:
- We might need to revisit total.py later, as there could be some changes related to this PR.
- Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?
Let me know your thoughts when you get a chance. :)
I was browsing through the issues and found one that might be related to this. https://github.com/scribe-org/Scribe-Data/issues/293
> Hi @DeleMike, Great job on this! It's impressive how you've come up with this solution.
> Just a couple of things to consider:
> - We might need to revisit total.py later, as there could be some changes related to this PR.
> - Do we think it's safe to allow users to update the language_metadata? Specifically, does this handle cases where a user might try to update an existing language (e.g., English) with an incorrect QID?
> Let me know your thoughts when you get a chance. :)
Yes, we might need to revisit it after it has been merged. Thanks @catreedle!
Concerning the ability to allow users (i.e., the developers) to update the language metadata: it's a good option IMHO. In Flutter and other ecosystems, there's often a need to update your dependencies or refetch them (`flutter pub get`) just to make sure they are cached properly and available for use. However, I agree with your concern: what if the user updates, say, "English" with the wrong QID? How do we tackle that?
I will think of ways soon and drop them in this discussion.
> I was browsing through the issues and found one that might be related to this. #293
Hmm... I did not see that. It is closely related, right? @andrewtavis, what can we do about this?
One thing: this issue originated when Scribe-Data was returning some kind of default values for languages it could not find in the `language_metadata.json` file. Hence, I came up with this initial fix, which led to the procedure of updating the metadata:
```python
# When we want to run `scribe-data t -l Latin` and Latin doesn't exist in the
# metadata, we cannot provide a QID for it.
language_qid = get_qid_by_input(language)
data_type_qid = get_qid_by_input(data_type)

if not language_qid:
    print(
        "The specified language does not exist. Please update your language_metadata.json file by using:\n"
        "`scribe-data update --metadata`\n"
        "Alternatively, you can manually set it with:\n"
        "`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.\n\n"
        "This will ensure that you can fetch the correct data."
    )
    return
```
We will basically inform the user that the specified language does not exist and they should update the metadata file. We also terminate the whole process early with the return keyword so that we exit from the procedure gracefully after telling the user what to do.
Output of running `scribe-data t -l Latin` if the language does not exist in the metadata:
```
The specified language does not exist. Please update your language_metadata.json file by using:
`scribe-data update --metadata`
Alternatively, you can manually set it with:
`scribe-data set-metadata -lang [your_language] -qid [your_qid]`.

This will ensure that you can fetch the correct data.
```
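For reference, a minimal sketch of how a lookup like `get_qid_by_input` might work against a metadata mapping. The nested JSON layout here is an illustrative assumption, not the actual Scribe-Data schema:

```python
import json

# Hypothetical metadata snippet; the real language_metadata.json schema may differ.
METADATA = json.loads(
    """
    {
        "english": {"qid": "Q1860"},
        "french": {"qid": "Q150"}
    }
    """
)


def get_qid_by_input(input_str):
    """Return the Wikidata QID for a known language, or None if it's unknown."""
    if not input_str:
        return None
    entry = METADATA.get(input_str.strip().lower())
    return entry["qid"] if entry else None
```

With this, `get_qid_by_input("Latin")` returns `None`, which is what triggers the early-return message above.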
Ok @DeleMike, I've had a bit of time to look this over. The big thing is that this functionality is useful now, but it shouldn't be in the future, as we shouldn't be releasing Scribe-Data or merging in language querying files if the language isn't in the metadata file. @OmarAI2003 will be expanding the language metadata file in #293, so then this should be fine. I agree with @catreedle that we don't necessarily want the user to have control over these internal files (edit: I'm unsure on this now).
There is definitely some value here though! Specifically, maybe we can change this so that it does the following:
- Create `/src/scribe_data/check_language_metadata.py`, which would include some of the functionality you have here to check the `language_data_extraction` directory to see if the languages in the directory are included in the metadata file
- Add a GitHub Actions workflow that runs `check_language_metadata.py` and throws an error if the metadata file hasn't been updated for the `language_data_extraction` directory

How does this sound?
Edit: there's even a more expansive suggestion for how to use this code here. Really interesting possibilities!
Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on with @DeleMike taking the lead: We could repurpose this code as well to have another check of the actual queries. This idea came from `extract_qid_from_sparql`. There obviously is a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:

- Is the language QID the one that it's supposed to be given the directory
- Is the data type the one it's supposed to be given the directory

We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing `?lexeme dct:language wd:Q12345` (we'd need to continue to write like this, but it's fine). From there, for the data type check we can just check to see if any of the QIDs for the other data types are in the query, i.e. "Hey, this is an adjectives query. Why is the QID for nouns in here???".

Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work!
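To illustrate those two query checks, here's a sketch of what the parsing could look like. The regexes follow the `?lexeme dct:language wd:Q12345` form quoted above, and these functions are illustrative, not the existing `extract_qid_from_sparql`:

```python
import re


def extract_language_qid(query):
    """Pull the language QID out of a query's `dct:language` triple."""
    match = re.search(r"dct:language\s+wd:(Q\d+)", query)
    return match.group(1) if match else None


def find_foreign_qids(query, other_data_type_qids):
    """Return any QIDs for other data types that appear in the query, e.g. a
    noun QID inside an adjectives query, which suggests a copy/paste error."""
    qids_in_query = set(re.findall(r"wd:(Q\d+)", query))
    return sorted(qids_in_query & set(other_data_type_qids))
```

A workflow could run both checks over every query file, using the file's directory to decide which language and data type QIDs are expected.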
> Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on and maybe @DeleMike can take the lead: We could repurpose this code as well to have another check of the actual queries. This idea came from `extract_qid_from_sparql`. There obviously is a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:
>
> - Is the language QID the one that it's supposed to be given the directory
> - Is the data type the one it's supposed to be given the directory
>
> We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing `?lexeme dct:language wd:Q12345` (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query, i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".
>
> Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work!
This is very thoughtful. I'm willing to work/collaborate on it. :)
Hey @andrewtavis and @catreedle, thanks folks for the review. So helpful! I am going through everything as I type.
I just wanted to drop my initial thoughts. From all I have read, we can generate a set of new issues from the work in this PR. I will be dropping my comments!
I think we should definitely get this down to a point where we can merge and use the code from it as example code within the issues we make. Let me know if I should make the issues and tag you both!
> Specifically of interest to me now though, @DeleMike and @catreedle, and maybe something that you two would like to take on and maybe @DeleMike can take the lead: We could repurpose this code as well to have another check of the actual queries. This idea came from `extract_qid_from_sparql`. There obviously is a lot of copying and pasting going on, and I've found some errors in the PRs, but maybe some slipped through. We should also have a GitHub Actions workflow that will check all queries for the following:
>
> - Is the language QID the one that it's supposed to be given the directory
> - Is the data type the one it's supposed to be given the directory
>
> We'd be able to use the language and data type metadata files for this. For the first test we're seeing if the first QID is the correct language via parsing `?lexeme dct:language wd:Q12345` (we'd need to continue to write like this, but it's fine). From there for the data type check we can just check to see if any of the QIDs for the other data types are in the query, i.e. "Hey this is an adjectives query. Why is the QID for nouns in here???".
>
> Let me know how the above sounds! I think that the above serves to make use of the code here in intuitive ways so the efforts aren't wasted, and we get a great PR testing setup that checks the metadata files and queries to make sure that we're not introducing things that don't work!
Lovely! I would love to improve this code.
From the quote, I understand that we can repurpose this code, following the idea from `extract_qid_from_sparql`, to check the accuracy of queries.

> I think we should definitely get this down to a point where we can merge and use the code from it as example codes within the issues we make. Let me know if I should make the issues and tag you both!
Yes, please! We would love to work on the issues created! :)
Ok, let me know how you want to proceed here, @DeleMike :) I'm happy to clean this PR up with the code we want for now, and then also make the issues at the same time using the current functions as snippets. Let me know if you'd like to make any changes before, so I'll wait for a confirmation here and then finalize this one.
Yeah, true! For now, I'd rather you help point us in the direction we should go. That is, any extra addition I'd make at this moment would stem from your suggestions, so I'd be happy if you could adjust the PR now and then make issues from it. The main reason is so that we don't create future complications. How about this?
Or is there something I could quickly do so that it does not affect our codebase? The docstring format issue?
I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!
I'll do a quick fix of this one soon, then make issues for the workflows!
> I can get to the docstrings really quick, @DeleMike :) No stress at all. Just wanted to note it for the next time!
Oh okay! Noted.
> I'll do a quick fix of this one soon, then make issues for the workflows!
I'll be waiting for them :)
Thanks @andrewtavis for the support!
And thank you @catreedle. Let's get ready!!
alright, checking...
I decided against doing set-metadata as well, @DeleMike, as I think we need to focus on making sure that the CLI options are for the end users and not for the development team. Thanks so much for these suggestions though! I think that the workflows are going to improve the project so much, and all of this was the catalyst!
That's okay :-) And you are welcome!
Contributor checklist
Description
This PR enhances the Scribe-Data CLI by validating user input for languages and data types. If a non-existent language is specified, the user will receive a clear message informing them that the language does not exist, along with instructions on how to update their `language_metadata.json` file using `scribe-data update --metadata`.
Alternatively, users can manually set the metadata with `scribe-data set-metadata -lang [your_language] -qid [your_qid]`.
These enhancements aim to improve user experience and ensure accurate data retrieval in the Scribe-Data CLI.
Related issue
This issue is closely related to #295