Hi @andrewtavis, @DeleMike and @catreedle, can I also contribute to this issue?
Thanks @andrewtavis. I would love to be assigned to this issue. I'll get started on it soon :)
Let's definitely let @DeleMike and @catreedle do the PRs for here and #340, @KesharwaniArpita, as @DeleMike was the writer of the snippets and @catreedle did the initial reviews :) I'll let them say if they want support here, but maybe you could do #341?
I was thinking, @catreedle , do you think we could work together on this issue?
Could we break this into two PRs, one for checking language appropriateness and the other for data type appropriateness? Is this okay, @andrewtavis?
I was thinking that #341 could be suited for @KesharwaniArpita? It might be easier since @KesharwaniArpita was not in our initial discussions, and that issue seems self-explanatory. What do you think, @catreedle?
How about this @andrewtavis ?
Ok, I get it. Thanks for telling me about the discussion ☺️
> Let's definitely let @DeleMike and @catreedle do the PRs for here and #340, @KesharwaniArpita ... but maybe you could do #341?

Ah yes! I did not see this!! You are right. We'll wait for feedback from @catreedle if she's comfortable with this :)
Assigning @KesharwaniArpita insofar as it'd be great if you all would discuss the implementation together, but as @KesharwaniArpita's on #341, my assumption is that the coding for this and #340 will be done by @DeleMike and @catreedle 😊
I'll happily be the learner here!!! 😄😄 Thanks for considering me!!!!
Good idea, @DeleMike. :) Which one are you working on? I'll take the other.
Thank you for the learning opportunity, @andrewtavis @DeleMike @KesharwaniArpita 🙏
Great, @catreedle! How about this? Let's work on two PRs that will resolve this issue. Here is what I propose:

**PR 1: `check_query_identifiers.yaml`**

The first PR will be about setting up the GitHub Actions workflow file (`check_query_identifiers.yaml`).
Key tasks:
- Create the workflow inside the .github/workflows/ directory.
- Define triggers for the workflow (e.g., on push or pull_request).
- Call the Python script check_query_identifiers.py from this workflow.
- Ensure that the workflow captures output from the script and shows it clearly in the GitHub Actions logs.
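For example, here's a minimal sketch of the script's entry point that the workflow would call. The `run_all_checks` helper is hypothetical; the key point is that a non-zero exit code is what makes the Actions step (and thus the workflow) fail, while everything printed shows up in the Actions logs:

```python
import sys


def run_all_checks() -> list[str]:
    """Hypothetical placeholder: would collect the error messages from the
    language and data type QID checks described below."""
    return []


def main() -> None:
    failures = run_all_checks()
    if failures:
        for message in failures:
            print(message)
        sys.exit(1)  # non-zero exit fails the Actions step and surfaces the log
    print("All query identifiers are valid.")


if __name__ == "__main__":
    main()
```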
**PR 2: `check_query_identifiers.py`**

The second PR is about writing the Python script that performs the actual checks.
Key tasks:
- Perform the following checks: a) is the language QID (`?lexeme dct:language wd:Q12345`) correct based on the directory the file is in? b) is the data type QID (for nouns, verbs, etc.) correct according to the query?
- Collect a list of files that fail the checks and print a user-friendly error message.
The script should show a list of queries that have incorrect QIDs, either for the language or data type.
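A sketch of how the script could pull the two QIDs out of a `.sparql` file (the regexes here are assumptions based on the identifier patterns quoted in the issue, so they'd need checking against the actual queries):

```python
import re
from pathlib import Path

# Patterns for the two identifiers discussed in the issue.
LANGUAGE_PATTERN = re.compile(r"dct:language\s+wd:(Q\d+)")
DATA_TYPE_PATTERN = re.compile(r"wikibase:lexicalCategory\s+wd:(Q\d+)")


def extract_qids(query_file: Path) -> tuple[str | None, str | None]:
    """Return the (language QID, data type QID) found in a query file."""
    text = query_file.read_text()
    language_match = LANGUAGE_PATTERN.search(text)
    data_type_match = DATA_TYPE_PATTERN.search(text)
    return (
        language_match.group(1) if language_match else None,
        data_type_match.group(1) if data_type_match else None,
    )
```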
Which one would you like to work on? I would prefer to implement `check_query_identifiers.py`, but either is fine with me. I have a fair understanding of GH Actions, so that'd be nice!
@andrewtavis, do you have any suggestions about this?
All PRs will be linked to this issue. But I believe it doesn't matter which one we choose to work on; we can work in parallel (to increase speed), and once we are done we can collaborate to ensure the workflow file connects with the method that does the checks in `check_query_identifiers.py`.

In summary, what I mean is: Task 1 depends on Task 2 to be complete, but it can be set up without Task 2. Does that make sense, @catreedle?
All sounds good to me, @DeleMike :) We'll just need to make sure to bring in the .py file before we deploy the workflow 😊
Yes, @DeleMike. I can work on the GitHub workflow. Thank you for the clear explanation. 🙏
Sure! Alright then! I'll raise a PR containing the checks.
Opened a PR. Feedback will be much appreciated. :)
Comment made :) Thanks much, @catreedle!
Another thing to note here, all, is whether we can write something to really verify the queries are correct at some point - so not in this issue. We're mostly using the same properties for each of the forms we're getting, but in various combinations. Maybe it would make sense to really do a naming convention for the forms, and then from there we can say that a form named in a certain way should be coming from `wikibase:grammaticalFeature` QIDs that match it. I.e. `FPSPresentIndicative` would have first person, singular, present and indicative in its `OPTIONAL` block.

100% a different issue and likely a lot of work, but could be cool 😊
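Purely as an illustration of that naming convention idea (the QID values below are placeholders, not verified grammatical feature items), the mapping could look something like:

```python
# Placeholder QIDs: the real grammatical feature items would need to be
# looked up on Wikidata before this could be used.
FEATURE_QIDS = {
    "FPS": ["Q_FIRST_PERSON", "Q_SINGULAR"],
    "Present": ["Q_PRESENT_TENSE"],
    "Indicative": ["Q_INDICATIVE_MOOD"],
}


def expected_features(form_name: str) -> list[str]:
    """e.g. "FPSPresentIndicative" -> the placeholder QIDs for all three parts."""
    return [
        qid
        for component, qids in FEATURE_QIDS.items()
        if component in form_name
        for qid in qids
    ]
```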
Yes, you are right, @andrewtavis. We'll need to have a form of standardization. I think it might require some extra time, especially when it comes to setting up a proper convention.
Very much so, but it'd be really beneficial once we figure it out :)
Hey @andrewtavis, I have a question. Now, I want to implement two checks: one for a valid data type and one for a valid language.

To check for a valid data type, we can check the `data_type_metadata.json` file, because it looks like:
```json
{
  "adjectives": "Q34698",
  "adverbs": "Q380057",
  "articles": "Q103184",
  "autosuggestions": "",
  "conjunctions": "Q36484",
  "emoji_keywords": "",
  "nouns": "Q1084",
  "personal_pronouns": "Q468801",
  "postpositions": "Q161873",
  "prepositions": "Q4833830",
  "pronouns": "Q36224",
  "proper_nouns": "Q147276",
  "verbs": "Q24905"
}
```
But for `language_metadata.json`, we don't have that (I mean, we have it, but we cannot rely on it). The file will not always have the complete set of languages found in the `language_data_extraction` directory.

Now, how do we verify whether a language QID is valid or not? Do we need extra research, or should we maintain a dictionary of `language: QID` pairs?

I believe `data_type_metadata.json` is manageable and easy to update by hand.

What are your thoughts?
I'm also thinking, @andrewtavis, about this:

```python
expected_language_qid = language_directory_qid_map.get(directory_name)  # might not work since the language_metadata file is not fully updated
expected_language_qid = fetch_language_qid_from_wikidata(directory_name)  # might be reliable if the user has an internet connection
```

What's the best approach here?
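If the local approach wins out, here's a sketch of building that map, assuming `language_metadata.json` maps language names to entries with a `qid` field (the actual schema and path may differ, and it only works once the metadata file is complete):

```python
import json

# Path and schema assumed: adjust to where language_metadata.json actually lives.
with open("language_metadata.json") as f:
    language_metadata = json.load(f)

# Directory names are capitalized (e.g. "English"), so capitalize the keys.
language_directory_qid_map = {
    name.capitalize(): entry["qid"] for name, entry in language_metadata.items()
}
```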
Also see the validation check for the data type:

```python
import json
from pathlib import Path

# Load the data type QIDs (path assumed; adjust to where the file lives).
with open("data_type_metadata.json") as f:
    data_type_metadata = json.load(f)


def is_valid_data_type(query_file: Path, data_type_qid: str) -> bool:
    directory_name = query_file.parent.name  # e.g., "nouns" or "verbs"
    expected_data_type_qid = data_type_metadata.get(directory_name)
    if data_type_qid != expected_data_type_qid:
        print(f"Incorrect data type QID in {query_file}. Found: {data_type_qid}, Expected: {expected_data_type_qid}")
        return False
    return True
```
From the above, creating checks to know if a data type is okay for the directory it is under is done ✅

So if we have the file `English/nouns/query_nouns.sparql` and we put in a wrong QID, then we will print something like this and return `False`:

```
Incorrect data type QID in English/nouns/query_nouns.sparql. Found: Q1086, Expected: Q1084
```
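One wrinkle the check above would probably need to handle: `data_type_metadata.json` has empty QIDs for `autosuggestions` and `emoji_keywords`, so a hypothetical guard like this could skip those directories rather than flag them:

```python
def should_check_data_type(directory_name: str) -> bool:
    """Hypothetical guard: skip data types with no QID in data_type_metadata.json
    (uses the data_type_metadata dict loaded in the snippet above)."""
    return bool(data_type_metadata.get(directory_name))
```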
> Now, how do we verify if a language QID is valid or not? Do we need extra research or maintain a dictionary of `language: QID` pairs?
I've been thinking about this as well, @DeleMike :) A basic thing that we can do is use Wikidata to check that the QID is for a language given its properties, but running a workflow check against an external service might not be the best, and it's again using Wikidata in a way that's not ideal, as we would be making lots of relatively unneeded calls. So long as the queries are returning forms, we should be confident that they're valid, and it's also on us to check them as they're being added. The checks here will also make sure that all QIDs for a single language are the same.
If we really wanted to be sure, we could write say a monthly workflow that checks that all the QIDs we're using are languages given properties on Wikidata and we could also check that the labels that we have for them are similar to their labels on Wikidata - i.e. is our label included in the Wikidata label or if not semantically similar to it. I wouldn't run this more than once a month though to not overuse resources :)
Let me know what you think on the above!
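If we did go the monthly route, a sketch of the "is this QID actually a language?" check could be an ASK query, assuming `SPARQLWrapper` and that Q34770 is the right "language" item (both worth double-checking):

```python
from SPARQLWrapper import JSON, SPARQLWrapper

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def qid_is_language(qid: str) -> bool:
    """Ask Wikidata whether the item is an instance of (a subclass of) language."""
    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="Scribe-Data QID check")
    sparql.setQuery(f"ASK {{ wd:{qid} wdt:P31/wdt:P279* wd:Q34770 }}")
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]
```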
Hi @andrewtavis, thanks for your input! You are right about the network calls being expensive! But when you say:

> So long as the queries are returning forms, we should be confident that they're valid...

you know, for some languages it doesn't return any result. "No matching records" is always the output if Wikidata does not have data for that language.

Also, please, could you explain:

> and it's also on us to check them as they're being added
Also, I totally agree with the monthly checks. It would be great to have a procedure that implements a monthly scheduled script that checks for any missing languages by comparing the `language_data_extraction` directory with the metadata file. It'll also validate that all QIDs in the metadata still correspond to actual languages on Wikidata, based on their properties.
I am thinking here that `language_metadata.json` will be important if we want to have more control. Can we not ensure that the metadata file has the languages we need? I'm thinking of a more local approach, so we have more control and avoid expensive network calls.

How best do you think we can have all the languages that exist in `language_data_extraction`, their QIDs and ISOs?
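For the "missing languages" part of that, a local comparison could be as simple as this sketch (the paths here are assumptions):

```python
import json
from pathlib import Path

# Paths assumed: adjust to the repository's actual layout.
extraction_dir = Path("src/scribe_data/language_data_extraction")
with open("language_metadata.json") as f:
    language_metadata = json.load(f)

directory_languages = {p.name.lower() for p in extraction_dir.iterdir() if p.is_dir()}
metadata_languages = {name.lower() for name in language_metadata}

missing = sorted(directory_languages - metadata_languages)
print(f"Languages missing from language_metadata.json: {missing}")
```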
Hi @DeleMike. I think this issue https://github.com/scribe-org/Scribe-Data/issues/340 will help make sure the languages between `language_metadata.json` and `language_data_extraction` are matching. QIDs and ISOs are out of its scope, though.
Yeah, @catreedle. It would help. And we need to make sure QIDs are present.

And since this issue depends on ensuring we have the metadata, should we turn our attention to #340 first?
> Also please, could you explain:
>
> and it's also on us to check them as they're being added
We need to be doing a good job as a community to make sure that we're merging in correct code, but we should have checks in place :)
> How best do you think we can have all the languages that exist in `language_data_extraction`, their QIDs and ISOs?
We'd need some kind of source of truth to check against for this, and Wikidata would be the only logical one. I think it could make sense to do a monthly check of the language metadata file and make sure that what we get back for the label, iso and other language metadata for the QIDs we have is sensible.
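A sketch of what that monthly label/ISO sanity check could look like, assuming the public Special:EntityData endpoint and P218 (ISO 639-1); the "similar" comparison is kept deliberately loose here:

```python
import requests


def wikidata_label_and_iso(qid: str) -> tuple[str, str | None]:
    """Fetch the English label and ISO 639-1 code (P218) for an item."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url, timeout=30).json()["entities"][qid]
    label = entity["labels"]["en"]["value"]
    p218_claims = entity["claims"].get("P218", [])
    iso = p218_claims[0]["mainsnak"]["datavalue"]["value"] if p218_claims else None
    return label, iso


def label_is_sensible(our_label: str, wikidata_label: str) -> bool:
    """Loose check: is our label included in the Wikidata label?"""
    return our_label.lower() in wikidata_label.lower()
```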
> Yeah @catreedle. It would help. And we need to make sure QIDs are present.
>
> And since this issue depends on ensuring we have the metadata, should we turn our attention to #340 first?
We could, if you think it's suitable :) Also, are you suggesting we add a check to make sure QIDs are present for each language for #340?
Yes, please. QIDs and ISOs are important. As for moving to that issue, I would like to hear @andrewtavis's thoughts. I believe this issue is related to #293. We need the metadata file to be updated / the issue completed to know our next moves.

However, I'm searching for other options.
Yes, we need the files, but if you all want to open a PR then we could edit it after the fact :) No stress on waiting, as there shouldn't be too many changes.
Where do we want to go from here with this, all? :)
Description
This issue would create a new workflow in `.github/workflows` called `check_query_identifiers.yaml` that would call a Python script to check all queries within the `language_data_extraction` directory and make sure that the identifiers used within them are appropriate. We can put these scripts in a new directory called `check`. The scripts would be:

`/src/scribe_data/check/check_query_identifiers.py` would check all queries in the `language_data_extraction` directory for two things:

- Is `?lexeme dct:language wd:Q12345` in the query appropriate given the directory that it's in?
- Is `wikibase:lexicalCategory wd:Q12345` appropriate?

Queries that fail these conditions should be added to a list and shown to the user in an output of the script and thus the workflow. Something like:
A code snippet that could help with this comes from #330:
Contribution
Happy to support, answer questions and review as needed!
CC @DeleMike and @catreedle :)