scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Add workflow to check queries #339

Closed: andrewtavis closed this issue 3 weeks ago

andrewtavis commented 1 month ago

Description

This issue would create a new workflow in .github/workflows called check_query_identifiers.yaml that calls a Python script to check all queries within the language_data_extraction directory and make sure that the identifiers used within them are appropriate. We can put these scripts in a new .github/workflows directory called check. The scripts would be:

Queries that fail these conditions should be added to a list and shown to the user in the output of the script and thus the workflow. Something like:

There are queries that have incorrect language or data type identifiers.

Queries with incorrect languages QIDs are:
- English/nouns/query_nouns.sparql
- ...

Queries with incorrect data type QIDs are:
- English/nouns/query_nouns.sparql  # i.e. a single file should be able to appear in both
- French/verbs/query_verbs_1.sparql
- ...
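A minimal sketch of collecting these failures and printing the report above (the function and argument names are illustrative, not part of the issue):

```python
def report_invalid_queries(
    incorrect_language_qids: list[str], incorrect_data_type_qids: list[str]
) -> bool:
    """Print a report of queries with invalid QIDs; return True if any were found."""
    if not incorrect_language_qids and not incorrect_data_type_qids:
        return False

    print("There are queries that have incorrect language or data type identifiers.")
    if incorrect_language_qids:
        print("\nQueries with incorrect language QIDs are:")
        for query in incorrect_language_qids:
            print(f"- {query}")

    if incorrect_data_type_qids:
        print("\nQueries with incorrect data type QIDs are:")
        for query in incorrect_data_type_qids:
            print(f"- {query}")

    return True
```

The workflow could then call sys.exit(1) when this returns True so the check fails visibly in the Actions log.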

A code snippet that could help with this comes from #330:

import re
from pathlib import Path


def extract_qid_from_sparql(file_path: Path) -> str | None:
    """
    Extract the QID from the specified SPARQL file.

    Args:
        file_path (Path): Path to the SPARQL file.

    Returns:
        str | None: The extracted QID or None if not found.
    """
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()

        # Use regex to find the QID (e.g. wd:Q34311).
        match = re.search(r"wd:Q\d+", content)
        if match:
            return match.group(0).replace("wd:", "")  # Return the found QID.

    except OSError as e:
        print(f"Error reading {file_path}: {e}")

    return None  # Return None if not found or an error occurs.

Contribution

Happy to support, answer questions and review as needed!

CC @DeleMike and @catreedle :)

KesharwaniArpita commented 1 month ago

Hi @andrewtavis, @DeleMike and @catreedle, can I also contribute to this issue?

DeleMike commented 1 month ago

Thanks, @andrewtavis. I would love to be assigned to this issue. I'll get started on it soon :)

andrewtavis commented 1 month ago

Let's definitely let @DeleMike and @catreedle do the PRs for here and #340, @KesharwaniArpita, as @DeleMike was the writer of the snippets and @catreedle did the initial reviews :) I'll let them say if they want support here, but maybe you could do #341?

DeleMike commented 1 month ago

I was thinking, @catreedle, do you think we could work together on this issue?

We could break this into two PRs, one for checking language appropriateness and the other for data type appropriateness. Is this okay, @andrewtavis?

I was thinking that #341 could be suited for @KesharwaniArpita? It might be easier, since @KesharwaniArpita was not in our initial discussions. That issue seems self-explanatory. What do you think, @catreedle?

How about this, @andrewtavis?

KesharwaniArpita commented 1 month ago

Ok, I get it. Thanks for telling me about the discussion ☺️

DeleMike commented 1 month ago

Let's definitely let @DeleMike and @catreedle do the PRs for here and #340, @KesharwaniArpita, as @DeleMike was the writer of the snippets and @catreedle did the initial reviews :) I'll let them say if they want support here, but maybe you could do #341?

Ah yes! I did not see this!! You are right. We'll wait for feedback from @catreedle to see if she's comfortable with this :)

andrewtavis commented 1 month ago

Assigning @KesharwaniArpita insofar as it'd be great if you all would discuss the implementation together, but as @KesharwaniArpita's on #341, my assumption is that the coding for this and #340 will be done by @DeleMike and @catreedle 😊

KesharwaniArpita commented 1 month ago

I'll happily be the learner here!!! 😃 😁 Thanks for considering me!!!!

catreedle commented 1 month ago

I was thinking, @catreedle, do you think we could work together on this issue?

We could break this into two PRs, one for checking language appropriateness and the other for data type appropriateness. Is this okay, @andrewtavis?

I was thinking that #341 could be suited for @KesharwaniArpita? It might be easier, since @KesharwaniArpita was not in our initial discussions. That issue seems self-explanatory. What do you think, @catreedle?

How about this, @andrewtavis?

Good idea, @DeleMike. :) Which one are you working on? I'll take the other.

catreedle commented 1 month ago

Thank you for the learning opportunity, @andrewtavis @DeleMike @KesharwaniArpita 😊

DeleMike commented 1 month ago

Great, @catreedle! How about this? Let's work on two PRs that will resolve this issue. Here is what I propose:

1. Create the Workflow: check_query_identifiers.yaml

The first PR will be about setting up the GitHub Actions workflow file (check_query_identifiers.yaml).

Key tasks:

  1. Create the workflow inside the .github/workflows/ directory.
  2. Define triggers for the workflow (e.g., on push or pull_request).
  3. Call the Python script check_query_identifiers.py from this workflow.
  4. Ensure that the workflow captures output from the script and shows it clearly in the GitHub Actions logs.

2. Implement the Python Script: check_query_identifiers.py

This is about writing the Python script that performs the actual checks.

Key tasks:

  1. Perform the following checks: a) Is the language QID (?lexeme dct:language wd:Q12345) correct based on the directory the file is in? b) Is the data type QID (for nouns, verbs, etc.) correct according to the query?
  2. Collect a list of files that fail the checks and print a user-friendly error message.

Output format:

The script should show a list of queries that have incorrect QIDs, either for the language or data type.
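The two checks under this proposal might be sketched as follows. The metadata dictionaries below are illustrative stand-ins for the repo's metadata files, and the wikibase:lexicalCategory pattern for the data type check is an assumption about how the queries are written:

```python
import re
from pathlib import Path

# Illustrative metadata; the real script would load these from the metadata JSON files.
LANGUAGE_QIDS = {"English": "Q1860", "French": "Q150"}
DATA_TYPE_QIDS = {"nouns": "Q1084", "verbs": "Q24905"}


def check_query(query_file: Path, content: str) -> tuple[bool, bool]:
    """Return (language_ok, data_type_ok) for a query under Language/data_type/."""
    language = query_file.parent.parent.name  # e.g. "English"
    data_type = query_file.parent.name  # e.g. "nouns"

    lang_match = re.search(r"dct:language\s+wd:(Q\d+)", content)
    language_ok = bool(lang_match) and lang_match.group(1) == LANGUAGE_QIDS.get(language)

    type_match = re.search(r"wikibase:lexicalCategory\s+wd:(Q\d+)", content)
    data_type_ok = bool(type_match) and type_match.group(1) == DATA_TYPE_QIDS.get(data_type)

    return language_ok, data_type_ok
```

Files failing either check would then be appended to the corresponding output list.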

Which one would you like to work on? I would prefer implementing check_query_identifiers.py, but either is fine with me. I have a fair understanding of GH Actions, so that'd be nice!

@andrewtavis, do you have any suggestions about this?

DeleMike commented 1 month ago

All PRs will be linked to this issue. But I believe it doesn't matter which part we each choose to work on; we can work in parallel (to increase speed), and once we are done we can collaborate to ensure the workflow file connects with the method that does the checks in check_query_identifiers.py.

In summary, what I mean is that Task 1 depends on Task 2 to be complete, but it can be set up before Task 2 is finished. Does that make sense, @catreedle?

andrewtavis commented 1 month ago

All sounds good to me, @DeleMike :) We'll just need to make sure to bring in the .py file before we deploy the workflow 😊

catreedle commented 1 month ago

Yes, @DeleMike. I can work on the GitHub workflow. Thank you for the clear explanation. 😊

DeleMike commented 1 month ago

Sure! Alright then! I'll raise a PR containing the checks.

catreedle commented 1 month ago

Opened a PR. Feedback will be much appreciated. :)

andrewtavis commented 1 month ago

Comment made :) Thanks much, @catreedle!

andrewtavis commented 1 month ago

Another thing to note here, all, is whether at some point we can write something to really verify that the queries are correct - so not in this issue. We're mostly using the same properties for each of the forms we're getting, but in various combinations. Maybe it would make sense to really define a naming convention for the forms, and then from there we can say that a form named in a certain way should be coming from wikibase:grammaticalFeature QIDs that match. I.e. FPSPresentIndicative would have first person, singular, present and indicative in its OPTIONAL block.

100% a different issue and likely a lot of work, but could be cool 😊

DeleMike commented 1 month ago

Yes, you are right, @andrewtavis. We'll need to have a form of standardization. I think it might require some extra time, especially when it comes to setting up a proper convention.

andrewtavis commented 1 month ago

Very much so, but it'd be really beneficial once we figure it out :)

DeleMike commented 1 month ago

Hey @andrewtavis, I have a question. Now, I want to implement two checks: one for a valid data type and one for a valid language.

To check for a valid data type, we can use the data_type_metadata.json file, because it looks like:

{
  "adjectives": "Q34698",
  "adverbs": "Q380057",
  "articles": "Q103184",
  "autosuggestions": "",
  "conjunctions": "Q36484",
  "emoji_keywords": "",
  "nouns": "Q1084",
  "personal_pronouns": "Q468801",
  "postpositions": "Q161873",
  "prepositions": "Q4833830",
  "pronouns": "Q36224",
  "proper_nouns": "Q147276",
  "verbs": "Q24905"
}
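A sketch of consuming this file for the data type check (the file path passed in is whatever the repo uses; dropping the entries with empty QIDs, such as autosuggestions, is an assumption):

```python
import json
from pathlib import Path


def load_data_type_qids(metadata_path: Path) -> dict[str, str]:
    """Load data type -> QID pairs, skipping entries that have no QID."""
    with open(metadata_path, encoding="utf-8") as f:
        metadata = json.load(f)
    return {data_type: qid for data_type, qid in metadata.items() if qid}
```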

But for language_metadata.json, we don't have that (I mean, we have it, but we cannot rely on it). The file will almost always be missing some of the languages in the language_data_extraction directory.

Now, how do we verify whether a language QID is valid? Do we need extra research, or should we maintain a dictionary of language:QID pairs?

I believe data_type_metadata.json is manageable and easy to update by hand.

what are your thoughts?

DeleMike commented 1 month ago

I'm also thinking, @andrewtavis, about this:

expected_language_qid = language_directory_qid_map.get(directory_name)  # might not work, since the language_metadata file is not fully updated
expected_language_qid = fetch_language_qid_from_wikidata(directory_name)  # might be reliable, if the user has an internet connection

what's the best approach here?


Also see the validation check for valid_data_type:

def is_valid_data_type(query_file, data_type_qid):
    directory_name = query_file.parent.name  # e.g., "nouns" or "verbs"
    expected_data_type_qid = data_type_metadata.get(directory_name)

    if data_type_qid != expected_data_type_qid:
        print(f"Incorrect data type QID in {query_file}. Found: {data_type_qid}, Expected: {expected_data_type_qid}")
        return False
    return True

From the above, creating checks to know if a data type is okay for the directory it is under is done ✅ So if we have the file English/nouns/query_nouns.sparql and we put in a wrong QID, then we will throw something like this and return False:

Incorrect data type QID in English/nouns/query_nouns.sparql. Found: Q1086, Expected: Q1084

andrewtavis commented 1 month ago

Hey @andrewtavis, I have a question. Now, I want to implement two checks: one for a valid data type and one for a valid language.

To check for a valid data type, we can use the data_type_metadata.json file, because it looks like:


{
  "adjectives": "Q34698",
  "adverbs": "Q380057",
  "articles": "Q103184",
  "autosuggestions": "",
  "conjunctions": "Q36484",
  "emoji_keywords": "",
  "nouns": "Q1084",
  "personal_pronouns": "Q468801",
  "postpositions": "Q161873",
  "prepositions": "Q4833830",
  "pronouns": "Q36224",
  "proper_nouns": "Q147276",
  "verbs": "Q24905"
}

But for language_metadata.json, we don't have that (I mean, we have it, but we cannot rely on it). The file will almost always be missing some of the languages in the language_data_extraction directory.

Now, how do we verify whether a language QID is valid? Do we need extra research, or should we maintain a dictionary of language:QID pairs?

I've been thinking about this as well, @DeleMike :) A basic thing we can do is check against Wikidata that the QID is for a language, given its properties, but running a workflow check based on an external service might not be the best, and it would again use Wikidata in a way that's not ideal, as we would be making lots of relatively unneeded calls. So long as the queries are returning forms, we should be confident that they're valid, and it's also on us to check them as they're being added. The checks here will also make sure that all QIDs for a single language are the same.

If we really wanted to be sure, we could write, say, a monthly workflow that checks that all the QIDs we're using are languages, given their properties on Wikidata, and we could also check that the labels we have for them are similar to their labels on Wikidata - i.e. is our label included in the Wikidata label or, if not, semantically similar to it. I wouldn't run this more than once a month though, to not overuse resources :)
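That label comparison could be sketched like this. The containment-then-similarity logic and the 0.8 difflib threshold are illustrative assumptions, not a decided design:

```python
from difflib import SequenceMatcher


def labels_agree(our_label: str, wikidata_label: str, threshold: float = 0.8) -> bool:
    """True if our label is contained in the Wikidata label or closely resembles it."""
    ours, theirs = our_label.casefold(), wikidata_label.casefold()
    if ours in theirs:
        return True
    # Fall back to a rough string-similarity ratio.
    return SequenceMatcher(None, ours, theirs).ratio() >= threshold
```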

Let me know what you think on the above!

DeleMike commented 1 month ago

Hi @andrewtavis, thanks for your input! You are right about the network calls being expensive! But when you say:

So long as the queries are returning forms, we should be confident that they're valid...

You know, for some languages it doesn't return any result; "No matching records" is always the output if Wikidata does not have data for that language.

Also please, could you explain:

and it's also on us to check them as they're being added


Also, I totally agree with the monthly checks. It'd be great to have a procedure that:

Implements a monthly scheduled script that checks for any missing languages by comparing the language_data_extraction directory with the metadata file. It’ll also validate that all QIDs in the metadata still correspond to actual languages on Wikidata, based on their properties.

I am thinking here that language_metadata.json will be important if we want to have more control. Can we not ensure that the metadata file has the languages we need? I'm thinking of a more local approach so we keep control and avoid expensive network calls.

How best do you think we can record all the languages that exist in language_data_extraction, along with their QIDs and ISOs?
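A local comparison along these lines, independent of any network calls, might look like the following. The metadata file layout (top-level language keys) is an assumption about the repo structure:

```python
import json
from pathlib import Path


def find_missing_languages(extraction_dir: Path, metadata_path: Path) -> set[str]:
    """Language directories present in extraction_dir but absent from the metadata file."""
    with open(metadata_path, encoding="utf-8") as f:
        metadata = json.load(f)
    directory_languages = {p.name for p in extraction_dir.iterdir() if p.is_dir()}
    return directory_languages - set(metadata)
```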

catreedle commented 1 month ago

Hi @andrewtavis, thanks for your input! You are right about the network calls being expensive! But when you say:

So long as the queries are returning forms, we should be confident that they're valid...

You know, for some languages it doesn't return any result; "No matching records" is always the output if Wikidata does not have data for that language.

Also please, could you explain:

and it's also on us to check them as they're being added

Also, I totally agree with the monthly checks. It'd be great to have a procedure that:

Implements a monthly scheduled script that checks for any missing languages by comparing the language_data_extraction directory with the metadata file. It’ll also validate that all QIDs in the metadata still correspond to actual languages on Wikidata, based on their properties.

I am thinking here that language_metadata.json will be important if we want to have more control. Can we not ensure that the metadata file has the languages we need? I'm thinking of a more local approach so we keep control and avoid expensive network calls.

How best do you think we can record all the languages that exist in language_data_extraction, along with their QIDs and ISOs?

Hi @DeleMike. I think this issue https://github.com/scribe-org/Scribe-Data/issues/340 will help make sure the languages in language_metadata.json and language_data_extraction match. QIDs and ISOs are out of its scope, though.

DeleMike commented 1 month ago

Yeah, @catreedle. It would help. And we need to make sure QIDs are present.

And since this issue depends on ensuring we have the metadata, should we turn our attention to #340 first?

andrewtavis commented 1 month ago

Also please, could you explain:

and it's also on us to check them as they're being added

We need to do a good job as a community of making sure that we're merging in correct code, but we should also have checks in place :)

How best do you think we can record all the languages that exist in language_data_extraction, along with their QIDs and ISOs?

We'd need some kind of source of truth to check against for this, and Wikidata would be the only logical one. I think it could make sense to do a monthly check of the language metadata file, making sure that what we get back for the label, ISO and other language metadata for the QIDs we have is sensible.

catreedle commented 1 month ago

Yeah, @catreedle. It would help. And we need to make sure QIDs are present.

And since this issue depends on ensuring we have the metadata, should we turn our attention to #340 first?

We could, if you think it's suitable :) Also, are you suggesting we add a check to make sure QIDs are present for each language in #340?

DeleMike commented 1 month ago

Yes, please. QIDs and ISOs are important. As for moving to that issue, I would like to hear @andrewtavis's thoughts. I believe this issue is related to #293. We need the metadata file to be updated / that issue completed to know our next moves.

However, I'm still exploring other options.

andrewtavis commented 1 month ago

Yes, we need the files, but if you all want to open a PR, we could edit it after the fact :) No stress about waiting, as there shouldn't be too many changes.

andrewtavis commented 3 weeks ago

Where do we want to go from here with this, all? :)