scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Add workflow to check queries #339

Open andrewtavis opened 2 hours ago

andrewtavis commented 2 hours ago


Description

This issue would create a new workflow in .github/workflows called check_query_identifiers.yaml that calls a Python script to check all queries within the language_data_extraction directory and make sure that the identifiers used within them are appropriate. We can put these scripts in a new check directory within .github/workflows. The scripts would check that each query uses the correct language QID and the correct data type QID (a sketch of the workflow follows below).
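As a minimal sketch, the workflow could look something like the following. The pull request trigger, the Python version, and the check_query_identifiers.py script name are assumptions here, since the exact script names are still open:

```yaml
name: Check Query Identifiers

on:
  pull_request:
    branches: [main]

jobs:
  check_query_identifiers:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"

      - name: Run the query identifier checks
        # Hypothetical script path; the exact script names are still open.
        run: python .github/workflows/check/check_query_identifiers.py
```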

Queries that fail these checks should be collected into a list and shown to the user in the script's output, and thus in the workflow's output. Something like:

There are queries that have incorrect language or data type identifiers.

Queries with incorrect language QIDs are:
- English/nouns/query_nouns.sparql
- ...

Queries with incorrect data type QIDs are:
- English/nouns/query_nouns.sparql  # i.e. a single file should be able to appear in both
- French/verbs/query_verbs_1.sparql
- ...
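As a rough sketch of the reporting side, a helper along these lines (report_invalid_queries is a hypothetical name) could print that output and exit nonzero so the workflow step fails:

```python
import sys


def report_invalid_queries(
    incorrect_language_qids: list[str], incorrect_data_type_qids: list[str]
) -> None:
    """Print the failure report and exit nonzero so that the workflow fails."""
    if not incorrect_language_qids and not incorrect_data_type_qids:
        return  # All queries passed both checks.

    print("There are queries that have incorrect language or data type identifiers.\n")

    if incorrect_language_qids:
        print("Queries with incorrect language QIDs are:")
        for query in incorrect_language_qids:
            print(f"- {query}")
        print()

    if incorrect_data_type_qids:
        print("Queries with incorrect data type QIDs are:")
        for query in incorrect_data_type_qids:
            print(f"- {query}")

    sys.exit(1)
```

Because the two lists are built independently, a single file can appear under both headings, as in the example above.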

A code snippet that could help with this comes from #330:

import re
from pathlib import Path


def extract_qid_from_sparql(file_path: Path) -> str | None:
    """
    Extract the QID from the specified SPARQL file.

    Args:
        file_path (Path): Path to the SPARQL file.

    Returns:
        str | None: The extracted QID or None if not found.
    """
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
            # Use regex to find the QID (e.g. wd:Q34311).
            match = re.search(r"wd:Q\d+", content)
            if match:
                return match.group(0).replace("wd:", "")  # Return the found QID.

    except OSError as e:
        print(f"Error reading {file_path}: {e}")

    return None  # Return None if not found or an error occurs.
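Building on that snippet, the collection side could look something like the sketch below. The <language>/<data type>/ directory layout is inferred from the example paths above, and the two QID lookup tables are hypothetical placeholders; in practice they would come from Scribe-Data's language and data type metadata. It also searches for all wd:Q identifiers rather than only the first, so a single file can fail both checks:

```python
import re
from pathlib import Path

# Hypothetical lookup tables: the real mappings would come from Scribe-Data's
# language and data type metadata. The QIDs here are illustrative.
LANGUAGE_QIDS = {"English": "Q1860", "French": "Q150"}
DATA_TYPE_QIDS = {"nouns": "Q1084", "verbs": "Q24905"}


def check_queries(extraction_dir: Path) -> tuple[list[str], list[str]]:
    """
    Return (incorrect_language_qids, incorrect_data_type_qids) as lists of
    query paths relative to extraction_dir.
    """
    incorrect_language_qids: list[str] = []
    incorrect_data_type_qids: list[str] = []

    for query_file in sorted(extraction_dir.rglob("*.sparql")):
        # Assumes the <language>/<data type>/<query>.sparql layout.
        language = query_file.parent.parent.name
        data_type = query_file.parent.name

        qids = set(re.findall(r"wd:(Q\d+)", query_file.read_text(encoding="utf-8")))
        rel_path = str(query_file.relative_to(extraction_dir))

        # Unknown languages or data types are flagged too, as .get() returns
        # None, which never matches an extracted QID.
        if LANGUAGE_QIDS.get(language) not in qids:
            incorrect_language_qids.append(rel_path)

        if DATA_TYPE_QIDS.get(data_type) not in qids:
            incorrect_data_type_qids.append(rel_path)

    return incorrect_language_qids, incorrect_data_type_qids
```

The two returned lists could then be handed to a reporter like the report_invalid_queries sketch above.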

Contribution

Happy to support, answer questions and review as needed!

CC @DeleMike and @catreedle :)

KesharwaniArpita commented 2 hours ago

Hi @andrewtavis, @DeleMike and @catreedle, can I also contribute to this issue?

DeleMike commented 2 hours ago

Thanks @andrewtavis. I would love to be assigned to this issue. I'll get started on it soon :)

andrewtavis commented 2 hours ago

Let's definitely let @DeleMike and @catreedle do the PRs for here and #340, @KesharwaniArpita, as @DeleMike was the writer of the snippets and @catreedle did the initial reviews :) I'll let them say if they want support here, but maybe you could do #341?

DeleMike commented 2 hours ago

I was thinking, @catreedle, do you think we could work together on this issue?

We could break this into two PRs: one for checking language appropriateness and the other for data type appropriateness. Is this okay, @andrewtavis?

I was also thinking that #341 could be suited to @KesharwaniArpita? It might be easier since @KesharwaniArpita was not in our initial discussions, and that issue seems self-explanatory. What do you think, @catreedle?

How about this, @andrewtavis?

KesharwaniArpita commented 2 hours ago

Ok, I get it. Thanks for telling me about the discussion ☺️

DeleMike commented 2 hours ago

> Let's definitely let @DeleMike and @catreedle do the PRs for here and #340, @KesharwaniArpita, as @DeleMike was the writer of the snippets and @catreedle did the initial reviews :) I'll let them say if they want support here, but maybe you could do #341?

Ah yes, I did not see this! You are right. We'll wait for feedback from @catreedle on whether she's comfortable with this :)

andrewtavis commented 2 hours ago

Assigning @KesharwaniArpita insofar as it'd be great if you all discussed the implementation together, but as @KesharwaniArpita's on #341, my assumption is that the coding for this and #340 will be done by @DeleMike and @catreedle 😊

KesharwaniArpita commented 2 hours ago

I'll happily be the learner here!!! 😃 😁 Thanks for considering me!!!!