vliz-be-opsci / py-RDF-store

module to interact with a memory or uristore

consider feature to cleanup possible bad URIRef #46

Closed marc-portier closed 6 months ago

marc-portier commented 7 months ago

We noticed that some ttl files contain references to DOI URIs that are not valid.

for example:

contains:

These URIs are nevertheless accepted by rdflib.parse, turtle validators, ... and only produce errors when we try to insert them into the graphdb.

We might consider cleaning these graphs with code like this:

# suggested pyrdfstore/tools.py
from rdflib import BNode, Graph, Literal, URIRef
from urllib.parse import quote
import validators

def check_valid_uri(uri: str) -> bool:
    # validators.url returns True or a falsy failure object, hence bool()
    return bool(validators.url(uri))

def clean_uri(uri: str) -> str:
    # percent-encode everything outside the safe set; note that '%' is not
    # in the safe set, so already-escaped sequences get encoded again
    return quote(uri, safe="~@#$&()*!+=:;,?/'")

def clean_node(ref: URIRef | BNode | Literal) -> URIRef | BNode | Literal:
    if not isinstance(ref, URIRef):
        return ref  # nothing to do if not URIRef
    uri = str(ref)
    if check_valid_uri(uri):
        return ref  # nothing to do if uri is valid
    return URIRef(clean_uri(uri))

def clean_graph(bgraph: Graph) -> Graph:
    # build a new graph; the input graph is left untouched
    cgraph: Graph = Graph()
    for btriple in bgraph.triples((None, None, None)):  # all triples
        ctriple = tuple(clean_node(node) for node in btriple)
        cgraph.add(ctriple)
    return cgraph
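
To make the effect of the quoting step concrete, here is a small standard-library sketch of what `clean_uri` does to a single string. The sample URI is hypothetical (it is not taken from the affected ttl files), and `SAFE_CHARS` is just a name introduced here for the safe set used above.

```python
from urllib.parse import quote

# Same safe set as in the suggested clean_uri; SAFE_CHARS is a name
# introduced for this sketch only.
SAFE_CHARS = "~@#$&()*!+=:;,?/'"


def clean_uri(uri: str) -> str:
    """Percent-encode every character outside the safe set."""
    return quote(uri, safe=SAFE_CHARS)


# Hypothetical DOI-style URI containing angle brackets.
bad_uri = "https://doi.org/10.1000/abc<123>def"
cleaned = clean_uri(bad_uri)
print(cleaned)  # https://doi.org/10.1000/abc%3C123%3Edef

# Caveat: '%' is not in the safe set, so a URI that already contains
# percent-escapes is encoded a second time.
already_escaped = "https://doi.org/10.1000/abc%3C123%3Edef"
twice = clean_uri(already_escaped)
print(twice)  # https://doi.org/10.1000/abc%253C123%253Edef
```

In `clean_node` this double encoding only happens when `check_valid_uri` rejects a URI that already contains escapes. If that case matters, adding `%` to the safe set would be one option to consider.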

Some remaining questions though:

cedricdcc commented 7 months ago

https://github.com/vliz-be-opsci/py-RDF-store/pull/48 => pull request for this issue

marc-portier commented 6 months ago

closed in PR #49