Closed marc-portier closed 6 months ago
We noticed that some ttl files contain references to DOI URIs that are not valid.
for example:
contains:
schema:accessURL <http://dx.doi.org/10.1656/1092-6194(2004)11[261:CBIAHC]2.0.CO;2> ;
These URIs are nevertheless accepted by rdflib.parse, turtle validators, ... and only produce errors when we try to insert them into the graphdb.
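A minimal sketch of how such URIs could be spotted before insertion, using only the standard library. The helper name `suspect_chars` and the allowed-character set are assumptions made for illustration; they are not part of pyrdfstore or rdflib. The check is deliberately simplified: it flags `[` and `]` everywhere, although RFC 3986 permits them around an IP literal in the host part.

```python
import re

# Characters RFC 3986 permits somewhere in a URI: unreserved, sub-delims,
# the delimiters ":", "@", "/", "?", "#", plus "%" for percent-escapes.
# "[" and "]" are left out on purpose (simplification: IP-literal hosts
# such as http://[::1]/ would be flagged by this check as well).
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@/?#%]")


def suspect_chars(uri: str) -> list[str]:
    """Return the distinct characters of `uri` outside the allowed set,
    in order of first appearance (hypothetical helper)."""
    seen: list[str] = []
    for ch in uri:
        if not _ALLOWED.match(ch) and ch not in seen:
            seen.append(ch)
    return seen


doi = "http://dx.doi.org/10.1656/1092-6194(2004)11[261:CBIAHC]2.0.CO;2"
print(suspect_chars(doi))  # ['[', ']']
```

For the example above, only the square brackets in the path are reported, which suggests they are what needs escaping.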
we might consider cleaning these graphs with code like this:
# suggested pyrdfstore/tools.py
from rdflib import Graph, URIRef, Literal, BNode
from urllib.parse import quote
import validators


def check_valid_uri(uri: str) -> bool:
    return bool(validators.url(uri))


def clean_uri(uri: str) -> str:
    return quote(uri, safe='~@#$&()*!+=:;,?/\'')


def clean_node(ref: URIRef | BNode | Literal) -> URIRef | BNode | Literal:
    if not isinstance(ref, URIRef):
        return ref  # nothing to do if not URIRef
    # else
    uri = str(ref)
    if check_valid_uri(uri):
        return ref  # nothing to do if uri is valid
    # else
    return URIRef(clean_uri(uri))


def clean_graph(bgraph: Graph) -> Graph:
    cgraph: Graph = Graph()
    for btriple in bgraph.triples((None, None, None)):  # all triples
        ctriple = tuple(clean_node(node) for node in btriple)
        cgraph.add(ctriple)
    return cgraph
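To illustrate what the proposed `clean_uri` would do to the DOI from the example, here is a small stand-alone sketch. It only uses `urllib.parse.quote` with the same `safe` set as above; the constant name `SAFE` is introduced here for illustration.

```python
from urllib.parse import quote

# Same `safe` set as in the clean_uri proposed above
SAFE = "~@#$&()*!+=:;,?/'"

doi = "http://dx.doi.org/10.1656/1092-6194(2004)11[261:CBIAHC]2.0.CO;2"

cleaned = quote(doi, safe=SAFE)
print(cleaned)
# http://dx.doi.org/10.1656/1092-6194(2004)11%5B261:CBIAHC%5D2.0.CO;2

# "%" is not in the safe set, so quoting an already-escaped URI
# escapes it a second time ("%5B" becomes "%255B").
twice = quote(cleaned, safe=SAFE)
print(twice == cleaned)  # False
```

Only the square brackets are percent-encoded; the parentheses, colon and semicolon are kept because they are in the `safe` set. Since `%` is not in that set, the cleaning is not idempotent when `quote` is applied directly. In the proposal this is presumably avoided because `clean_node` only calls `clean_uri` on URIs that fail `check_valid_uri`.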
Some remaining questions though:
from pyrdfstore.tools import clean_graph
https://github.com/vliz-be-opsci/py-RDF-store/pull/48 => pull request for this issue
closed in PR #49