ozekik / lightrdf

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3
Apache License 2.0

Incorrect parsing #7

Open kik0908 opened 3 years ago

kik0908 commented 3 years ago

Hi @ozekik!

I found a bug when parsing. I noticed it with the generations.rdf file, but a similar problem appeared with many other files. For some reason, the library parses this tag

<ns2:versionInfo rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An example ontology created by Matthew Horridge</ns2:versionInfo>

as this string

'"An example ontology created by Matthew Horridge"^^<http://www.w3.org/2001/XMLSchema#string>'

in the last item of the triple (triple[-1]).

I did not get this problem when using the rdflib library. Thanks.

ozekik commented 3 years ago

Thank you for reporting!

In fact, this is intentional for now, so that users can handle the different kinds of literals (string, date, etc.) on their own. For example:

import datetime
import re

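def parse_literal(literal):
    # Split a typed literal of the form "value"^^<datatype IRI>
    m = re.fullmatch(r'"(.*)"\^\^<(.*)>', literal, re.DOTALL)
    if m is None:
        raise ValueError(f"not a typed literal: {literal!r}")
    value, datatype = m.groups()
    if datatype == "http://www.w3.org/2001/XMLSchema#string":
        return value
    elif datatype == "http://www.w3.org/2001/XMLSchema#date":
        return datetime.date.fromisoformat(value)
    else:
        raise ValueError(f"unsupported datatype: {datatype}")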

lit1 = '"An example ontology created by Matthew Horridge"^^<http://www.w3.org/2001/XMLSchema#string>'
lit2 = '"2021-08-11"^^<http://www.w3.org/2001/XMLSchema#date>'

print(parse_literal(lit1))
# An example ontology created by Matthew Horridge

print(parse_literal(lit2).ctime())
# Wed Aug 11 00:00:00 2021

RDFLib, on the other hand, separates the value and datatype of a literal by wrapping it in rdflib.term.Literal (and an IRI in rdflib.term.URIRef). We could take the same approach as RDFLib, but at the cost of a steeper learning curve compared to plain strings (I'm on the fence).
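
To make the trade-off concrete, here is a minimal sketch of what a wrapper-based approach might look like on the user side. The names TypedLiteral and split_literal are hypothetical and are not part of lightrdf or RDFLib. The sketch assumes literals arrive as N-Triples-style strings (a quoted value, optionally followed by ^^<datatype> or @language) and does not handle escape sequences inside the value.

```python
import re
from typing import NamedTuple, Optional


class TypedLiteral(NamedTuple):
    """Hypothetical wrapper; not part of lightrdf or RDFLib."""
    value: str
    datatype: Optional[str] = None
    language: Optional[str] = None


# A quoted value, optionally followed by ^^<datatype IRI> or @language-tag.
# Assumption: escape sequences inside the value are left untouched.
_LITERAL_RE = re.compile(
    r'"(?P<value>.*)"'
    r'(?:\^\^<(?P<datatype>[^<>]*)>|@(?P<language>[A-Za-z0-9-]+))?',
    re.DOTALL,
)


def split_literal(term: str) -> TypedLiteral:
    """Separate the value of a literal string from its datatype or language."""
    m = _LITERAL_RE.fullmatch(term)
    if m is None:
        raise ValueError(f"not a literal: {term!r}")
    return TypedLiteral(m["value"], m["datatype"], m["language"])


lit = split_literal(
    '"An example ontology created by Matthew Horridge"'
    '^^<http://www.w3.org/2001/XMLSchema#string>'
)
print(lit.value)
print(lit.datatype)
```

With something like this, callers read lit.value directly instead of parsing the string themselves, but they have to learn one more type, which is the complexity mentioned above.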