nvkp / turtle

Golang package for parsing and serializing the Turtle (.ttl) format used for representing RDF data
MIT License

Incorrect parsing of turtle file #9

Closed jonnyschaefer closed 3 months ago

jonnyschaefer commented 4 months ago

Hello,

thank you for providing this package. I noticed that it does not parse https://schema.org/version/latest/schemaorg-current-https.ttl correctly.

The following code prints the triples that do not have a schema.org IRI as subject, i.e. the places where the Subject/Predicate/Object fields are mixed up:

package main

import (
    "fmt"
    "log"
    "os"
    "strings"

    "github.com/nvkp/turtle"
)

func main() {
    var triples = []struct {
        Subject   string `turtle:"subject"`
        Predicate string `turtle:"predicate"`
        Object    string `turtle:"object"`
    }{}

    // https://schema.org/version/latest/schemaorg-current-https.ttl
    file, err := os.ReadFile("schemaorg-current-https.ttl")
    if err != nil {
        log.Fatal(err)
    }

    err = turtle.Unmarshal(file, &triples)
    if err != nil {
        log.Fatal(err)
    }

    for _, t := range triples {
        if !strings.HasPrefix(t.Subject, "https") {
            fmt.Println(t)
        }
    }
}

Output:

{Amazing Spider-Man" or "Groo the}
{Amazing http://www.w3.org/2000/01/rdf-schema#/subClassOf https://schema.org/Periodical}
{Amazing https://schema.org/isPartOf https://bib.schema.org}
...

I have not yet found the cause, but it seems to break on this entry:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .

schema:ComicSeries a rdfs:Class ;
    rdfs:label "ComicSeries" ;
    rdfs:comment """A sequential publication of comic stories under a
        unifying title, for example "The Amazing Spider-Man" or "Groo the
        Wanderer".""" ;
    rdfs:subClassOf schema:Periodical ;
    schema:isPartOf <https://bib.schema.org> .

results in

Subject:   https://schema.org/ComicSeries
Predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
Object:    http://www.w3.org/2000/01/rdf-schema#/Class

Subject:   https://schema.org/ComicSeries
Predicate: http://www.w3.org/2000/01/rdf-schema#/label
Object:    ComicSeries

Subject:   https://schema.org/ComicSeries
Predicate: http://www.w3.org/2000/01/rdf-schema#/comment
Object:    A sequential publication of comic stories under a
        unifying title, for example "The

Subject:   Amazing
Predicate: Spider-Man" or "Groo
Object:    the

Subject:   Amazing
Predicate: http://www.w3.org/2000/01/rdf-schema#/subClassOf
Object:    https://schema.org/Periodical

Subject:   Amazing
Predicate: https://schema.org/isPartOf
Object:    https://bib.schema.org

I hope this helps improve the package.
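
For reference, the output above suggests that the long string in rdfs:comment is not being read through to its closing delimiter. In the Turtle grammar, a literal opened with three double quotes may contain single or doubled double quotes and line breaks, and it ends only at the next run of three double quotes. Below is a minimal sketch of that rule in Go. scanLongString is a hypothetical helper written for illustration; it is not part of nvkp/turtle, and the package's actual tokenizer may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// scanLongString is a hypothetical helper, not part of nvkp/turtle.
// Given input that begins with the three-quote opener, it returns the raw
// content of the literal and the number of bytes consumed, including both
// delimiters. Escape sequences are kept as written, not decoded.
func scanLongString(input string) (content string, consumed int, err error) {
	const delim = `"""`
	if !strings.HasPrefix(input, delim) {
		return "", 0, fmt.Errorf("input does not start with %s", delim)
	}
	var b strings.Builder
	i := len(delim)
	for i < len(input) {
		switch {
		case input[i] == '\\' && i+1 < len(input):
			// Copy the backslash and the next byte together so that an
			// escaped quote can never be mistaken for part of the delimiter.
			b.WriteByte(input[i])
			b.WriteByte(input[i+1])
			i += 2
		case strings.HasPrefix(input[i:], delim):
			// Only three consecutive quotes close the literal.
			return b.String(), i + len(delim), nil
		default:
			// A lone quote, whitespace, or a newline is ordinary content.
			b.WriteByte(input[i])
			i++
		}
	}
	return "", 0, fmt.Errorf("unterminated long string")
}

func main() {
	src := `"""for example "The Amazing Spider-Man" or "Groo the
        Wanderer".""" ;`
	content, consumed, err := scanLongString(src)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("content: %q\n", content)
	fmt.Printf("rest:    %q\n", src[consumed:])
}
```

With a scanner of this shape, the inner quotes around the two titles stay inside the object, and the next token after the literal is the ";" that continues the predicate list.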

nvkp commented 4 months ago

Hello, thank you very much for finding this out!

There are multiple issues, and I am currently working on fixing them all. I will let you know when the fix is ready.

nvkp commented 4 months ago

The whole file that you provided should now parse correctly with the latest tag, v1.0.3. Could you please try it out?

jonnyschaefer commented 3 months ago

Hello,

Thank you. I really appreciate the fix. I can now parse the file completely. What I noticed is that when outputting the triples, some IRI fragments start with a slash and some do not, for example http://www.w3.org/1999/02/22-rdf-syntax-ns#type versus http://www.w3.org/1999/02/22-rdf-syntax-ns#/Property. I do not think that the / is correct here, but I am not too familiar with the Turtle standard.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .

schema:identifier a rdf:Property ;
    rdfs:label "identifier" ;
    rdfs:comment """The identifier property represents any kind of identifier for any kind of [[Thing]], such as ISBNs, GTIN codes, UUIDs etc. Schema.org provides dedicated properties for representing many of these, either as textual strings or as URL (URI) links. See [background notes](/docs/datamodel.html#identifierBg) for more details.
        """ ;
    owl:equivalentProperty dcterms:identifier ;
    schema:domainIncludes schema:Thing ;
    schema:rangeIncludes schema:PropertyValue,
        schema:Text,
        schema:URL .

results in

Subject:   https://schema.org/identifier
Predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
Object:    http://www.w3.org/1999/02/22-rdf-syntax-ns#/Property

Subject:   https://schema.org/identifier
Predicate: http://www.w3.org/2000/01/rdf-schema#/label
Object:    identifier

...
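
For comparison, the Turtle specification resolves a prefixed name by concatenating the IRI bound to the prefix with the local part, with no separator added. Under that rule rdf:Property becomes http://www.w3.org/1999/02/22-rdf-syntax-ns#Property, without the slash. A minimal sketch in Go follows. expandPrefixedName is a hypothetical helper used only for illustration; it is not the nvkp/turtle API, and it ignores escapes in the local part.

```go
package main

import (
	"fmt"
	"strings"
)

// expandPrefixedName is a hypothetical helper, not part of nvkp/turtle.
// It resolves a prefixed name such as "rdf:Property" against a prefix map
// whose keys are prefix labels without the trailing colon.
func expandPrefixedName(prefixes map[string]string, name string) (string, error) {
	prefix, local, ok := strings.Cut(name, ":")
	if !ok {
		return "", fmt.Errorf("%q is not a prefixed name", name)
	}
	ns, known := prefixes[prefix]
	if !known {
		return "", fmt.Errorf("unknown prefix %q", prefix)
	}
	// Plain concatenation: the namespace IRI already ends in "#" or "/",
	// so nothing is inserted between it and the local part.
	return ns + local, nil
}

func main() {
	prefixes := map[string]string{
		"rdf":    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
		"rdfs":   "http://www.w3.org/2000/01/rdf-schema#",
		"schema": "https://schema.org/",
	}
	for _, name := range []string{"rdf:Property", "rdfs:label", "schema:identifier"} {
		iri, err := expandPrefixedName(prefixes, name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name, "=>", iri)
	}
}
```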