innoq / iqvoc

iQvoc - A SKOS(-XL) Vocabulary Management System for the Semantic Web
http://iqvoc.net/

SKOS importer doesn't like special characters #346

Open mgbeyer opened 9 years ago

mgbeyer commented 9 years ago

If the subject of an N-Triples line contains characters such as a slash (/) or hash (#), the importer rejects the triple (example: "WARN -- : SkosImporter: Invalid origin. Skipping :concept/#Abbreviations rdf:type skos:concept"). But / and # are normal parts of a URI. For example, one of the thesauri we'd like to import into iQvoc uses multiple path levels below the default namespace to distinguish between actual concepts, personal classes, and properties (among others). If you strip the leading default namespace from a subject string (as the importer does), the remaining part of the URI still contains slashes and is rejected by the importer.

In general, a URI should be allowed to contain UTF-8 special characters to support regional character sets. So I wonder why the importer actively rejects characters beyond the minimal set of " a-zA-Z0-9_.-". Was it a deliberate design decision with a sound purpose, and am I missing something here? If you could elaborate on that a little, I would greatly appreciate it.

mjansing commented 9 years ago

I can't reproduce the problem. Please provide more information about the imported triples. The fragment identifier should be the last part of a URI (after the filename; your leading slash looks a bit odd).
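As a rough illustration of how such a subject splits into path and fragment, here is a small sketch using Ruby's standard `uri` library. The base `http://example.org/vocab/` is a placeholder assumed for this example, not the namespace from the report.

```ruby
require "uri"

# example.org is a placeholder base, not the reporter's actual namespace.
uri = URI.parse("http://example.org/vocab/concept/#Abbreviations")

puts uri.path      # /vocab/concept/
puts uri.fragment  # Abbreviations
```

The fragment does come last, and a path ending in a slash directly before the `#` still parses without complaint.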

mgbeyer commented 9 years ago

Thanks for the reply!

I don't know what you mean by "after filename"... what filename? Anyway, here's more detailed information about what we're trying to import (sorry, this is a bit lengthy :))

The (stripped-down) N-Triples file:

<http://lod.gesis.org/thesoz/classification/0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://lod.gesis.org/thesoz/classification/0> <http://www.w3.org/2004/02/skos/core#inScheme> <http://lod.gesis.org/thesoz/> .
<http://lod.gesis.org/thesoz/classification/0> <http://www.w3.org/2004/02/skos/core#prefLabel> "Grundlagen der Sozialwissenschaften\u00A00"@de .
<http://lod.gesis.org/thesoz/classification/0> <http://www.w3.org/2004/02/skos/core#prefLabel> "Fundamentals of the Social Sciences\u00A00"@en .
<http://lod.gesis.org/thesoz/classification/0> <http://www.w3.org/2004/02/skos/core#prefLabel> "'fondements des sciences sociales\u00A00"@fr .
<http://lod.gesis.org/thesoz/classification/0> <http://www.w3.org/2004/02/skos/core#notation> "0"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://lod.gesis.org/thesoz/classification/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://lod.gesis.org/thesoz/classification/1> <http://www.w3.org/2004/02/skos/core#inScheme> <http://lod.gesis.org/thesoz/> .
<http://lod.gesis.org/thesoz/classification/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Grundlagen der Sozialwissenschaften\u00A00"@de .
<http://lod.gesis.org/thesoz/classification/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Fundamentals of the Social Sciences\u00A00"@en .
<http://lod.gesis.org/thesoz/classification/1> <http://www.w3.org/2004/02/skos/core#prefLabel> "'fondements des sciences sociales\u00A00"@fr .
<http://lod.gesis.org/thesoz/classification/1> <http://www.w3.org/2004/02/skos/core#notation> "0"^^<http://www.w3.org/2001/XMLSchema#string> .

What seems to be the problem

We're using NAMESPACE='http://lod.gesis.org/thesoz/' as the default, so the remaining subjects still contain a slash (like "classification/0"). I'm aware that if we extend the namespace to "http://lod.gesis.org/thesoz/classification/", we end up with subjects that start with a number, which the importer also rejects for reasons that are unclear to me (see the validator method in the Origin class (/app/aides/origin.rb)). So basically we're talking about this code fragment in the validator method of the Origin class:

    # should not start with a number
    valid = false if initial_value.match(/^\d.*/)

    # should not contain special chars
    valid = false if CGI.escape(initial_value) != initial_value
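
To make the effect of these two checks concrete, here is a minimal standalone sketch. It re-implements only the two lines quoted above; `strip_namespace` and `origin_problems` are hypothetical helper names invented for this illustration and are not iQvoc's actual API.

```ruby
require "cgi"

# Hypothetical helper (not part of iQvoc): removes the default namespace
# prefix from a subject URI, roughly as the importer is described to do.
def strip_namespace(uri, namespace)
  uri.start_with?(namespace) ? uri[namespace.length..-1] : uri
end

# Hypothetical helper: applies only the two checks quoted above and
# reports which of them a candidate origin fails.
def origin_problems(value)
  problems = []
  problems << "starts with a number" if value.match(/^\d.*/)
  problems << "contains special chars" if CGI.escape(value) != value
  problems
end

NAMESPACE = "http://lod.gesis.org/thesoz/"
subject = "http://lod.gesis.org/thesoz/classification/0"

origin = strip_namespace(subject, NAMESPACE)
puts origin                # classification/0
puts CGI.escape(origin)    # classification%2F0
p origin_problems(origin)  # ["contains special chars"]
p origin_problems("0")     # ["starts with a number"]
```

Because `CGI.escape` percent-encodes the slash, the escaped string differs from the input, so "classification/0" fails the second check. With the longer namespace, the remaining "0" fails the first check instead.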

Ok, now here's the output:

I, [2015-07-16T11:44:58.282643 #14596]  INFO -- : Known namespaces:
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    1: skos: => http://www.w3.org/2004/02/skos/core#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    2: skos: => http://www.w3.org/2008/05/skos#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    3: rdf: => http://www.w3.org/1999/02/22-rdf-syntax-ns#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    4: : => http://lod.gesis.org/thesoz/
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    5: rdfs: => http://www.w3.org/2000/01/rdf-schema#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    6: owl: => http://www.w3.org/2002/07/owl#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    7: dct: => http://purl.org/dc/terms/
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    8: foaf: => http://xmlns.com/foaf/spec/
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    9: void: => http://rdfs.org/ns/void#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    10: iqvoc: => http://try.iqvoc.net/schema#
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- : Known first level classes:
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    1: skos:Concept => Concept::SKOS::Base
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    2: skos:Collection => Collection::SKOS::Unordered
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- : Known second level classes:
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    1: skos:prefLabel => Labeling::SKOS::PrefLabel
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    2: skos:altLabel => Labeling::SKOS::AltLabel
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    3: skos:changeNote => Note::SKOS::ChangeNote
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    4: skos:definition => Note::SKOS::Definition
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    5: skos:editorialNote => Note::SKOS::EditorialNote
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    6: skos:example => Note::SKOS::Example
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    7: skos:historyNote => Note::SKOS::HistoryNote
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    8: skos:scopeNote => Note::SKOS::ScopeNote
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    9: skos:related => Concept::Relation::SKOS::Related
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    10: skos:broader => Concept::Relation::SKOS::Broader::Mono
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    11: skos:narrower => Concept::Relation::SKOS::Narrower::Base
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    12: skos:closeMatch => Match::SKOS::CloseMatch
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    13: skos:exactMatch => Match::SKOS::ExactMatch
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    14: skos:relatedMatch => Match::SKOS::RelatedMatch
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    15: skos:broadMatch => Match::SKOS::BroadMatch
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    16: skos:narrowMatch => Match::SKOS::NarrowMatch
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    17: skos:notation => Notation::Base
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    18: skos:topConceptOf => Concept::SKOS::Scheme
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- :    19: skos:member => Collection::Member::SKOS::Base
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- : default namespace: 'http://lod.gesis.org/thesoz/'
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- : publish: 'true'
I, [2015-07-16T11:44:58.282643 #14596]  INFO -- : SkosImporter: Importing triples...
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/0 rdf:type skos:Concept
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/0 skos:inScheme :
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/0 skos:prefLabel "Grundlagen der Sozialwissenschaften\u00A00"@de
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/0 skos:prefLabel "Fundamentals of the Social Sciences\u00A00"@en
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/0 skos:prefLabel "'fondements des sciences sociales\u00A00"@fr
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/0 skos:notation "0"^^<http://www.w3.org/2001/XMLSchema#string>
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/1 rdf:type skos:Concept
W, [2015-07-16T11:44:58.292643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/1 skos:inScheme :
W, [2015-07-16T11:44:58.302643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/1 skos:prefLabel "Grundlagen der Sozialwissenschaften\u00A00"@de
W, [2015-07-16T11:44:58.302643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/1 skos:prefLabel "Fundamentals of the Social Sciences\u00A00"@en
W, [2015-07-16T11:44:58.302643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/1 skos:prefLabel "'fondements des sciences sociales\u00A00"@fr
W, [2015-07-16T11:44:58.302643 #14596]  WARN -- : SkosImporter: Invalid origin. Skipping :classification/1 skos:notation "0"^^<http://www.w3.org/2001/XMLSchema#string>
I, [2015-07-16T11:44:58.302643 #14596]  INFO -- : Computing 'forward' defined triples...
I, [2015-07-16T11:44:58.302643 #14596]  INFO -- : Basic import done (took 0 seconds).
I, [2015-07-16T11:44:58.302643 #14596]  INFO -- : Publishing 0 new subjects...
I, [2015-07-16T11:44:58.302643 #14596]  INFO -- : Publishing of 0 subjects done (took 0 seconds). 0 are in draft state.
I, [2015-07-16T11:44:58.302643 #14596]  INFO -- : Imported 0 published and 0 draft subjects in 0 seconds.
I, [2015-07-16T11:44:58.302643 #14596]  INFO -- : First step took 0 seconds, publishing took 0 seconds.

As I said: lengthy as hell, sorry :-) But I guess it'll help to clarify the problem...

mjansing commented 9 years ago

Thanks. I updated your comment with some formatting options. I'll check that.

mjansing commented 9 years ago

BTW

...we're facing subjects, starting with a number, which is also not approved by the importer for reasons unclear...

Origins should not start with a number so that iQvoc can generate a valid RDF/XML serialization. See the RDF syntax grammar for details.
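
As a rough illustration of that constraint: XML names (NCNames), which RDF/XML uses for element names and `rdf:ID` values, may not begin with a digit. The sketch below assumes origins end up in such a position. The regular expression is a simplified, ASCII-only approximation of the NCName rule, and `ncname_like?` is a hypothetical helper, not iQvoc code.

```ruby
# Simplified, ASCII-only approximation of the XML NCName rule; the real
# grammar also admits many non-ASCII letters.
NCNAME_ASCII = /\A[A-Za-z_][A-Za-z0-9._-]*\z/

# Hypothetical helper: true if the value could serve as an (ASCII) XML name.
def ncname_like?(value)
  NCNAME_ASCII.match?(value)
end

p ncname_like?("Abbreviations")  # true
p ncname_like?("_0")             # true
p ncname_like?("0")              # false
```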