XML namespaces for XPath

DylanVanAssche commented 3 years ago

XPath allows the use of XML namespaces when selecting parts of an XML document. However, most implementations require these namespaces to be registered before an XPath query is executed. RML currently does not specify how this should happen:

In the mapping rules?
By the implementation, with a CLI parameter or dynamically by parsing the XML document first to find any namespaces
...
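
To make the problem concrete, here is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree as a stand-in XPath engine (it only supports a subset of XPath, and other engines may behave differently). The sample document and book titles are made up for illustration. The same prefixed path is rejected without a prefix map and works once the namespace is registered:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document using the namespace from this issue.
XML = """<ex:bookstore xmlns:ex="http://www.example.com/books/1.0/">
  <ex:book><ex:title>RML in Practice</ex:title></ex:book>
  <ex:book><ex:title>XPath Basics</ex:title></ex:book>
</ex:bookstore>"""

root = ET.fromstring(XML)


def titles(namespaces=None):
    """Return the book titles, or None if the engine rejects the 'ex' prefix."""
    try:
        return [t.text for t in root.findall("./ex:book/ex:title", namespaces)]
    except SyntaxError:
        # ElementTree raises SyntaxError when a prefix is missing from the map.
        return None


without_map = titles()
with_map = titles({"ex": "http://www.example.com/books/1.0/"})
print(without_map)
print(with_map)
```

The prefix declared inside the XML document is not enough here: the engine needs its own prefix-to-namespace map, which is the information RML currently has no place for.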

CARML has an extension for this: https://github.com/carml/carml#xml-namespace-extension and it came up in the past already a few times without a clear solution:

dachafra commented 2 years ago

@DylanVanAssche is this more a challenge or a "best practice" than a pure problem with the RML spec? Shall we transfer the issue?

DylanVanAssche commented 2 years ago

@dachafra For me, it is a spec thing because it is related to the rml:iterator. Maybe a Literal is insufficient here?

dachafra commented 2 years ago

@DylanVanAssche So... given the proposal from CARML as well, it is more related to the Logical Source, right? Shall we transfer it to that spec?

DylanVanAssche commented 2 years ago

True! Fine for transferring it!

DylanVanAssche commented 2 years ago

@pmaria I like the CARML approach for this issue:

rml:logicalSource [
    rml:source [
      a carml:Stream ;
      # or in case of a file source use:
      # carml:url "path-to-source" ;
      carml:declaresNamespace [
        carml:namespacePrefix "ex" ;
        carml:namespaceName "http://www.example.com/books/1.0/" ;
      ] ;
    ] ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/ex:bookstore/*" ;
  ] ;

What do you think of using this?

rml:logicalSource [
  rml:source [
    # Any kind of source
  ] ;
  rml:iterator [ a ql:XPathIterator, rml:Iterator;
    rml:namespaceName "http://www.example.com/books/1.0/" ;
    rml:namespacePrefix "ex" ;
    rml:value "/ex:bookstore/*";
  ];
]

Changes:

Make the iterator an object instead of literal, drop rml:referenceFormulation
Move namespaces to the iterator, especially the XPath iterator
For JSON, CSV, etc. we would have the same, just not the namespace stuff
If a reference formulation X appears in the future that is totally different and also needs something like the namespaces, we can support it.

pmaria commented 2 years ago

Hmm I'm not sure the iterator is the most natural place to define the namespaces. Since you also want to be able to use these namespaces in non-iterator expressions.

DylanVanAssche commented 2 years ago

@pmaria

Hmm I'm not sure the iterator is the most natural place to define the namespaces. Since you also want to be able to use these namespaces in non-iterator expressions.

When you use rml:reference, rr:column, rr:template, etc. you take the rml:iterator value, append the value of one of these references to retrieve what you need in a Triples Map. That's why I found it a better fit there: it is specific to the reference formulation and iterator. rml:source only defines how a source should be accessed, such as its location. Because of that, I would keep the namespace declaration away from it, since those namespaces are only used for executing the iterator and references during data processing, after the data was retrieved from the source.

pmaria commented 2 years ago

When you use rml:reference, rr:column, rr:template, etc. you take the rml:iterator value, append the value of one of these references to retrieve what you need in a Triples Map.

Ah I don't see it that way necessarily. I see the rml:iterator, rml:reference, rr:template conceptually operating within the same scope/context. Wherein indeed, the iterator creates an iteration of sub documents on which the other expressions are evaluated. But I see the iterator as just another expression.

But I agree that source might not be the best place for the NS definition, because it is essentially a query concern, and the namespaces don't need to match the namespaces used in a source document.
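
A small sketch of this point, under the assumption that "don't need to match" refers to the prefixes: the hypothetical document below uses a default namespace with no prefix at all, while the query side binds "ex" to the same namespace name. The iterator and the reference are evaluated with the same prefix map, which is how I picture them sharing one scope. It uses Python's xml.etree.ElementTree as a stand-in engine, and the titles are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical source: default namespace, so no "ex" prefix appears in it.
XML = """<bookstore xmlns="http://www.example.com/books/1.0/">
  <book><title>RML in Practice</title></book>
  <book><title>XPath Basics</title></book>
</bookstore>"""

# Query-side declaration: the prefix is chosen by the mapping author.
NS = {"ex": "http://www.example.com/books/1.0/"}

root = ET.fromstring(XML)

# Iterator expression: one record per book.
records = root.findall("./ex:book", NS)

# Reference expression, evaluated per record with the same prefix map.
titles = [record.findtext("ex:title", namespaces=NS) for record in records]
print(titles)
```

Only the namespace name has to agree between document and query; the prefix is purely a query concern.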

Maybe it makes more sense then to add a new object to the logical source, next to the iterator? Similar to your idea, but keeping iterator as is, i.e. as just another expression.

Something like rml:ExpressionContext.

rml:logicalSource [
  rml:source [
    # Any kind of source
  ] ;
  rml:iterator "/ex:bookstore/*" ;
  rml:expressionContext [ a rml:XPathExpressionContext;
    rml:namespace [
       rml:namespaceName "http://www.example.com/books/1.0/" ;
       rml:namespacePrefix "ex" ;
    ];
  ] ;
  rml:referenceFormulation ql:XPath;
]

We could possibly combine it with the reference formulation? The rationale would be that this defines how to interpret the expressions that are based on a logical source.

pmaria commented 2 years ago

So combining it with reference formulations could look like

rml:logicalSource [
  rml:source [
    # Any kind of source
  ] ;
  rml:iterator "/ex:bookstore/*" ;
  rml:referenceFormulation [ a ql:XPathReferenceFormulation;
    ql:namespace [
       ql:namespaceName "http://www.example.com/books/1.0/" ;
       ql:namespacePrefix "ex" ;
    ] ;
  ] ;
]

This would be a custom specified XPath reference formulation, next to the "default" ql:XPath.

DylanVanAssche commented 2 years ago

Ah I don't see it that way necessarily. I see the rml:iterator, rml:reference, rr:template conceptually operating within the same scope/context. Wherein indeed, the iterator creates an iteration of sub documents on which the other expressions are evaluated. But I see the iterator as just another expression.

Ah depends on how you implement the spec :) Some implementations do not create subdocuments. However, I agree with you :)

But I agree that source might not be the best place for the NS definition, because it is essentially a query concern, and the namespaces don't need to match the namespaces used in a source document.

Yes! I try to separate the concerns as much as possible so it is also reusable in the future.

rml:referenceFormulation definition:

The reference formulation (rml:referenceFormulation) defines the reference formulation used to refer to the elements of the data source. The reference formulation must be specified in the case of databases and XML and JSON data sources. By default SQL2008 for databases, as SQL2008 is the default for R2RML, XPath for XML and JSONPath for JSON data sources.

According to the definition, the last suggestion looks better to me. Are we aware of something similar for other reference formulations?

This would be a custom specified XPath reference formulation, next to the "default" ql:XPath.

Ideally, we don't even need that and have 1 IRI for both (with and without namespaces), but I'm not sure how to achieve that in RDF? Properties can be optional, but if you have none, it becomes something weird like this:

rml:referenceFormulation [ a ql:XPathReferenceFormulation; ] ;

We could 'solve' this by having shortcuts:

rml:referenceFormulation ql:XPath;

This shortcut points to [ a ql:XPathReferenceFormulation; ]. I think this is what you meant above with the "default"?

pmaria commented 2 years ago

We could 'solve' this by having shortcuts:
rml:referenceFormulation ql:XPath;

Yes. I see that rml:ReferenceFormulation is already defined in the RML ontology.

rml:referenceFormulation rdfs:range rml:ReferenceFormulation .

rml:ReferenceFormulation rdf:type owl:Class ;
    rdfs:label   "Reference Formulation" ;
    rdfs:comment "Represents a Reference Formulation."@en .

And also defined is

ql:XPath rdf:type owl:NamedIndividual, rml:ReferenceFormulation  ;
    rdfs:label   "XPath" ; 
    rdfs:comment "Denotes the XPath reference formulation, used for referring to extracts of XML sources."@en ;
    ql:specification <http://www.w3.org/TR/xpath20/> ;
    rml:version "2.0".

So essentially the "shortcut" is just using the named individual.

Now all we would have to do is introduce a subclass of rml:ReferenceFormulation, rml:XPathReferenceFormulation, and define that further, adding namespace properties.

I don't think we should introduce a new named individual for XPath with namespaces. This would limit the namespaces you could define, since the individual's scope would be global. And you might want to define different namespaces per logical source.
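
To illustrate the scoping argument, here is a rough sketch in Python of how an engine might model this. The class names XPathReferenceFormulation and LogicalSource, the method values, and the magazines namespace are all hypothetical and only loosely mirror the proposed vocabulary; xml.etree.ElementTree stands in for the XPath engine. Each logical source carries its own reference formulation, so the same prefix can be bound to different namespace names:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class XPathReferenceFormulation:
    # prefix -> namespace name, scoped to one logical source (hypothetical model)
    namespaces: dict = field(default_factory=dict)


@dataclass
class LogicalSource:
    xml: str
    iterator: str
    reference_formulation: XPathReferenceFormulation = field(
        default_factory=XPathReferenceFormulation
    )

    def values(self, reference):
        """Evaluate a reference on every record produced by the iterator."""
        ns = self.reference_formulation.namespaces
        root = ET.fromstring(self.xml)
        return [rec.findtext(reference, namespaces=ns)
                for rec in root.findall(self.iterator, ns)]


books = LogicalSource(
    xml='<b:bookstore xmlns:b="http://www.example.com/books/1.0/">'
        '<b:book><b:title>XPath Basics</b:title></b:book></b:bookstore>',
    iterator="./ex:book",
    reference_formulation=XPathReferenceFormulation(
        {"ex": "http://www.example.com/books/1.0/"}),
)

# Assumed second source: same prefix "ex", different namespace name.
magazines = LogicalSource(
    xml='<m:stand xmlns:m="http://www.example.com/magazines/1.0/">'
        '<m:issue><m:title>KG Monthly</m:title></m:issue></m:stand>',
    iterator="./ex:issue",
    reference_formulation=XPathReferenceFormulation(
        {"ex": "http://www.example.com/magazines/1.0/"}),
)

print(books.values("ex:title"))
print(magazines.values("ex:title"))
```

With a single global named individual, the two bindings of "ex" above would clash; as separate instances they do not.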

DylanVanAssche commented 2 years ago

@pmaria Alright! I agree, let's set up our battle plan then for this issue:

Introduce rml:XPathReferenceFormulation
Define ql:namespaceName and ql:namespacePrefix in there

Problem solved then?

pmaria commented 2 years ago

Yes I think so 🎉

Not forgetting ql:namespace to specify one or more ql:Namespaces

chrdebru commented 2 years ago

Why put namespace URIs in literals rather than using resources?

DylanVanAssche commented 2 years ago

@chrdebru

Why put namespace URIs in literals rather than using resources?

Spec: https://www.w3.org/TR/xml-names/

URI references identifying namespaces are compared when determining whether a name belongs to a given namespace, and whether two names belong to the same namespace. [Definition: The two URIs are treated as strings, and they are identical if and only if the strings are identical, that is, if they are the same sequence of characters.] The comparison is case-sensitive, and no %-escaping is done or undone.

AFAIK, XML Namespaces are not like Linked Data and are compared through a string-based comparison without any resolving. That's why they are a Literal here, but any insights are welcome!
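
A quick check of the string-comparison rule, sketched with Python's xml.etree.ElementTree (one engine, chosen only for illustration; the helper count_books is made up). Variants that would be equivalent after URI normalization are treated as different namespace names:

```python
import xml.etree.ElementTree as ET

XML = ('<ex:bookstore xmlns:ex="http://www.example.com/books/1.0/">'
       '<ex:book/></ex:bookstore>')
root = ET.fromstring(XML)


def count_books(namespace_name):
    """Count ex:book children when 'ex' is bound to the given namespace name."""
    return len(root.findall("./ex:book", {"ex": namespace_name}))


exact = count_books("http://www.example.com/books/1.0/")
upper_host = count_books("http://WWW.EXAMPLE.COM/books/1.0/")  # case differs
escaped = count_books("http://www.example.com/books/1%2E0/")   # "." escaped as %2E
print(exact, upper_host, escaped)
```

Only the character-for-character identical name matches, which is why a plain string seemed like the closest fit to me.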

chrdebru commented 2 years ago

Yes, but they can also be regarded as named resources that can be described (no matter whether they dereference and resolve). Having those as resources would facilitate writing SPARQL queries and inverse property paths, for instance. Just a thought, not questioning the proposal.

I would suggest renaming ql:namespaceName to namespaceIRI. Some namespaces have titles and a namespace contains names. Turtle mentions this: "The '@prefix' or 'PREFIX' directive associates a prefix label with an IRI".

DylanVanAssche commented 2 years ago

Yes, but they can also be regarded as named resources that can be described (no matter whether they dereference and resolve). Having those as resources would facilitate writing SPARQL queries and inverse property paths, for instance. Just a thought, not questioning the proposal.

I don't have much experience in that regard, so if it helps, I don't mind :) For me, it doesn't really matter as long as we have a mapping prefix <-> IRI

I would suggest renaming ql:namespaceName to namespaceIRI. Some namespaces have titles and a namespace contains names. Turtle mentions this: "The '@prefix' or 'PREFIX' directive associates a prefix label with an IRI".

Hmmm true, twice 'name' might be a bit weird :) @pmaria Do you agree on this?

pmaria commented 2 years ago

Namespace name is what the spec calls it https://www.w3.org/TR/xml-names/#dt-NSName, so I would stick to that.

As far as I can tell we can't simply use IRIs, because XML expects URIs.

The main use case is to register the namespaces with an XPath engine for querying. Most implementations I've seen represent the namespace name as a string.

My feeling is that keeping it a string would be the more natural mapping to implementations, but if the arguments for using an IRI are strong I can live with that. We would however have to specify what happens when an IRI that is not a URI is used.
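
To show why that case needs specifying, here is a simplified sketch in Python. The function iri_to_uri and the "bücher" namespace are hypothetical, and the conversion only percent-encodes non-ASCII characters as UTF-8; it ignores internationalized host names, so it is not a complete IRI-to-URI mapping. The point is that the converted URI is a different string from the IRI:

```python
from urllib.parse import quote

# URI reserved characters plus "%" are left untouched; unreserved
# characters (letters, digits, "-", ".", "_", "~") are never quoted anyway.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8 (simplified sketch)."""
    return quote(iri, safe=URI_SAFE)


iri = "http://www.example.com/bücher/1.0/"  # hypothetical namespace IRI
uri = iri_to_uri(iri)
print(uri)
print(iri == uri)
```

Since namespace names are compared as strings, an engine registering the IRI and one registering the converted URI would be registering two different namespaces.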

andimou commented 2 years ago

I don't disagree with @chrdebru, but we can just as well keep it as ql:namespace. Whether it is an IRI/URI or a Literal can be determined from the range; we don't need to include it in the name of the property.

Then again, if we include the restrictions in SHACL shapes, then we can decide at shape level whether it's a string or an IRI. There we can even provide 2 alternatives with 2 different explanations.

andimou commented 2 years ago

Another thought, I debate myself. Newer libraries might read the namespaces from the file, would we still want to give the option to define the namespaces?

pmaria commented 2 years ago

Another thought, I debate myself. Newer libraries might read the namespaces from the file, would we still want to give the option to define the namespaces?

In my experience this is not that trivial, especially in non-DOM based approaches, e.g. a streaming implementation. Namespaces can be defined inline in a document, so in theory a new namespace can be declared and used at the end of a document.

I have a strong preference to be able to declare this in the mapping. Tools can always also by default provide namespace detection as a service if it fits their architecture.
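
A small sketch of the streaming case, assuming Python's xml.etree.ElementTree.iterparse as the streaming parser and a made-up document and namespace. The namespace declaration only becomes visible when the parser reaches the element that carries it, after earlier elements have already been seen:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: the only namespace is declared on the last child.
XML = b"""<catalog>
  <item>first</item>
  <late:item xmlns:late="http://www.example.com/late/1.0/">second</late:item>
</catalog>"""

events = []
for event, payload in ET.iterparse(io.BytesIO(XML), events=("start", "start-ns")):
    if event == "start-ns":
        prefix, uri = payload  # namespace declaration encountered in the stream
        events.append(("namespace", prefix, uri))
    else:
        events.append(("element", payload.tag))

for entry in events:
    print(entry)
```

A streaming processor that relies on detection would have to read up to that point, in the worst case the whole document, before it knows all namespaces.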

DylanVanAssche commented 2 years ago

I agree with @pmaria, extracting the XML namespaces is non-trivial and may require consuming all XML first before any mapping can take place.

Then again, if we include the restrictions in SHACL shapes, then we can decide at shape level whether it's a string or an IRI. There we can even provide 2 alternatives with 2 different explanations.

SHACL can have an OR statement, but maybe to keep things straightforward we should have either a string or IRI, but not both?

chrdebru commented 2 years ago

@pmaria if they call them namespace names, then OK!

@DylanVanAssche XML namespaces are declared in attributes (strings) in XML. So maybe that definition comes from their technical constraints. The advantage of IRIs is that "sameness" is implied when reused, whereas now you have to explicitly state that two namespace objects (if you can call them that) are the same, or infer it by comparing strings. So IRIs may help us in cases where we have different prefixes for the same namespace (e.g., combining mappings).
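
With strings, the "infer by comparing" step could look roughly like this sketch in Python. The helper prefixes_by_namespace and the two prefix maps are made up for illustration; they stand for the declarations of two mappings being combined:

```python
from collections import defaultdict


def prefixes_by_namespace(*prefix_maps):
    """Group prefixes from several mappings by identical namespace-name string."""
    grouped = defaultdict(set)
    for prefix_map in prefix_maps:
        for prefix, name in prefix_map.items():
            grouped[name].add(prefix)
    return dict(grouped)


# Two hypothetical mappings using different prefixes for the same namespace.
mapping_a = {"ex": "http://www.example.com/books/1.0/"}
mapping_b = {"books": "http://www.example.com/books/1.0/"}

merged = prefixes_by_namespace(mapping_a, mapping_b)
print(merged)
```

With IRIs as resources, this grouping would come for free from the graph, since both declarations would point at the same node.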

DylanVanAssche commented 2 years ago

@chrdebru I don't have a specific preference, except that I prefer either strings or IRIs, just not both ;)

kg-construct / rml-io

XML namespaces for XPath #4