ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML
Other
13 stars 9 forks source link

get_characters() not detecting the matrix in nexml file #125

Closed xu-hong closed 8 years ago

xu-hong commented 8 years ago

I tried the following code to parse the nexml (stored as test.xml) returned by the OntoTrace API provided by Phenoscape Knowledgebase:

nex <- nexml_read("https://raw.githubusercontent.com/xu-hong/rphenoscape/master/inst/examples/test.xml")
get_characters(nex)

But I was getting NULL as the matrix returned by get_characters(nex). The nexml file seems valid with matrix exists in it, but somehow the function was not able to correctly parse it.

hlapp commented 8 years ago

@cboettig or @sckott if you could look into this soon that'd be great. It's essentially a showstopper right now for using RNeXML for the RPhenoscape package.

sckott commented 8 years ago

I'll take a look

sckott commented 8 years ago

hmmm, the file doesn't validate on nexml.org http://www.nexml.org/ - doe that not matter? (or do via RNeXML::nexml_validate)

here https://github.com/xu-hong/rphenoscape/blob/master/inst/examples/test.xml#L35 you have xsi:type="ns:StandardCells", whereas the relevant fxn here is looking for nex:StandardCells https://github.com/ropensci/RNeXML/blob/master/R/get_characters.R#L24-L39

sckott commented 8 years ago

@xu-hong @hlapp okay, i pushed a change to a different branch, fixes for me locally, can test from there,

devtools::install_github("ropensci/RNeXML", ref = "sckott-char-fix")

@cboettig not sure if this change is the best one.

thoughts?

rvosa commented 8 years ago

In principle this ought to work: the namespace declaration specifies the prefix 'ns' as being bound to 'http://www.nexml.org/2009' so actually this is correct. In the R code the prefix 'nex' is hardcoded, when one should not actually make the assumption that that prefix is always used. Better would be to make a map of namespace URIs to prefixes (and back) so that foo:StandardCells also works as long as the declaration xmlns:foo=" http://www.nexml.org/2009" is inherited from any of the containing elements.

On Wed, Sep 23, 2015 at 1:37 AM, Scott Chamberlain <notifications@github.com

wrote:

hmmm, the file doesn't validate on nexml.org http://www.nexml.org/ - doe that not matter?

here https://github.com/xu-hong/rphenoscape/blob/master/inst/examples/test.xml#L35 you have xsi:type="ns:StandardCells", whereas the relevant fxn here is looking for nex:StandardCells https://github.com/ropensci/RNeXML/blob/master/R/get_characters.R#L24-L39

— Reply to this email directly or view it on GitHub https://github.com/ropensci/RNeXML/issues/125#issuecomment-142453429.

sckott commented 8 years ago

thanks @rvosa , looking in to mapping namespaces

sckott commented 8 years ago

Okay, updated namespace mapping https://github.com/ropensci/RNeXML/compare/master...sckott-char-fix - ns was missing

Also generalized get_characters_list() to not be hard-coded to a specific namespace -

all examples and tests pass, so I think this shouldn't break anything

devtools::install_github("ropensci/RNeXML", ref = "sckott-char-fix")

cboettig commented 8 years ago

Nice work Scott in tracking it down to the namespace issue. Looks like this is essentially a version of issue #51, where the problem is that the attribute values, as well as the attribute names, are namespaced. The XML package has native support for arbitrary namespace abbreviations in any XML fields (node names, attribute names, etc) but the attribute values are treated as strings and I'm not sure if there's a good generic fix for that. @rvosa how have you handled this in other languages? Do they just support namespaced values out of the box?

Hilmar suggested expanding all the prefix values using the full namespace definitions, presumably by just doing regex substitution on every single value field in the XML document, though I think implementing that in way that isn't too fragile would be tricky. (For instance, if I recall correctly XML permits omitting the namespace and inheriting it from a default, e.g. <state> instead of <nex:state> , and these can be inherited from parent elements as well as from the top-level, right?) So in general I'm not sure what the right way to solve this problem is. For the moment, I think Scott's solution is best, in which we recognize StandardCells etc regardless of the namespace to which it has been assigned.

sckott commented 8 years ago

@cboettig Okay, let let me know if you want me to PR that branch to master

cboettig commented 8 years ago

@sckott yup, go ahead with a PR.

Looking at @rvosa 's comments in the #51 thread, it looks like namespaces on values is a generic problem that probably isn't automatically handled by other NeXML parsers either, so IMHO your solution is best (e.g. less fragile than any of the alternatives suggested).

rvosa commented 8 years ago

Hilmar suggested https://github.com/ropensci/RNeXML/issues/51#issuecomment-48205335 expanding all the prefix values using the full namespace definitions, presumably by just doing regex substitution on every single value field in the XML document, though I think implementing that in way that isn't too fragile would be tricky. (For instance, if I recall correctly XML permits omitting the namespace and inheriting it from a default, e.g. instead of nex:state , and these can be inherited from parent elements as well as from the top-level, right?)

They can be inherited from any ancestral element that can override each other, the most recent ancestor taking precedence. Doing a global search and replace could therefore result in hard to track down bugs.

My sense is that it's really best to have an actual mapping such that from any focal element you can traverse towards the root and find the nearest ancestor that binds prefix 'foo' to a namespace.

What @hlapp suggests is basically precomputing this approach during an initial traversal so that in pre-order you read/update the mapping for 'foo' as you move towards the tips and then replace 'foo:bar' with '{ http://example.org}:bar' as you encounter it. When you look at how XML editing / processing tools internally represent namespaced elements and attributes, that's basically what they do (including the curly braces, even).

So in general I'm not sure what the right way to solve this problem is. For

the moment, I think Scott's solution is best, in which we recognize StandardCells etc regardless of the namespace to which it has been assigned.

That's probably going to work for 99% of all cases, at least until someone else comes up with the same name in a different namespace, at which point this will need to be addressed structurally. I'm actually not strongly opposed to pragmatically doing it this way if that means the current issues are fixed, as long as we're aware that it's quite possible we'll have problems down the line.

cboettig commented 8 years ago

Thanks Rutger!

What @hlapp suggests is basically precomputing this approach during an initial traversal so that in pre-order you read/update the mapping for 'foo' as you move towards the tips and then replace 'foo:bar' with '{ http://example.org}:bar' as you encounter it. When you look at how XML editing / processing tools internally represent namespaced elements and attributes, that's basically what they do (including the curly braces, even).

Right, but I guess what I'm not clear on here is how we are supposed to handle the case where no namespace is applied, and thus there is no : to split upon (or otherwise the : is not related to a namespace). e.g. one can have:

<characters xsi:type="ns:StandardCells" ...

instead of the more explicit:

<ns:characters xsi:type="ns:StandardCells" ..

So if we are playing by the rules of inheritance in xml, shouldn't it also be valid to have:

<characters xsi:type="StandardCells" ..

Where the ns is assumed for both? Or do we always namespace attribute values? If we can omit them, I'm not sure how to do the preprocessing, since I don't think we'd want to expand all attribute values to the default, since some of these are explicitly not namespaced, e.g. lengths, or symbol=1 shouldn't become symbol=http://www.nexml.org/2009:1 right? Happy to fix this but I just haven't figured out the implementation in my head because of these details.

That's probably going to work for 99% of all cases, at least until someone else comes up with the same name in a different namespace

Right. And Scott's implementation is specific to the xsi:type value of the <characters> element, so it would have to be a namespace that defined valid for that specific location, and not just any namespace, right? I get the value of namespaces in general but in this specific context it really feels like it is going to be much safer to treat the value of an attribute as a string.

As you commented in #51, it sounds like this case might be trouble for other NeXML parsers as well, which might hardwire a particular prefix rather than attempt to namespace the attribute values.

I've merged Scott's fix to master, so feel free to close if the current fix seems to address the issue.

rvosa commented 8 years ago

Hi Carl,

Right, but I guess what I'm not clear on here is how we are supposed to handle the case where no namespace is applied, and thus there is no : to split upon (or otherwise the : is not related to a namespace). e.g. one can have:

<characters xsi:type="ns:StandardCells" ...

instead of the more explicit:

<ns:characters xsi:type="ns:StandardCells" ..

In the former case, without a prefix, the default namespace of the focal context applies to the element. This namespace can be identified by traversing from the focal element to the root looking for the nearest xmlns="http://example.org", as opposed to xmlns:ns="http://example.org".

In most NeXML documents this will lead you back to the root element, which often has an xmlns="http://www.nexml.org/2009" on it. Of course for any given focal element only a single default namespace can be in effect. Multiple xmlns="" statements on the same enclosing element are invalid. If there are nested default namespace declarations, the most recent ancestor takes precedence.

So if we are playing by the rules of inheritance in xml, shouldn't it also be valid to have:

<characters xsi:type="StandardCells" ..

Nice try, and that would make total sense if in this case we were playing by the rules of namespace prefix inheritance in xml attribute values. Unfortunately, we've now left that particular rule book and are moving on to some other ones. In NeXML there are three separate situations where we have attribute values that have colons in them:

  1. the current case where we specify which schema subtype an element instance implements. This is done using these xsi:type statements, and the value must always be a QName, i.e. fully qualified, with prefix and colon. In implementing a mapping between prefixes and namespaces that, at least, should be some relief.
  2. the case where we specify what the datatype is of the content attribute's value of a meta element. For example, datatype="xsd:string". Here we also reference a schema (sub)type. Hence: must be a QName.
  3. property names in meta elements. In that case the rules of RDFa apply, i.e. the property name must be a CURIE (http://www.w3.org/TR/curie/), which nearly always are QNames, though the production rules of the W3C CURIE grammar imply that a CURIE may not have a prefix, but still have a colon (i.e. property=":foo"), or even omit the colon.

For our parsers case 1. is by far the most important, especially if we assume that our parsers don't actually try to interpret the semantics of property names or the data types of content values in meta elements. The fact that CURIEs are also intended to be used in non-XML documents (i.e. totally outside of the context of this namespace malarkey) suggests that we can more or less safely treat property names as string literals when internally representing meta annotations if all we care about is roundtripping.

Of course if you do want to do something with CURIEs and you want to be sure that "dc:title" is in fact the "title" predicate from the Dublin Core vocabulary, well, then you will have to start scanning for namespace prefixes in the ancestors.

So, in the simple case of us not actually trying to interpret the semantics of meta annotations:

Where the ns is assumed for both? Or do we always namespace attribute values? If we can omit them, I'm not sure how to do the preprocessing, since I don't think we'd want to expand all attribute values to the default, since some of these are explicitly not namespaced, e.g. lengths, or symbol=1 shouldn't become symbol=http://www.nexml.org/2009:1 right? Happy to fix this but I just haven't figured out the implementation in my head because of these details.

The number 1 is a simple literal, so it wouldn't really be namespaced. By a pretty poor analogy you can think of this along similar lines as the primitives in Java (int, char, etc.) versus objects. For objects, their behaviour is determined by the class they were instantiated from and which therefore are sort-of namespaced in that class, whereas primitives are not. I know this doesn't help that much - it just illustrates that both XML schema as well as Java are pretty inconsistent in similar ways.

That's probably going to work for 99% of all cases, at least until someone else comes up with the same name in a different namespace

Right. And Scott's implementation is specific to the xsi:type value of the element, so it would have to be a namespace that defined valid for that specific location, and not just any namespace, right? I get the value of namespaces in general but in this specific context it really feels like it is going to be much safer to treat the value of an attribute as a string.

Well, maybe. The xsi:type pops up in a number of places, one of which specifies the subtype that a character state matrix implements. The other two cases are that it occurs on tree elements (in practice this is basically to specify whether branch lengths are integers or reals) and on meta elements (as noted above).

In all three cases, some kind of branching or dispatching depending on the specific subtype is probably going to have to be done (maybe not with the number types for branch lengths?) so it makes sense to think for a bit about a reasonably generic approach that makes as little assumptions about usually-but-not-always-hardcoded prefixes as possible.

As you commented in #51 https://github.com/ropensci/RNeXML/issues/51, it sounds like this case might be trouble for other NeXML parsers as well, which might hardwire a particular prefix rather than attempt to namespace the attribute values.

I've merged Scott's fix to master, so feel free to close if the current fix seems to address the issue.

cboettig commented 8 years ago

@rvosa ah ha! Right, of course I had forgotten that these types were constrained by the schema to begin with, so that makes perfect sense. Thanks much for the detailed reply, that does clear things up for me.

hlapp commented 8 years ago

hmmm, the file doesn't validate on nexml.org http://www.nexml.org/ - doe that not matter? (or do via RNeXML::nexml_validate)

Just for the record, the file had to be modified from the original returned by the Phenoscape API, because otherwise RNeXML throws an error when loading it. See #124.

hlapp commented 8 years ago

@cboettig not sure if this change is the best one.

  • I think 0340607?w=1#diff-0a17c9effc561b26fb9890f0d9fcec7cR214 makes sense, but
  • not sure if 0340607?w=1#diff-0a17c9effc561b26fb9890f0d9fcec7cR24 and
  • 0340607?w=1#diff-0a17c9effc561b26fb9890f0d9fcec7cR30 is the best solution,

thoughts?

Perhaps this will help in the short-term with the output the Phenoscape API is returning (@balhoff - what are your plans to make this less verbose and more compliant with prefix conventions ? See #124) but I'm assuming we all understand that this is only a temporary fix?

cboettig commented 8 years ago

@hlapp Correct, this is a temporary fix for the problem described in the title of this issue thread. The deeper is issue has already been logged as issue #51, as noted above, which still awaits a proper solution. As highlighted in #51, this issue may be impacting other nexml parsers as well, since most XML libraries, e.g. libxml2 on which XML is built, deal with the namespaces of XML elements automatically but don't have a native implementation for namespaces of attributes; even though as Rutger clarified for me that use is well-defined in via the schema file.

I feel the long term issue here would be far better off if it was addressed by more fundamental XML/XML schema infrastructure to handle namespaces on attributes rather than just patched for RNeXML alone, but I don't think I'm in a good position presently to do this. Maybe Duncan's XMLSchema package or something else will one day provide us the best tools to handle this case.

balhoff commented 8 years ago

@rvosa I was reading back through this thread and want to question one point. Above where you say

  1. the current case where we specify which schema subtype an element instance implements. This is done using these xsi:type statements, and the value must always be a QName, i.e. fully qualified, with prefix and colon. In implementing a mapping between prefixes and namespaces that, at least, should be some relief.

I don't think that is right. A QName does not have to contain a colon. If it has no prefix, it follows the default namespace. You can see this here: http://www.w3.org/TR/1999/REC-xml-names-19990114/#NT-QName

I also tested this in Oxygen and it validates. So you don't need to prefix the values for xsi:type="ResourceMeta".

rvosa commented 8 years ago

Yeah, our progressing insight has indeed led us to conclude that a prefix is NOT required, i.e. xsi:type="..." can be specified without it. You're right, and I wasn't.

balhoff commented 8 years ago

Every time an issue comes up with this stuff I end up slowly re-learning all the details that I had previously figured out and forgotten.

cboettig commented 8 years ago

I think this is also fixed now since merging #133