phenoscape / scate

Project repo for Enabling Machine-actionable Semantics for Comparative Analysis of Trait Evolution (SCATE)
http://scate.phenoscape.org
0 stars 0 forks source link

ontotrace nexml validation #8

Open uyedaj opened 5 years ago

uyedaj commented 5 years ago

@balhoff When I run the nexml files that Wasila sent me (https://github.com/phenoscape/scate/tree/master/data) through ontotrace through the validator (http://www.nexml.org/phylows/validator) I get the following error (here Malabarba-1998_Ontotrace.xml). Looks like the issue is the polymorphism again?

going to read file 'Malabarba-1998_Ontotrace.xml'
reading from handle
created temporary file '/tmp/o3Bz65Cow4'
copied uploaded data to '/tmp/o3Bz65Cow4'
read file 'Malabarba-1998_Ontotrace.xml', copied contents to '/tmp/o3Bz65Cow4'
created java validator invocation
executing java validator: java -classpath /var/www/html/phylows/../downloads/validator.jar -Dxml=/tmp/o3Bz65Cow4 -Dxsd=/var/www/html/phylows/../xsd/nexml.xsd -Dns=http://www.nexml.org/2009 validator.XmlValidator
executing perl validator
initializing Bio::Phylo::Parsers::Nexml=HASH(0x1c155a0)
going to parse xml
processed <otus id="tfc794139-a39f-4f2a-aa2e-b611ac9f9166"/> (line 71)
processed <otu id="VTO_0065575"/> (line 71)
processed <otu id="VTO_0038118"/> (line 71)
processed <otu id="VTO_0038133"/> (line 71)
processed <otu id="VTO_0038106"/> (line 71)
processed <otu id="VTO_0038138"/> (line 71)
processed <otu id="VTO_0065893"/> (line 71)
processed <otu id="VTO_0065636"/> (line 71)
processed <otu id="VTO_0065788"/> (line 71)
processed <otu id="VTO_0038109"/> (line 71)
processed <otu id="VTO_0065571"/> (line 71)
processed <otu id="VTO_0065877"/> (line 71)
processed <otu id="VTO_0038108"/> (line 71)
processed <otu id="VTO_0065585"/> (line 71)
processed <otu id="VTO_0065747"/> (line 71)
processed <otu id="VTO_0038128"/> (line 71)
processed <otu id="VTO_0065095"/> (line 71)
processed <otu id="VTO_0038140"/> (line 71)
processed <otu id="VTO_0038135"/> (line 71)
processed <otu id="VTO_0038120"/> (line 71)
processed <otu id="VTO_0038111"/> (line 71)
processed <otu id="VTO_0038124"/> (line 71)
Processed block id: tfc794139-a39f-4f2a-aa2e-b611ac9f9166 (line 71)
going to parse characters element (line 236)
processed <characters id="cc4104afe-da7a-43e7-a1c5-cb4a40fc9161"/> (line 236)
processed <row id="r57d9b9d3-75bc-498e-a165-60330c7ecabd"/> (line 236)
set char: '1 1 and 0 1 0' (line 236)
Invalid data for row Datum49 (type Standard: 1 1 and 0 1 0)
Invalid data for row Datum49 (type Standard: 1 1 and 0 1 0)
balhoff commented 5 years ago

I think the issue is the polymorphism. It probably doesn't like that the state symbol for the polymorphic condition is 1 and 0. But the schema says that is okay.

balhoff commented 5 years ago

I’ll try to figure out if I can find where this is getting checked in the Perl.

balhoff commented 5 years ago

@rvosa does the Bio::Phylo NeXML validator handle polymorphic states? It seems to be complaining about a string character symbol.

balhoff commented 5 years ago

@uyedaj if you replace the polymorphism symbols with a number, e.g. "10", it validates. I believe this is an error in the validator. For now you can replace

symbol="1 and 0" or symbol="0 and 1"

with

symbol="2"

Will this work? Additionally you will want to make sure RNeXML is handling polymorphisms, if you are using that.

hlapp commented 5 years ago

As mentioned at last meeting, based on ropensci/RNeXML#171 I think right now polymorphic characters whose symbols aren't numeric are all going to run into the same issue.

What would be good to have while looking at possible solutions is the format that polymorphic character states normally take in matrices. Is it an extra numeric symbol (e.g., 2), or some concatenation of possible symbols (e.g., 1 and 0, or 1, 0, or 10?). I.e., what would most comparative methods code in R expect such as symbol to look like?

uyedaj commented 5 years ago

Most methods can't really handle polymorphic characters... DNA sequence data has dedicated characters for ambiguous bases, while I think the standard for binary characters is generally just to put a '?'. I think a lot of the nexus parsers will have issues if it's anything longer than a single character, so for now it seems like an extra numeric symbol would be a good option.

balhoff commented 5 years ago

@hlapp do you think I should update OntoTrace to choose a random unused digit (for presence/absence, of course this can be 2) to just make this work? As you might expect I would much prefer for tools that read NeXML to just handle it as designed. :-)

The "1 and 0" symbol works really well for display in Phenex, which is one reason I'm reluctant to change.

NEXUS has its own scheme to represent polymorphism, and I would expect libraries that read/write both to interconvert between them.

hlapp commented 5 years ago

@hlapp do you think I should update OntoTrace to choose a random unused digit (for presence/absence, of course this can be 2) to just make this work?

Based on @uyedaj's response I was indeed going to suggest that. However, I am also going to argue that making this change is in fact the right thing to do, and not just a kludge to make this work.

Technically, as I verified earlier against the NeXML schema, the type of the symbol attribute for polymorphic_state_set is indeed xs:anySimpleType, so strings are allowed. That the NeXML validator complains about this is thus a problem with the validator, not the data file at hand. However, for StandardState and StandardStateSet the type of the symbol attribute is integer. I would argue that allowing this much room for confusion in implementations is a mistake in the NeXML schema itself. Furthermore, as per @uyedaj's answer, it's also rather inconsistent with common practice in comparative method tools (which I would say runs counter to the original design goals of NeXML, as reflected in the prefix of its name, another reason why arguably this should be regarded as a mistake in the schema).

One of our goals for the API is to facilitate more widespread adoption of our data and capabilities. One aspect of this should be to return data in lossless formats that can be as straightforwardly as possible converted into the input formats needed by existing tool ecosystems. In most cases, the latter means NEXUS. Obviously, NEXUS can't encode well all the information that constitutes our data value propositions, so isn't lossless and hence we don't output NEXUS but instead do NeXML or JSON-LD. But one of the beauties of the NeXML format is that changing a PolymorphicStateSet's symbol from "1 and 0" (a multi-character string) to "2" (a single digit integer) incurs no loss of information, because the states in the set are explicitly defined and not meant to be only (or even at all) parseable from the symbol.

So instead of asking every single consumer implementation to have to perform this substitution, we can make it for them, without loss of information in the data that we return.

The "1 and 0" symbol works really well for display in Phenex, which is one reason I'm reluctant to change.

But to quote from your own response:

As you might expect I would much prefer for tools that read NeXML to just handle it as designed. :-)

Arguably, it is Phenex that doesn't treat NeXML data as it is designed. Namely, it's Phenex that gives semantic significance to the value of the symbol of polymorphic state set (namely, expecting it to show the states in the set), rather than, if and where needed, extracting this information from the data as NeXML was designed to support it.

rvosa commented 5 years ago

I have to dig in my memory as to why this was the way it was but I think the general idea was that we wanted "standard" symbols to be single characters. Polymorphic states would need to use a single character symbol not yet used elsewhere, and to give us enough headroom for this we allowed both strings and integers. That's why "1 and 2" validates, though it goes against the original idea. Does that make sense?

On Wed, Oct 3, 2018 at 11:41 PM Hilmar Lapp notifications@github.com wrote:

@hlapp https://github.com/hlapp do you think I should update OntoTrace to choose a random unused digit (for presence/absence, of course this can be 2) to just make this work?

Based on @uyedaj https://github.com/uyedaj's response I was indeed going to suggest that. However, I am also going to argue that making this change is in fact the right thing to do, and not just a kludge to make this work.

Technically, as I verified earlier against the NeXML schema, the type of the symbol attribute for polymorphic_state_set is indeed xs:anySimpleType, so strings are allowed. That the NeXML validator complains about this is thus a problem with the validator, not the data file at hand. However, for StandardState and StandardStateSet the type of the symbol attribute is integer. I would argue that allowing this much room for confusion in implementations is a mistake in the NeXML schema itself. Furthermore, as per @uyedaj https://github.com/uyedaj's answer, it's also rather inconsistent with common practice in comparative method tools (which I would say runs counter to the original design goals of NeXML, as reflected in the prefix of its name, another reason why arguably this should be regarded as a mistake in the schema).

One of our goals for the API is to facilitate more widespread adoption of our data and capabilities. One aspect of this should be to return data in lossless formats that can be as straightforwardly as possible converted into the input formats needed by existing tool ecosystems. In most cases, the latter means NEXUS. Obviously, NEXUS can't encode well all the information that constitutes our data value propositions, so isn't lossless and hence we don't output NEXUS but instead do NeXML or JSON-LD. But one of the beauties of the NeXML format is that changing a PolymorphicStateSet's symbol from "1 and 0" (a multi-character string) to "2" (a single digit integer) incurs no loss of information, because the states in the set are explicitly defined and not meant to be only (or even at all) parseable from the symbol.

So instead of asking every single consumer implementation to have to perform this substitution, we can make it for them, without loss of information in the data that we return.

The "1 and 0" symbol works really well for display in Phenex, which is one reason I'm reluctant to change.

But to quote from your own response:

As you might expect I would much prefer for tools that read NeXML to just handle it as designed. :-)

Arguably, it is Phenex that doesn't treat NeXML data as it is designed. Namely, it's Phenex that gives semantic significance to the value of the symbol of polymorphic state set (namely, expecting it to show the states in the set), rather than, if and where needed, extracting this information from the data as NeXML was designed to support it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/phenoscape/scate/issues/8#issuecomment-426812053, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGf-rJt7h_FKS0coWSdFnn6DuJZhjUZks5uhS74gaJpZM4W_lpF .

balhoff commented 5 years ago

@hlapp:

Arguably, it is Phenex that doesn't treat NeXML data as it is designed. Namely, it's Phenex that gives semantic significance to the value of the symbol of polymorphic state set (namely, expecting it to show the states in the set), rather than, if and where needed, extracting this information from the data as NeXML was designed to support it.

Hey, now I think this is getting a bit twisted. :laughing: I shouldn't be dinged for going the extra mile and generating a friendly label while staying within the spec.

I still think it's quite bad for readers of NeXML not to recognize polymorphic states. If that's not represented in their model they should do something like treat it as unknown instead of charging ahead. In NEXUS there is no additional state symbol generated anyway; a polymorphism would be represented as (0 1), which must be handled by the parser.

If we return polymorphisms as a third state symbol 2, I worry this opens the door to even more incorrect interpretations by incomplete parsers.

hlapp commented 5 years ago

I shouldn't be dinged for going the extra mile and generating a friendly label while staying within the spec.

I'm not dinging anyone here 😄 However, I think it's important to realize that the symbol attribute isn't a label attribute. Labels are indeed mostly meant for human readers, and thus can be discarded without impacting utility or readability for tools. (For example, labels in NEXUS are comments for a reason.) The symbol for a state isn't a label though, it's the piece of information a tool will use to compute with the data. So the symbol first and foremost needs to be tool-friendly, not human-friendly.

I still think it's quite bad for readers of NeXML not to recognize polymorphic states.

You mean human readers of a NeXML file, machine readers of a NeXML file, or users of a tool that lets them inspect the data in a NeXML file?

I think quite firmly that human readers of NeXML files is not, or at least not an important use case. XML is not for human readability, and pretending otherwise just leads down bad paths. A machine reader of a NeXML file will need a single-character symbol. For a user of a tool for inspecting a NeXML file, the tool does have the information needed to construct a user-friendly label for a polymorphic state (do you disagree with that?).

If that's not represented in their model they should do something like treat it as unknown instead of charging ahead.

I think that's what @uyedaj was saying is what is typically done, right? I.e., most comparative tools can't distinguish (in the sense of computing differently) between polymorphic and uncertain states.

In NEXUS there is no additional state symbol generated anyway; a polymorphism would be represented as (0 1), which must be handled by the parser.

That's a property of the NeXML format, and I agree that tools converting NeXML to NEXUS should probably write out a polymorphic state that way. But what to do depends on the target of the conversion.

If we return polymorphisms as a third state symbol 2, I worry this opens the door to even more incorrect interpretations by incomplete parsers.

Possibly, so maybe it should be a question mark instead, as recommended by @uyedaj and as is the common symbol used for this?

balhoff commented 5 years ago

Update—@hlapp has updated RNeXML to handle polymorphic states. I will update Phenoscape NeXML generation to use NEXUS-style polymorphism symbols (e.g. (0 1) for increased compatibility.