ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML
Other
13 stars 9 forks source link

xml matrix with polymorphic states #161

Closed laurajackson closed 5 years ago

laurajackson commented 7 years ago

It seems that when reading in an XML matrix file into R using the RNeXML function (read.nexml), any cells that contain a polymorphic state (i.e., both 0&1 in a single cell) are improperly read as NA. Is this a bug with the package, or is there a specific code to fix this issue?

cboettig commented 7 years ago

Can you link me to an example of that and I'll take a look

laurajackson commented 7 years ago

Here is a small example file. I am generating the matrices using the Phenoscape Knowledgebase.

Testmatrix.xml.zip

cboettig commented 7 years ago

Okay, I think this should be patched in the commit I just pushed.

Note that the issue here (as you may know) is that for a polymorphic or uncertain state, symbol is an xs:string, while usually it is defined as a positive integer (StandardToken), see http://nexml.org/doc/schema-1/characters/standard/#StandardUncertainStateSet vs http://nexml.org/doc/schema-1/characters/standard/#StandardToken .

So get_characters will still return this as missing, even though the S4 class will no longer coerce it to an NA:

> nex <- nexml_read("Testmatrix.xml") 
> RNeXML::get_characters_list(nex)
$`f6794331-d603-4574-9a66-3788daaa8095`
                         pectoral fin pectoral fin skeleton
Synbranchidae                    <NA>                  <NA>
Synbranchus marmoratus              0                     0
Ophisternon aenigmaticum            0                     0
Ophisternon bengalense              0                     0
Monopterus albus                    0                     0
Monopterus cuchia                   0                     0
Monopterus boueti                   0                     0
Ophisternon infernale               0                     0
Ophisternon afrum                   0                     0
Ophisternon candidum                0                     0
Synbranchus madeirae                0                     0
Synbranchus lampreia                0                     0

> nex@characters[[1]]@format@states[[1]]@polymorphic_state_set[[1]]@symbol
   symbol 
"0 and 1" 

. Perhaps get_characters should use character class instead of an integer class for states? (Factors might be closer to the intended concept, internally they are also positive integers but can look like characters, which is responsible for many errors and much cursing about stingsAsFactors=)

Leaving things as they are in the patch with get_characters trying to make things integers is perhaps closer to the expected behavior defined by the schema though. I'm open to suggestions here.

rvosa commented 7 years ago

States were made integers only to make it all work within the confines of XML schema language in such a way that compact, "standard", character state sequences (space-separated strings) could be validated. There is certainly no conceptual reason why categorical states should have integer semantics; factors make more sense under nearly all circumstances (unless you think ordered states under maximum parsimony are integers).

As an aside, I would not use the string "0 and 1" as a state symbol, because a stupid converter might simply insert that in a character state sequence (say, in a nexus file) and thereby invalidate it. I would just use "2", given that the element structure already explains that this maps onto "{0,1}".

cboettig commented 7 years ago

Thanks @rvosa , that makes perfect sense.

cboettig commented 5 years ago

looks like this is resolved. @hlapp I'm noting @rvosa's comment above though as relevant to the discussion in #172,

As an aside, I would not use the string "0 and 1" as a state symbol, because a stupid converter might simply insert that in a character state sequence (say, in a nexus file) and thereby invalidate it. I would just use "2", given that the element structure already explains that this maps onto "{0,1}".

hlapp commented 5 years ago

See also thread in phenoscape/scate#8. In a nutshell, using 2 (or, more generally, the digit that's incremented by 1 from the maximum standard state symbol in the character) would be valid according to the schema, maintain readability by tools when converted to less expressive formats such as NEXUS, but carries the danger that the user simply passes the file into a downstream comparative analysis program, which would likely assume that states 0 and 2 are as distinct from each other as states 0 and 1, which could easily lead to spurious results (i.e., producing some false positive).

This compares to using a symbol that either throws up an error when parsed (forcing the user to pay attention and make a decision about how to deal with the problem), or a symbol that by default will be treated by most programs as missing (or uncertain) data. The latter may result in false negatives downstream due to potential loss of signal in the data, but arguably that's better than giving users rope to hang themselves with.