Handling of polymorphic and uncertain character states is badly broken

hlapp commented 6 years ago

Perhaps this is now amplified by recent behavior changes in dplyr, but here's an enumeration of the problems I am finding:

The symbol of a polymorphic_state_set may not be an integer or numeric, leading to the issue reported in #171. The same will likely be true for uncertain_state_set in a morphological character matrix. (In a DNA or protein sequence matrix, non-polymorphic and unambiguous states will be letters, which hides the type mismatch issue.)
Joining cells to states (get_characters.R, lines 65-68) fails to map symbols for polymorphic and uncertain states, because the bind_rows() in lines 59 and 61, respectively, doesn't treat state and polymorphic_state_set as the same column, as the code somehow seems to assume it does.
The code for setting the type of each character column seems to assume that the xsi.type of the enclosing <characters/> element determines the type. However, this is not justified by the schema at all, and will likely be violated by polymorphic states.
The function na_symbol_to_state() (called in line 76) seems to assume that it would make sense to coerce the state column (an identifier reference) to a numeric value if the symbol is NA. Perhaps I misunderstand the code but there is nothing in the schema that justifies this as a viable idea.

cboettig commented 6 years ago

@hlapp yeah, it really looks like polymorphic and uncertain states were never really implemented properly. I suspect this was due to an over-reliance on the corresponding ape / phylo data structures, which iirc don't have support for polymorphic states (or is that not the case)?

In fact, I'm really not sure what the 'correct' behavior should be, I think the data.frame format returned by get_characters() is fundamentally at odds with expressing polymorphic states? e.g. in R you can have such a thing as a length 3 as.factor(c("A", "B", "C")), but polymorphism would just look like an additional state type and not a true polymorhism, right? as.factor(c("A", "B", "B | C"))

So this is definitely a bug, but I think also indicative this a something of a fundamental design flaw in RNeXML, which is really geared around converting directly between formats like an R character matrix and the NeXML representation directly. Hence all this internal use of dplyr operating on internal objects represented as data.frames that in reality cannot always be represented as data.frames (i.e. when a state is polymorphic).

I think the correct approach should create a direct representation of the NeXML schema in R, and always operate on that representation directly. RNeXML approximates this with S4, but then the functions you are calling aren't operating on the S4 objects directly, but rather on these data.frame representations, which I think was a mistake.

The S4 classes are hard to work with, but I'm convinced we have a much better approach in https://github.com/nexld, where we have a native representation of (I think!) the full schema in S3, and are able to operate on that directly without lossy coercion into a data.frame, phylo object, etc. In fact, the thing that's missing from nexld at the moment is that it has no ability to turn NeXML into a phylo or other standard phylogenetic type in R; it only knows about it's schema-compliant S3 representation and how to read and write that from XML.

Can you share a little bit more about what you are trying to do? nexld might already be able to do some of it? We would have to think through a bit more about the correct behavior for a function like get_characters() still...

hlapp commented 6 years ago

I don't think it's quite as bad as you paint it. Regarding the nexld project, I think we have short and long-term goals here, and I'm apprehensive of mixing the two together. To elaborate:

The problem we're having is I think mostly or entirely confined to non-molecular matrices. For example, for DNA matrices there are ambiguity codes that look, taste, and feel like DNA base symbols, and there are well-known conventions for what they encode. All that a consuming analysis algorithm might need to know is whether the position that shows N is polymorphic or uncertain (missing data). These conventionally well-defined mappings don't exist for morphological characters, and so there's no convention or standard for expressing which states make up a polymorphic character state. As a result, very few analysis codes can deal with this anyway.

This problem, i.e., how to present the data to a consumer in the comparative analysis tool ecosystem so they can digest it, I would argue is the main problem. It won't go away by changing to an internal nexld representation.

So in the short term, this is what I think we need to aim for, and I think can with the representation we already have:

Return a character matrix, with symbol coding that's at least not wrong. Right now, the code throws an exception, and even if we made the small needed fix to prevent that, the resulting symbol coding would be wrong (because a polymorphic state is not missing data).
Conventionalize on a reasonable symbol coding that is conservative for (in the sense of not easily leading to wrong analysis results), yet interpretable by a consumer. For example, we could generally put ? for a polymorphic state set, and NA for an uncertain state set. (This would equate polymorphic states for the same character that differ in member states. However, most tools would treat ? as missing data anyway, so in practice it wold be largely inconsequential. Same for uncertain states.) We could also simply take the symbol that's in NeXML, though this would also just kick the can down the road to the consumer, because they most likely can't deal with it.
Enable consumers to make their own decisions. That is, on top of adhering to some documented convention for symbols for polymorphic and uncertain states, provide an API that allows a consumer to query. Something like is_polymorphic() and is_uncertain() that returns a TRUE/FALSE matrix in the same arrangement as the character matrix. And member_states() that returns the members of polymorphic and uncertain state sets for those consumers that can actually use this information.

What would be really great if we had a data structure that would carry the referential integrity around, i.e., the IDs and ID references in the NeXML, but to a consumer for all intents and purposes would appear as a data frame (or a matrix). This would then be robust to character column re-arrangements or name changes, and filtering rows or columns. In a way, the nexml object is such a data structure, but I guess we both agree it's way too unwieldy for enabling flexible and well-performing manipulation. I haven't seen enough of the nexld experiment to see how that can be cast so that dplyr and other data frame methods can transparently operate on it, but I suspect it's still quite a bit away from that.

cboettig commented 6 years ago

Okay, good points all around.

It sounds like a function such as get_characters() should simply take optional arguments for how polymorphic and uncertain states are to be encoded in the matrix, e.g. using the defaults you suggest in the second bullet:

get_characters(nexml, polymorphic = "?", uncertain = "NA")

(with some subtly as to whether "NA" should be treated as a character string in R or as a literal NA.) That would provide simple default behavior but some level of user control and awareness.

Af for carrying the object around, I think you're right that the NeXML itself (or at least the S4 version of it in R) is the definitive solution there. I think get_characters() should return just an honest data.frame that acts as a data.frame with no magic (e.g. you can write it out to csv, read it back in, pass it to any R function that works with a data.frame). We've found that adding hidden attributes to data.frames with additional metadata always goes badly in the end (i.e. can always create unexpected behavior).

Yeah, nexml is a bit unwieldy for this purpose, which means that functions like get_characters have to work pretty hard to go from (and back to) nexml. As you know, nexld is converting the XML to JSON under the hood first, which results in an object with a lot simpler parsing logic. jsonlite already defines nice round-trippable conventions for converting JSON into data.frames, and functional programming approaches from tools like purrr (which didn't exist when we wrote RNeXML) make working with JSON lists a lot easier. Still, like you say this is more concept than practice yet, so patching RNeXML::get_characters() is probably more expedient...

cboettig commented 6 years ago

Re:

The function na_symbol_to_state() (called in line 76) seems to assume that it would make sense to coerce the state column (an identifier reference) to a numeric value if the symbol is NA. Perhaps I misunderstand the code but there is nothing in the schema that justifies this as a viable idea.

Yeah, I agree the code here is way more opaque than it should be, which is largely a consequence of attempts to treat everything as data.frames. Short version is that these NAs in symbol only appear on continuousTratis, which don't have a symbol value, and so in the long-from table representation, have NAs for symbol.

In the code as written, I don't think NA can arise another way, since NA is not a permitted value. But that might change, or we might introduce NAs explicitly as discussed for uncertain states.

What is happening here is that we constructed a "long-form" table that tries to express both continuous trait data and discrete trait data in the same schema. The value of a discrete trait is stored in "symbol" attribute (and thus in the symbol column), while the value of a continuous state is stored in "state" column.

I'm not really sure that the approach taken here with the dplyr-based manipulation of intermediate tabular formats was really such a great idea; particularly due to the potential for type conflicts (e.g. attempting to put numeric/continuous trait data and text/discrete trait data in the same column before we tidyr::spread into long form. But I don't think this is currently making any erroneous assumption.

hlapp commented 6 years ago

In the code as written, I don't think NA can arise another way, since NA is not a permitted value.

If everything were correct then presumably yes. However, the reason I stumbled over this was that the incorrect joining for cells (or more specifically, the fact that the join doesn't work for polymorphic and uncertain states, see second bullet point in the original issue post) leaves some symbols as NA.

I agree that once that's fixed, we shouldn't be hitting this anymore erroneously.

cboettig commented 6 years ago

I know you noted that #174 doesn't completely address everything here, e.g. does not yet give users direct control over the symbols, (though it's not entirely obvious to me that it should). Still, I'd propose we close this as I think the title issue is resolved, and maybe can consider whether we should open a separate issue to track the remaining items?

hlapp commented 6 years ago

I agree. Also, I'm not sure that adding parameters that would allow

get_characters(nexml, polymorphic = "?", uncertain = "NA")

is necessarily worth it if what I think we really need is giving users simple and full access to the information the nexml object. So I created #175, and I think once that's in place it's easy enough for the user to accomplish the above themselves.

hlapp commented 6 years ago

Also, for the record I'll add a side note here regarding polymorphic state sets. The schema does allow uncertain state sets to be nested within (i.e., to be a member of) a polymorphic state set.

This is currently not handled at all by the code here. It also seems like a real fringe case that for the foreseeable future at least within Phenoscape I don't see becoming a use case.

hlapp commented 6 years ago

@cboettig should we close this out now?

cboettig commented 6 years ago

sounds good

ropensci / RNeXML

Handling of polymorphic and uncertain character states is badly broken #172