ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

rownames_as_col option for get_characters() not working as expected #135

Closed hlapp closed 8 years ago

hlapp commented 8 years ago

The rownames_as_col parameter for get_characters() is documented as this:

option to return character matrix rownames (with taxon ids) as its own column in the data.frame. Default is FALSE for compatibility with geiger and similar packages.

However, the result only includes the taxon names, not the IDs:

> get_characters(nex,rownames_as_col = TRUE)
                     taxa pelvic splint anterior dentation of pectoral fin spine
1        Ictalurus pricei             1                                        1
2         Ictalurus lupus             1                                        1
3      Ictalurus balsanus             1                                        0
4      Ictalurus furcatus             1                                        0
5     Ictalurus punctatus          <NA>                                        1
6     Ictalurus australis             1                                        1
7 Ictalurus sp. (Mo 1991)             0                                     <NA>
8       Ictalurus dugesii             1                                     <NA>
9     Ictalurus mexicanus             1                                     <NA>
  anterior distal serration of pectoral fin spine
1                                               1
2                                               1
3                                            <NA>
4                                               1
5                                               1
6                                               1
7                                            <NA>
8                                               1
9                                               1

The taxon IDs are missing. This means that matching rows with the value returned by get_taxa() remains ambiguous (because labels, i.e., taxon names, are not required to be unique).
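To illustrate why non-unique labels make this matching ambiguous, here is a minimal base-R sketch. The data frames, ids, and the duplicated label "Ictalurus sp." are made up for illustration; they are not output of get_taxa() or get_characters().

```r
# Hypothetical taxa table: two distinct OTUs share the same label.
taxa <- data.frame(
  id    = c("otu1", "otu2", "otu3"),
  label = c("Ictalurus sp.", "Ictalurus sp.", "Ictalurus lupus"),
  stringsAsFactors = FALSE
)

# Hypothetical character table keyed by label only.
chars_by_label <- data.frame(
  label = c("Ictalurus sp.", "Ictalurus sp.", "Ictalurus lupus"),
  state = c("0", "1", "1"),
  stringsAsFactors = FALSE
)

# Joining on label pairs each "Ictalurus sp." row with both OTUs,
# so the result has more rows than either input.
by_label <- merge(taxa, chars_by_label, by = "label")
nrow(by_label)

# The same character table keyed by id joins one-to-one.
chars_by_id <- data.frame(
  id    = c("otu1", "otu2", "otu3"),
  state = c("0", "1", "1"),
  stringsAsFactors = FALSE
)
by_id <- merge(taxa, chars_by_id, by = "id")
nrow(by_id)
```

With ids in the character table, each row maps to exactly one taxon; with labels alone, the rows for the duplicated label cannot be told apart.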

cboettig commented 8 years ago

Okay, I think this is fixed in the above commit. Please re-open if the issue persists or otherwise needs re-working.

xu-hong commented 8 years ago

I found that the parameters are not working as designed. The resulting columns are mixed up:

> nex <- nexml_read("https://raw.githubusercontent.com/phenoscape/rphenoscape/char-annots-example/inst/examples/ontotrace-result.xml")
> get_characters(nex)
                        anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
Ictalurus pricei                                               1                                               1
Ictalurus lupus                                                1                                               1
Ictalurus balsanus                                             0                                            <NA>
Ictalurus furcatus                                             0                                               1
Ictalurus punctatus                                            1                                               1
Ictalurus australis                                            1                                               1
Ictalurus sp. (Mo 1991)                                     <NA>                                            <NA>
Ictalurus dugesii                                           <NA>                                               1
Ictalurus mexicanus                                         <NA>                                               1

According to the original file, there should be three columns for anatomical entities, but "pelvic splint" is missing here!

otu_id is acting weird:

> get_characters(nex, otu_id = T)
                         otu anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
Ictalurus pricei           1                                        1                                               1
Ictalurus lupus            1                                        1                                               1
Ictalurus balsanus         1                                        0                                            <NA>
Ictalurus furcatus         1                                        0                                               1
Ictalurus punctatus     <NA>                                        1                                               1
Ictalurus australis        1                                        1                                               1
Ictalurus sp. (Mo 1991)    0                                     <NA>                                            <NA>
Ictalurus dugesii          1                                     <NA>                                               1
Ictalurus mexicanus        1                                     <NA>                                               1

I think the values shown in the otu column here actually belong to pelvic splint.

The following calls do not return proper matrices either.

> get_characters(nex, rownames_as_col =  T)
  pelvic splint anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
1             1                                        1                                               1
2             1                                        1                                               1
3             1                                        0                                            <NA>
4             1                                        0                                               1
5          <NA>                                        1                                               1
6             1                                        1                                               1
7             0                                     <NA>                                            <NA>
8             1                                     <NA>                                               1
9             1                                     <NA>                                               1
> get_characters(nex, rownames_as_col = T,  otu_id = T)
                      otu pelvic splint anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
1        Ictalurus pricei             1                                        1                                               1
2         Ictalurus lupus             1                                        1                                               1
3      Ictalurus balsanus             1                                        0                                            <NA>
4      Ictalurus furcatus             1                                        0                                               1
5     Ictalurus punctatus          <NA>                                        1                                               1
6     Ictalurus australis             1                                        1                                               1
7 Ictalurus sp. (Mo 1991)             0                                     <NA>                                            <NA>
8       Ictalurus dugesii             1                                     <NA>                                               1
9     Ictalurus mexicanus             1                                     <NA>                                               1

I suspect there is one common reason (mishandling of columns) for all of these.
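For reference, the behavior one would expect from moving row names into a column can be sketched in base R as below. This is only an illustration of the expected result (every character column kept, one identifier column prepended); the helper name rownames_to_col and the two-row example data are hypothetical and say nothing about how RNeXML implements this internally.

```r
# Hypothetical character matrix with taxon labels as row names.
chars <- data.frame(
  "pelvic splint" = c("1", NA),
  "anterior dentation of pectoral fin spine" = c("1", "0"),
  row.names = c("Ictalurus pricei", "Ictalurus balsanus"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Prepend the row names as a new first column without touching
# any of the existing character columns.
rownames_to_col <- function(df, name = "taxa") {
  out <- data.frame(rownames(df), df,
                    row.names = NULL,     # reset row names to 1..n
                    check.names = FALSE,  # keep column names with spaces
                    stringsAsFactors = FALSE)
  names(out)[1] <- name
  out
}

out <- rownames_to_col(chars)
names(out)
```

The result should have exactly one more column than the input, with the original columns in their original order and under their original names.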

cboettig commented 8 years ago

Thanks for the report! Will investigate.

cboettig commented 8 years ago

Okay, sorry for the delay. I've pushed a fix for this to a new branch, https://github.com/ropensci/RNeXML/tree/fix-get-characters, which should resolve this error. (@hlapp I've also added the option to get the otus id as well as the otu id, now automatically.)

This does highlight the question of when the get_characters function should be substituting labels for ids and when it shouldn't. As usual, this problem stems from the typical R package data structures where there are no ids (or rather, as is typical of domain researchers, columns such as the trait values are named with abbreviations that are not quite ids but not quite descriptive labels either). Here we see this issue arise for both OTUs (id vs otu label) and for character traits. I am not sure how best to handle it.

hlapp commented 8 years ago

@cboettig - just FYI, the committer shows as "rstudio" for the commits on that branch. Perhaps forgot to configure git?

hlapp commented 8 years ago

> This does highlight the question of when the get_characters function should be substituting labels for ids and when it shouldn't. As usual, this problem stems from the typical R package data structures where there are no ids (or rather, as is typical of domain researchers, columns such as the trait values are named with abbreviations that are not quite ids but not quite descriptive labels either). Here we see this issue arise for both OTUs (id vs otu label) and for character traits. I am not sure how best to handle it.

My take on this is to have the default behave comparably to what users are most likely to be used to or expect, even if that's not ideal in the case of labels, but to make it easy to get at the IDs for those who want to do something with them. It's the latter that's usually difficult, impossible, or neglected, and getting ambiguous labels by default isn't a bad thing if it's easy to map these to identifiers fit for computational integration.

cboettig commented 8 years ago

@hlapp yeah, was running from a docker instance with default git config, whoops.

Right, that makes sense. I've set get_characters() to return labels whenever possible. I've now added a mechanism to detect whether the data lacks labels or has any non-unique labels (which make it impossible to do the table joins one expects of ids), in which case RNeXML will return id values.
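The fallback rule described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the code on the branch: the helper name choose_keys is hypothetical, and treating NA or empty strings as "missing" labels is an assumption about what "lacks labels" means.

```r
# Hypothetical helper: use labels as row keys only when every label is
# present and unique; otherwise fall back to the ids.
choose_keys <- function(ids, labels) {
  labels_usable <- length(labels) == length(ids) &&
    !anyNA(labels) &&          # no missing labels (assumed meaning)
    all(nzchar(labels)) &&     # no empty-string labels (assumed meaning)
    !anyDuplicated(labels)     # anyDuplicated() returns 0 if all unique
  if (labels_usable) labels else ids
}

ids <- c("otu1", "otu2", "otu3")

# Unique, complete labels are kept.
choose_keys(ids, c("Ictalurus pricei", "Ictalurus lupus", "Ictalurus balsanus"))

# Duplicated or missing labels trigger the fallback to ids.
choose_keys(ids, c("Ictalurus sp.", "Ictalurus sp.", "Ictalurus lupus"))
choose_keys(ids, c("Ictalurus pricei", NA, "Ictalurus lupus"))
```

Checking uniqueness before choosing labels is what keeps later joins one-to-one: labels are returned only when they can serve as keys, and ids are returned in every other case.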