ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

"get_characters" function that is equivalent to get_taxa? #137

Closed xu-hong closed 8 years ago

xu-hong commented 8 years ago

Currently what the get_characters function really does is "get characters matrix". There is no equivalent function to get_taxa for anatomical entities.

> get_taxa(ne)
Source: local data frame [9 x 5]

           id                   label        about xsi.type                                  otus
        (chr)                   (chr)        (chr)    (lgl)                                 (chr)
1 VTO_0036225     Ictalurus punctatus #VTO_0036225       NA t0d4df580-2d92-4166-8518-a76116df5295
2 VTO_0061498     Ictalurus mexicanus #VTO_0061498       NA t0d4df580-2d92-4166-8518-a76116df5295
3 VTO_0061495     Ictalurus australis #VTO_0061495       NA t0d4df580-2d92-4166-8518-a76116df5295
4 VTO_0036221      Ictalurus balsanus #VTO_0036221       NA t0d4df580-2d92-4166-8518-a76116df5295
5 VTO_0036218        Ictalurus pricei #VTO_0036218       NA t0d4df580-2d92-4166-8518-a76116df5295
6 VTO_0036223      Ictalurus furcatus #VTO_0036223       NA t0d4df580-2d92-4166-8518-a76116df5295
7 VTO_0036220         Ictalurus lupus #VTO_0036220       NA t0d4df580-2d92-4166-8518-a76116df5295
8 VTO_0061497       Ictalurus dugesii #VTO_0061497       NA t0d4df580-2d92-4166-8518-a76116df5295
9 VTO_0061496 Ictalurus sp. (Mo 1991) #VTO_0061496       NA t0d4df580-2d92-4166-8518-a76116df5295

Users would want to see the mapping from label to char for anatomical entities as well. There should be a function that returns the data frame similar to what get_taxa returns. cc @hlapp

cboettig commented 8 years ago

Right, get_characters returns the character matrix by default, since that's the object most people are probably familiar with. And as you know, we now have options to get a few more bits of metadata, such as the otu and otus values.

I agree that it could behave more like get_taxa, but not quite sure what is needed. Keep in mind there's a lot of possibilities since the character data is a lot more complex than the taxa data.

I've added the (mostly) general function get_level(), which is used internally by all of these methods. You'll see get_taxa is really just get_level(nex, "otus/otu"). get_level() just returns a data.frame in which columns are the attributes of the specified level, and rows are the elements found at that level. I haven't really polished it for generic end-user use but feel free to play around with it and let me know if it is useful.

The problem here is that get_characters data is a lot more complex than get_taxa() -- there's not just one obvious table representation, but really at least 4 tables of interest: char, state (somehow handling the data from polymorphic & uncertain types), cell (with some data from row), as well as the otu table. You'll see all of these in play here: https://github.com/ropensci/RNeXML/blob/fix-get-characters/R/get_characters.R#L27-L43. (And of course there could be metadata tables associated with any one of these)

I'm not sure if each of these tables should have its own user-level method; we do want to keep a reasonably concise namespace, after all, or it gets overwhelming for new users.

I agree that there is perhaps some information in the chars or states tables that we still aren't exposing, but I'm not sure what else to expose.
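The table shape described for get_level() can be sketched in base R. This is a minimal illustration under stated assumptions, not the RNeXML implementation: `attrs_to_df` is a hypothetical helper, and the parsed elements are stood in for by a hand-written list of attribute vectors whose values are copied from the get_taxa output above. It only shows the layout: one row per element, one column per attribute, plus a final column carrying the parent's id (the otus column in that output).

```r
# Hypothetical stand-in for three parsed <otu/> elements: one named character
# vector of attributes per element (values copied from the get_taxa output).
otu_attrs <- list(
  c(id = "VTO_0036225", label = "Ictalurus punctatus", about = "#VTO_0036225"),
  c(id = "VTO_0061498", label = "Ictalurus mexicanus", about = "#VTO_0061498"),
  c(id = "VTO_0061495", label = "Ictalurus australis", about = "#VTO_0061495")
)

# Illustrative helper, NOT part of RNeXML: turn a list of attribute vectors
# into a data.frame, filling attributes an element lacks with NA and adding
# a last column that holds the parent's id, named after the parent element.
attrs_to_df <- function(attr_list, parent_name, parent_id) {
  cols <- unique(unlist(lapply(attr_list, names)))
  rows <- lapply(attr_list, function(a) {
    row <- setNames(as.list(rep(NA_character_, length(cols))), cols)
    row[names(a)] <- as.list(a)
    as.data.frame(row, stringsAsFactors = FALSE)
  })
  df <- do.call(rbind, rows)
  df[[parent_name]] <- parent_id
  df
}

taxa <- attrs_to_df(otu_attrs, "otus", "t0d4df580-2d92-4166-8518-a76116df5295")
print(taxa)
```

The same helper shape would apply to any other level, such as char elements under a format element; only the attribute names and the parent column change.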

rvosa commented 8 years ago

Isn't it a bit of a misnomer to call a method that returns the matrix "get_characters"? My expectation would probably be that it returns a list (or a similar structure) of the matrix columns, with whatever metadata is attached to them.

cboettig commented 8 years ago

@rvosa Thanks for the feedback. I'm afraid I'm not really sure I understand what data you expect or how you'd like it to be formatted. A few observations about the current format:

hlapp commented 8 years ago

Just to be clear about what one of the driving use-cases is from the rphenoscape point of view, we need to be able to map character labels to character annotations, including identifiers (not to be confused with the document-local IDs for <char/> and other elements, which are the IDs used in the XML document to tie together entities). We can now use get_metadata(level="characters/format/char") to extract the annotations for characters, but they can't yet be mapped to the columns of the data.frame returned by get_characters() because the document-local IDs needed to map are only included in the result from get_metadata() but not in the result from get_characters(). So we're back to relying on order, which is brittle and bad.

For OTUs, this issue is now addressed by get_taxa() returning the document-local OTU IDs as one of the columns in the returned data.frame. We can ask get_characters() to include OTU IDs as a column, which enables mapping rows in the data matrix to OTU annotations. There isn't yet a way to ask get_characters() to include column (= <char/>) IDs.

Is that understandable?

cboettig commented 8 years ago

Ah, thanks! That makes perfect sense. So we need char id values? For now, can you try get_level(nex, "characters/format/char") (on the fix-get-characters branch)? I think that's the data you're looking for.

xu-hong commented 8 years ago

Thanks! But I cannot access get_level() function, on the fix-get-characters branch. I guess it's not exported?

cboettig commented 8 years ago

Right, it isn't exported yet since it's not documented or fully tested, but you can always use RNeXML:::get_level.

xu-hong commented 8 years ago

Ah! Thanks. Yeah, I believe it produces the result we wanted to see.

> RNeXML:::get_level(ne, level="characters/format/char")
Source: local data frame [3 x 6]

                                 states             id                                           label           about xsi.type format
                                  (chr)          (chr)                                           (chr)           (chr)    (lgl)  (chr)
1 sa75ef9ac-e74e-4015-846d-27d793868951 UBERON_2002002 anterior distal serration of pectoral fin spine #UBERON_2002002       NA   root
2 s99d94a8b-9bab-4b56-990a-a3fcc85900f4 UBERON_2001788                                   pelvic splint #UBERON_2001788       NA   root
3 sb29f0f18-addb-4e9b-bcef-833065cba124 UBERON_2002001        anterior dentation of pectoral fin spine #UBERON_2002001       NA   root

If I understand correctly, the id column in the result above corresponds to the char column in the following data frame, right?

> RNeXML::get_metadata(ne, level="characters/format/char")
Source: local data frame [3 x 6]

     id             rel                                          href     xsi.type           char format
  (lgl)           (chr)                                         (chr)        (chr)          (chr)  (chr)
1    NA obo:IAO_0000219 http://purl.obolibrary.org/obo/UBERON_2002002 ResourceMeta UBERON_2002002   root
2    NA obo:IAO_0000219 http://purl.obolibrary.org/obo/UBERON_2001788 ResourceMeta UBERON_2001788   root
3    NA obo:IAO_0000219 http://purl.obolibrary.org/obo/UBERON_2002001 ResourceMeta UBERON_2002001   root

cboettig commented 8 years ago

Correct, id is the id attribute of the specified level (char in this case). All column names (including id) are just the attribute names from that level, except the last column, which holds the id of the parent and is named after the parent's element.

Perhaps it would be cleaner to go ahead and rename the id column with the element name? I guess that would make joins easier, though it might not be obvious that these are id attributes?
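The correspondence between the two tables can be used for a key-based join instead of relying on row order. The following is a minimal sketch, assuming only base R: the two data frames are hand-built stand-ins containing a subset of the columns and the values shown in the get_level and get_metadata outputs above, and the names `chars`, `meta`, and `annotated` are illustrative.

```r
# Stand-in for the get_level() result: document-local char ids and labels.
chars <- data.frame(
  id = c("UBERON_2002002", "UBERON_2001788", "UBERON_2002001"),
  label = c("anterior distal serration of pectoral fin spine",
            "pelvic splint",
            "anterior dentation of pectoral fin spine"),
  stringsAsFactors = FALSE
)

# Stand-in for the get_metadata() result: annotations keyed by the char column.
meta <- data.frame(
  rel = "obo:IAO_0000219",
  href = paste0("http://purl.obolibrary.org/obo/",
                c("UBERON_2002002", "UBERON_2001788", "UBERON_2002001")),
  char = c("UBERON_2002002", "UBERON_2001788", "UBERON_2002001"),
  stringsAsFactors = FALSE
)

# Shuffle the metadata rows to show that the join does not depend on order.
meta <- meta[c(3, 1, 2), ]

# Match chars$id against meta$char; the key column is kept under the name "id".
annotated <- merge(chars, meta, by.x = "id", by.y = "char")
print(annotated[, c("id", "label", "href")])
```

If the id column were renamed after the element, as suggested above, both tables would share a column named char and the join would reduce to `merge(chars, meta, by = "char")`.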

cboettig commented 8 years ago

Okay, this should all be fixed in master now, which also exposes get_level to the user.