How to access the taxon IDs if the NeXML source contains them

hlapp commented 8 years ago

Perhaps there is an easy way in the API already - how does one get at the taxon IDs as annotated, for example, in the form of dwc:taxonID metadata? Like here in the NeXML produced by the Phenoscape API:

<otu id="VTO_0061495" label="Ictalurus australis" about="#VTO_0061495">
      <meta xsi:type="ResourceMeta" rel="dwc:taxonID" href="http://purl.obolibrary.org/obo/VTO_0061495" />
</otu>

cc @xu-hong.

cboettig commented 8 years ago

A couple of ways, depending on what you prefer. You can always query the S4 object structure, as described in the S4 vignette (https://cran.r-project.org/web/packages/RNeXML/vignettes/S4.html), which is the natural R way. You can also query by XPath, but that's less easy in RNeXML since we assumed few users would know XPath (or, if they did, would just do the parsing directly with the XML library).

Um, stupid question just to be clear: In your example, which is the id? The value of the id attribute on the otu element? or the href of the subsequent meta element? (though of course these are related).

For RDFa meta elements, there's more tooling, including first generating the corresponding RDF-XML document and then performing full SPARQL queries if you like, (as well as XML/Xpath-based queries of the RDF-XML). HTH,

hlapp commented 8 years ago

> A couple ways, depending on how you like. You can always query the S4 object structure, as described in the S4 vignette (https://cran.r-project.org/web/packages/RNeXML/vignettes/S4.html), which is the natural R way.

Isn't the S4 way to use API methods rather than accessing the object structure directly? (Otherwise, why even use an S4 object?)

> Um, stupid question just to be clear: In your example, which is the id? The value of the id attribute on the otu element? or the href of the subsequent meta element? (though of course these are related).

The dwc:taxonID annotation. I.e., the ID is the object of the dwc:taxonID relationship, or http://purl.obolibrary.org/obo/VTO_0061495 in the example above.

The id attribute of the <otu/> element is, I think, not useful, because it's local and not expected to roundtrip. In the Phenoscape-emitted XML it has a nice value, but it's really just there to tie the elements together in the XML document. That's why it's the annotation that's important.

> For RDFa meta elements, there's more tooling, including first generating the corresponding RDF-XML document and then performing full SPARQL queries if you like, (as well as XML/Xpath-based queries of the RDF-XML)

Yes, but I find this rather dissatisfying as an answer for how to get at one of the most important pieces of information about an OTU. I do think that there should be an API method for it.
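
For readers who want to see what "the object of the dwc:taxonID relationship" means at the XML level, here is a minimal sketch using the xml2 package rather than RNeXML. The fragment is modeled on the <otu> example at the top of the thread; the second otu, its id and its label are invented, and the default NeXML namespace is left out on purpose so that plain XPath names work (a real document would need namespace-aware XPath).

```r
library(xml2)

# Hypothetical fragment modeled on the <otu> example above. The default NeXML
# namespace is omitted so that unprefixed XPath names match.
nexml_fragment <- '<otus xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="otus1">
  <otu id="VTO_0061495" label="Ictalurus australis" about="#VTO_0061495">
    <meta xsi:type="ResourceMeta" rel="dwc:taxonID"
          href="http://purl.obolibrary.org/obo/VTO_0061495" />
  </otu>
  <otu id="otu2" label="Unannotated taxon" />
</otus>'

doc  <- read_xml(nexml_fragment)
otus <- xml_find_all(doc, "//otu")

# xml_find_first() returns one result per input node (a missing node when
# nothing matches), so an otu without the annotation yields NA, not a gap.
taxon_meta <- xml_find_first(otus, "./meta[@rel='dwc:taxonID']")

# One row per otu: the label and the href of its dwc:taxonID annotation.
taxon_table <- data.frame(
  label    = xml_attr(otus, "label"),
  taxon_id = xml_attr(taxon_meta, "href"),
  stringsAsFactors = FALSE
)
print(taxon_table)
```

Because each annotation is looked up relative to its own otu node, the label and the taxon ID stay paired regardless of document order.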

cboettig commented 8 years ago

Hi @hlapp,

Yes, as I mentioned there are several ways, most of which are in the documentation. For instance, you can get the otu-level metadata with the get_metadata(nex, level="otu") function, which is an S4 method (as opposed to S4 subsetting). Let me know if that gets what you want.

Sorry to be disappointing; without understanding the use case and the user's preferences better, it's hard to know which way is best for getting what. E.g., someone who really wants to make semantic SPARQL queries might find the R-level API functions dissatisfying. I don't have much intuition about what is the "most important information about X", but we have tried to document the different methods. It seemed silly to just write S4 accessor methods for everything, so these focus on things like the metadata elements. Suggestions and PRs always welcome.

cboettig commented 8 years ago

(Whoops, forgot the link to the metadata vignette: https://cran.r-project.org/web/packages/RNeXML/vignettes/metadata.html)

hlapp commented 8 years ago

The use case is to extract a table that maps OTU labels (which are the row labels in the data.frame returned by get_characters()) to corresponding taxon IDs. According to the list made by @sckott in ropensci/traits#38, ID is a candidate for being fairly common among trait data API packages, so @xu-hong and I were brainstorming how to obtain (and return) those.

Does get_metadata(nex,level="otu") return the metadata in a predictable order? Would they be in the same order as the matrix (data.frame) row labels, or the same order returned by get_taxa()? If not, how would one establish the mapping?

hlapp commented 8 years ago

> Does get_metadata(nex,level="otu") return the metadata in a predictable order? Would they be in the same order as the matrix (data.frame) row labels, or the same order returned by get_taxa()?

@cboettig or @sckott - seems that the order is not the same, as per what @xu-hong just tried.

sckott commented 8 years ago

@hlapp not sure, example to play with?

hlapp commented 8 years ago

The example query in the Phenoscape Apiary docs for OntoTrace will do: http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0036217%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0008897%3E

cboettig commented 8 years ago

It looks like get_metadata return order is that given by XPath for the matching node set, while get_taxa is just using R's lapply over the structure. In both cases though I would have thought the order would just be that in which these elements appear in the XML file; is that not the case? Or is that not the desired behavior? Not sure if there was a good reason for get_metadata to use XPath here in the first place.

@hlapp can you give us a bit more detail as to what would be the most desired behavior for the return objects of get_metadata and get_taxa and maybe we can clean these methods up a bit?

sckott commented 8 years ago

Looks like get_characters() is the only one that doesn't return them in order

x <- nexml_read("http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0036217%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0008897%3E")
get_characters(x)
#>                         pelvic splint anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
#> Ictalurus pricei                    1                                        1                                               1
#> Ictalurus lupus                     1                                        1                                               1
#> Ictalurus balsanus                  1                                        0                                            <NA>
#> Ictalurus furcatus                  1                                        0                                               1
#> Ictalurus punctatus              <NA>                                        1                                               1
#> Ictalurus australis                 1                                        1                                               1
#> Ictalurus sp. (Mo 1991)             0                                     <NA>                                            <NA>
#> Ictalurus dugesii                   1                                     <NA>                                               1
#> Ictalurus mexicanus                 1                                     <NA>                                               1

get_metadata(x, level = "otu")
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036225"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061498"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061495"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036221"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036218"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036223"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036220"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061497"
#> 
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061496"

get_taxa(x)
#> [1] "Ictalurus punctatus"     "Ictalurus mexicanus"     "Ictalurus australis"     "Ictalurus balsanus" "Ictalurus pricei"        "Ictalurus furcatus"      "Ictalurus lupus"         "Ictalurus dugesii" "Ictalurus sp. (Mo 1991)"
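
The row labels and the get_taxa() result printed above can be compared directly. This is a minimal sketch in base R, with both vectors retyped by hand from that output. It only checks these two orderings and says nothing about get_metadata(), whose output carries no labels.

```r
# Row labels of the get_characters() result, in printed order.
row_labels <- c("Ictalurus pricei", "Ictalurus lupus", "Ictalurus balsanus",
                "Ictalurus furcatus", "Ictalurus punctatus", "Ictalurus australis",
                "Ictalurus sp. (Mo 1991)", "Ictalurus dugesii", "Ictalurus mexicanus")

# The get_taxa() result, in printed order.
taxa <- c("Ictalurus punctatus", "Ictalurus mexicanus", "Ictalurus australis",
          "Ictalurus balsanus", "Ictalurus pricei", "Ictalurus furcatus",
          "Ictalurus lupus", "Ictalurus dugesii", "Ictalurus sp. (Mo 1991)")

same_order <- identical(row_labels, taxa)  # do the two orderings agree?
same_set   <- setequal(row_labels, taxa)   # do they contain the same labels?

# Position of each matrix row label within the get_taxa() vector; indexing
# with it reorders anything aligned with get_taxa() into matrix row order.
idx <- match(row_labels, taxa)
print(c(same_order = same_order, same_set = same_set))
print(idx)
```

A permutation like idx only helps when the labels are present and unique, which is one reason the discussion moves toward explicit key columns.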

hlapp commented 8 years ago

So is the real issue that the rows in the matrix need to be reordered to be consistent with what the other methods return?

hlapp commented 8 years ago

> @hlapp can you give us a bit more detail as to what would be the most desired behavior for the return objects of get_metadata and get_taxa and maybe we can clean these methods up a bit?

Does the link to the issue on the rphenoscape tracker help establish enough context as to where this is coming from? Essentially the use-case is to create mapping tables from name to identifier(s). To apply these painlessly and with the fewest gotchas, the names in the mapping table should be in the same order as everywhere else. Does that make sense?

The alternative is to get metadata back directly linked to what they annotate.

One way or another, there needs to be a good way to establish that mapping.

cboettig commented 8 years ago

@hlapp yeah, the rphenoscape example is helpful, but I'm still processing the details here.

It's not clear to me why the order in which the characters are returned in the table matters, in that we generally expect methods that operate on data.frames to be agnostic of the order of the rows. I thought NeXML had the same philosophy: that in general it encodes all data explicitly in fields, rather than implicitly through structure (such as ordering or nestedness).

Do I have this wrong?

It seems like the problems with the above methods are not so much the order as the lack of an additional id column. My reading of the phenoscape issue is that we want three data.frames as the return objects. get_taxa and get_metadata are currently returning a character string and a named list, respectively, which seems non-ideal.

I think they should return data.frames, and I think it sounds like they need another column that contains the data you refer to as being represented by the ordering, but I'm not quite sure what that is. (e.g. I suppose it is what they are annotating, but be really explicit for me: is it the id of the element, or the label, or something else?)

Anyway, I completely agree that we need a good way to establish the mapping and that the current return objects are failing to preserve that information.

hlapp commented 8 years ago

> It's not clear to me why the order in which the characters are returned in the table matters; in that generally we expect methods that operate on data.frames to be agnostic of the order of the rows.

Isn't it pretty common in R to use the index for subsetting? Say I wanted to subset the matrix to keep all rows for taxa that have an identifier:

data <- get_characters(nexml)
ids <- get_metadata(nexml, level="otu") # this won't work this way right now, of course
data <- data[!is.na(ids), ]

Say you have a function pk_is_descendant(x1, x2) that returns TRUE if x2 is a descendant of x1 and FALSE otherwise:

data <- get_characters(nexml)
ids <- get_metadata(nexml, level="otu") # this won't work this way right now, of course
data <- data[pk_is_descendant("Mammalia", ids), ]

Does that make sense?
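
To make the ordering hazard in this kind of index subsetting concrete, here is a small base R sketch. All data are hypothetical: the taxon labels, the trait values and the "id:" strings are invented, and the named vector merely stands in for whatever a lookup of identifiers might return.

```r
# Hypothetical character matrix with taxon labels as row names.
characters <- data.frame(
  trait = c(1, 0, 1),
  row.names = c("Taxon A", "Taxon B", "Taxon C")
)

# Hypothetical identifier lookup that comes back in a different order;
# "Taxon B" has no identifier.
ids <- c("Taxon C" = "id:3", "Taxon A" = "id:1", "Taxon B" = NA)

# Positional subsetting pairs the mask with rows by position only, so it
# keeps "Taxon B" (no identifier) and drops "Taxon C" (has one).
keep_by_position <- unname(!is.na(ids))
by_position <- characters[keep_by_position, , drop = FALSE]

# Realigning by label first makes the mask line up with the matrix rows.
aligned_ids <- ids[match(rownames(characters), names(ids))]
by_label <- characters[!is.na(aligned_ids), , drop = FALSE]

print(rownames(by_position))
print(rownames(by_label))
```

The positional version raises no error or warning, which is what makes a mismatch in return order easy to miss.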

hlapp commented 8 years ago

> My reading of the phenoscape issue is that we want three data.frames as the return objects. get_taxa and get_metadata are currently returning a character string and a named list, respectively, which seems non-ideal.

> I think they should return data.frames, and I think it sounds like they need another column that contains the data you refer to as being represented by the ordering, but I'm not quite sure what that is. (e.g. I suppose it is what they are annotating, but be really explicit for me: is it that the id of the element, or the label, or something else?)

The piece that is being used elsewhere for row and column labels. For example, the data matrix uses apparently the taxon labels for its row labels, and the character labels for its column labels.

cboettig commented 8 years ago

> The piece that is being used elsewhere for row and column labels. For example, the data matrix uses apparently the taxon labels for its row labels, and the character labels for its column labels.

Yup. I think that was a bad choice though. The row labels should be a column, and the rows should be unlabelled. This is more consistent with database design and more robust for manipulation.

Yes, people subset in R with index vectors, but typically only when the index vector is constructed from a column of the data.frame you are subsetting (where clearly you cannot have the problem of different orderings).

Your example is very helpful, but it sounds like get_characters() should be returning a column for ids as well as a column (rather than row-labels) for taxon label, and that get_metadata(nexml, level="otu") should be returning a data.frame with id as a column (or is it taxon label that we want as the key?), and then columns for attribute and value (the rel and href of the meta element). Then you could join the metadata table and the characters table, and filter, subset, etc, intelligently (with nice dplyr functions or standard index subsetting) and not have to ever worry about ordering of rows. Does that sound right to you?
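
A rough sketch of the join described here, in base R. Both tables are hypothetical stand-ins in the proposed shape, not output of any RNeXML function; the taxon labels, trait values and example.org identifiers are invented, and taxon label is used as the key only for illustration.

```r
# Hypothetical character table with the taxon label as a regular column.
characters <- data.frame(
  taxon = c("Taxon A", "Taxon B", "Taxon C"),
  trait = c(1, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical otu-level metadata: one row per annotation, in a different
# order, and with no annotation for "Taxon B".
otu_meta <- data.frame(
  taxon = c("Taxon C", "Taxon A"),
  rel   = c("dwc:taxonID", "dwc:taxonID"),
  href  = c("http://example.org/taxon/3", "http://example.org/taxon/1"),
  stringsAsFactors = FALSE
)

# Left join: every row of the character table is kept, and taxa without an
# annotation get NA in the metadata columns.
joined <- merge(characters, otu_meta, by = "taxon", all.x = TRUE)

# Filter on a column of the joined table, so row order never matters.
with_id <- joined[!is.na(joined$href), ]
print(with_id[, c("taxon", "trait", "href")])
```

With dplyr, left_join() and filter() would express the same two steps.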

xu-hong commented 8 years ago

Hi @cboettig, I agree that get_characters() should return a data.frame that has taxon labels as a column, instead of row labels. But perhaps the data.frame should not include a column for ids - my understanding is that ids are not always the information the users need to know when they ask for OntoTrace matrix?

ids should be found separately in the data.frame returned by get_metadata(nexml, level="otu"), along with taxon labels (as the key) and other values, as you mentioned. And the metadata table can be joined with the ontotrace matrix on taxon labels.

@hlapp Do you agree?

hlapp commented 8 years ago

Yes, I agree, that seems more natural. That said, it would be easy enough to splice out the ID column in rphenoscape (and to construct a separate data.frame with metadata mapped to taxon labels) before returning the result to the user, in case @cboettig would rather put it into the data.frame returned by get_characters().

cboettig commented 8 years ago

Thanks for the advice here, very helpful. I'm still a tad leery of using labels instead of ids as keys for indexing and joining tables (I'm guessing labels can have more weird UTF-8 chars than ids, and thus cause trouble if a user has not configured locales sensibly). Isn't that why we have ids in the first place?

Are we always guaranteed to have both id and label elements available (i.e. are they both required by the schema? guess I should know that...)

rvosa commented 8 years ago

No! The label attribute is optional.

cboettig commented 8 years ago

@rvosa very good, that makes much more sense. I'll return ids as the key column for each of the data.frames

hlapp commented 8 years ago

Which IDs? The ones in the id="" attribute? That would, I think, be a bad choice, because they are ephemeral, not expected to roundtrip, and local to the document. Or in other words, a sequential numbering would be just as good, but would not give the impression that any assumptions could be made about the ID.

cboettig commented 8 years ago

Right, they are local to the document, but they'd still be better than assuming the row order with no id's at all? What is the purpose of those ids? What would you recommend we use? Using an optional element seems unwise, right?

rvosa commented 8 years ago

Also, there is no requirement that all otu attributes, even if they're there, are unique.

@hlapp do you think it would ever be problematic that the ids are ephemeral? I mean, in practice? They are unique keys for managing referential integrity within the document, anything else you should use an annotation for (example: any kind of database ID).

hlapp commented 8 years ago

> They are unique keys for managing referential integrity within the document

Exactly (and @cboettig, this is the answer to your question - they are in essence local primary keys, which are never really useful to expose or export to anything else, including not from XML documents). So you might as well use a sequential numbering - it's unique for each row, and will obviously be local to a data matrix (whereas that fact might be much less obvious from the XML doc's primary keys).

So I know there's been concern with and objection to using the row order of the data matrix, but in essence we're back to that.
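
The key question raised here can be illustrated with a short base R sketch, assuming a hypothetical taxa table in which one label is repeated and one is missing (both permitted, per the comments above). The labels and the row_id column name are invented for illustration.

```r
# Hypothetical taxa table: a repeated label and a missing label.
taxa <- data.frame(
  label = c("Taxon A", "Taxon A", NA),
  stringsAsFactors = FALSE
)

# A usable key has no missing values and no duplicates; labels fail both.
label_is_key <- !anyNA(taxa$label) && anyDuplicated(taxa$label) == 0

# A sequential row number is unique by construction and is obviously local
# to this table, so nobody will mistake it for a stable identifier.
taxa$row_id <- seq_len(nrow(taxa))
row_id_is_key <- !anyNA(taxa$row_id) && anyDuplicated(taxa$row_id) == 0

print(c(label_is_key = label_is_key, row_id_is_key = row_id_is_key))
```

Such a row number only relates rows within one table; tying a row to a taxon across documents still needs the annotation.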

rvosa commented 8 years ago

I must apologize for not having read the thread closely enough. If @hlapp's use case is to map names to taxon ID annotations then this sounds to me like a table with two columns: one with the label attribute - being the place where names go - and one with the value of the taxon ID annotation (in this case a URI).

The consequence is that the names column may legally have empty or non-unique values but that, to me, seems inevitable considering that names are not required, unique, primary keys in the real world.

Programmatically we therefore can't rely on them to act like primary keys (or hash keys or whatever). But I don't think that was a requirement anyway for @hlapp, right? The converse may be true though: the taxon IDs are globally unique (so that column might be treated as such), and may have zero or more labels attached to it.

hlapp commented 8 years ago

Just as an FYI, now that @balhoff implemented identifier annotations for character definitions (see phenoscape/phenoscape-kb-services#20), we can see that the order in which get_metadata(nex, level="char") returns results isn't the same as the order of columns in the matrix returned by get_characters() either.

So right now, get_metadata() is kind of useless for getting at those metadata. And I agree that simply fixing the order doesn't cut it - if an otu or char element lacks an annotation, then that fact isn't represented by an NA in the list returned by get_metadata().

Perhaps this is a good point to arrange a conference call to move this issue forward?

cboettig commented 8 years ago

Sounds good to me.

I've implemented a new version of get_metadata now on the drop-nex branch, which returns a data.frame that contains as its columns the attribute values of any meta elements at the desired level, along with the id of the parent element, e.g. this NeXML gives:

 > get_metadata(nex, "otu")
Source: local data frame [959 x 5]

      id             rel                                              href         xsi.type parent_id
   (chr)           (chr)                                             (chr)           (fctr)     (chr)
1    ma4 concept:toTaxon            http://ncbi.nlm.nih.gov/taxonomy/54135 nex:ResourceMeta     ou475
2    ma5    concept:rank http://rs.tdwg.org/ontology/voc/TaxonRank#Species nex:ResourceMeta     ou475
3    ma6 rdfs:subClassOf            http://ncbi.nlm.nih.gov/taxonomy/54134 nex:ResourceMeta     ou475
4    ma8 concept:toTaxon           http://ncbi.nlm.nih.gov/taxonomy/122248 nex:ResourceMeta     ou465
5    ma9    concept:rank http://rs.tdwg.org/ontology/voc/TaxonRank#Species nex:ResourceMeta     ou465
6   ma10 rdfs:subClassOf           http://ncbi.nlm.nih.gov/taxonomy/122247 nex:ResourceMeta     ou465
7   ma12 concept:toTaxon            http://ncbi.nlm.nih.gov/taxonomy/30590 nex:ResourceMeta     ou578
8   ma13    concept:rank http://rs.tdwg.org/ontology/voc/TaxonRank#Species nex:ResourceMeta     ou578
9   ma14 rdfs:subClassOf             http://ncbi.nlm.nih.gov/taxonomy/9499 nex:ResourceMeta     ou578
10  ma16 concept:toTaxon             http://ncbi.nlm.nih.gov/taxonomy/9502 nex:ResourceMeta     ou484
..   ...             ...                                               ...              ...       ...

No idea if this is wise or not, but shows what I am thinking. I'd like get_taxa to return a similar data.frame, and then include the otu attribute values in get_characters. I think this permits an intelligent join for the desired tables but would be good to discuss. In any event, we should be able to do something much more useful than the current methods, which really don't help much for this or any other non-trivial use case.
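
One way a table of this shape could feed the label-to-identifier mapping is sketched below in base R. The metadata rows are the first four rows of the output above, retyped by hand. The taxa table is hypothetical: its labels are placeholders, "ou999" is an invented otu with no annotation, and choosing concept:toTaxon as the relation of interest is an assumption made for this example.

```r
# First rows of the otu-level metadata table printed above.
otu_meta <- data.frame(
  id        = c("ma4", "ma5", "ma6", "ma8"),
  rel       = c("concept:toTaxon", "concept:rank",
                "rdfs:subClassOf", "concept:toTaxon"),
  href      = c("http://ncbi.nlm.nih.gov/taxonomy/54135",
                "http://rs.tdwg.org/ontology/voc/TaxonRank#Species",
                "http://ncbi.nlm.nih.gov/taxonomy/54134",
                "http://ncbi.nlm.nih.gov/taxonomy/122248"),
  parent_id = c("ou475", "ou475", "ou475", "ou465"),
  stringsAsFactors = FALSE
)

# Hypothetical otu table: ids match parent_id above, labels are placeholders.
taxa <- data.frame(
  id    = c("ou475", "ou465", "ou999"),
  label = c("label for ou475", "label for ou465", "label for ou999"),
  stringsAsFactors = FALSE
)

# Keep only the relation of interest, then left-join on the otu id so that
# otus without that annotation still appear, with NA as their href.
taxon_refs <- otu_meta[otu_meta$rel == "concept:toTaxon", c("parent_id", "href")]
label_to_taxon <- merge(taxa, taxon_refs,
                        by.x = "id", by.y = "parent_id", all.x = TRUE)
print(label_to_taxon)
```

The join goes through the document-local otu id, but the table handed to a user would only need the label and href columns.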

hlapp commented 8 years ago

> I'd like get_taxa() to return a similar data.frame, and then include the otu attribute values in get_characters().

I guess I'm curious what you mean by similar data.frame. ID, label, and? And for get_characters(), are you thinking about returning a data.frame instead of a matrix? Not being sure what data structure you have in mind, I'll just note that I'd be wary of sticking too many columns into the matrix (or data.frame) that aren't part of the character matrix.

cboettig commented 8 years ago

Good questions, and please push back if I'm saying something silly; you, @xu-hong and @balhoff have a better idea than me about the actual use cases here.

For get_taxa, I'm thinking of returning the attribute values of the otu elements -- really that's just id and label (but could include about and xsi:type). For reference, this would probably also include a column with the parent id (e.g. which would identify if the otu values came from more than one otus block).

For the characters matrix, I would probably only add the value of the otu attribute to the <row> element. It seems like this is the right thing for joining with the other tables, rather than label which need not be unique. Does that make sense?

@rvosa does the label attribute of a <row> element have to match the label attribute of the corresponding <otu> element (that is, the element whose id corresponds to the row's otu attribute)?

cboettig commented 8 years ago

@hlapp @xu-hong others what do you think of this approach (now implemented on the drop-nex branch): https://github.com/ropensci/RNeXML/blob/611de7caa9fc8335b82b29f664574b154ec09d9f/inst/examples/merge_data.md

Note that I've left the get_characters just returning the labels and have done that join on labels instead of id, though I'm still not sure if that's ideal or not, particularly since label is not required. I think get_characters might be better off returning label and otu id information to be explicit, but perhaps not.

cboettig commented 8 years ago

So I think this issue is now addressed by PR #133 and the fix to #135. Please highlight any remaining problems as new issues so we don't lose track of them.

ropensci / RNeXML

How to access the taxon IDs if the NeXML source contains them #129