Closed: hlapp closed this issue 8 years ago
A couple of ways, depending on what you prefer. You can always query the S4 object structure, as described in the S4 vignette (https://cran.r-project.org/web/packages/RNeXML/vignettes/S4.html), which is the natural R way. You can also query by XPath, but that's less easy in RNeXML, since we assumed few users would know XPath (or, if they did, would just be doing the parsing directly with the XML library).
Um, stupid question just to be clear: In your example, which is the id? The value of the id attribute on the otu element? or the href of the subsequent meta element? (though of course these are related).
For RDFa meta elements, there's more tooling, including first generating the corresponding RDF-XML document and then performing full SPARQL queries if you like (as well as XML/XPath-based queries of the RDF-XML). HTH,
> A couple of ways, depending on what you prefer. You can always query the S4 object structure, as described in the S4 vignette (https://cran.r-project.org/web/packages/RNeXML/vignettes/S4.html), which is the natural R way.
Isn't the S4 way to use API methods rather than accessing the object structure directly? (Otherwise, why even use an S4 object?)
> Um, stupid question just to be clear: In your example, which is the id? The value of the id attribute on the otu element? or the href of the subsequent meta element? (though of course these are related).
The dwc:taxonID annotation. I.e., the ID is the object of the dwc:taxonID relationship, or http://purl.obolibrary.org/obo/VTO_0061495 in the example above.
The id attribute of the <otu/> element is, I think, not useful, because it's local and not expected to roundtrip. In the Phenoscape-emitted XML it has a nice value, but it's really just there to tie the elements together in the XML document. That's why it's the annotation that's important.
> For RDFa meta elements, there's more tooling, including first generating the corresponding RDF-XML document and then performing full SPARQL queries if you like (as well as XML/XPath-based queries of the RDF-XML)
Yes, but I find this rather dissatisfying as an answer for how to get at one of the most important pieces of information about an OTU. I do think that there should be an API method for it.
Hi @hlapp,
Yes, as I mentioned there are several ways, most of which are in the documentation. For instance, you can get the otu-level metadata with the get_metadata(nex, level="otu") function, which is an S4 method (as opposed to S4 subsetting). Let me know if that gets what you want.
Sorry to be disappointing; without understanding the use case and the user's preferences better, it's hard to know which way is best for getting at what. E.g., someone who really wants to make semantic SPARQL queries might find the R-level API functions dissatisfying. I don't have much intuition about what is the "most important information about X", but we have tried to document the different methods. It seemed silly to just write S4 accessor methods for everything, so these focus on things like the metadata elements. Suggestions and PRs always welcome.
(Whoops, forgot the link to the metadata vignette: https://cran.r-project.org/web/packages/RNeXML/vignettes/metadata.html)
The use case is to extract a table that maps OTU labels (which are the row labels in the data.frame returned by get_characters()) to corresponding taxon IDs. According to the list made by @sckott in ropensci/traits#38, ID is a candidate for being fairly common among trait data API packages, so @xu-hong and I were brainstorming how to obtain (and return) those.
Does get_metadata(nex,level="otu") return the metadata in a predictable order? Would they be in the same order as the matrix (data.frame) row labels, or the same order returned by get_taxa()? If not, how would one establish the mapping?
> Does get_metadata(nex,level="otu") return the metadata in a predictable order? Would they be in the same order as the matrix (data.frame) row labels, or the same order returned by get_taxa()?
@cboettig or @sckott - seems that the order is not the same, as per what @xu-hong just tried.
@hlapp not sure, example to play with?
The example query in the Phenoscape Apiary docs for OntoTrace will do: http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0036217%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0008897%3E
It looks like the get_metadata return order is that given by XPath for the matching node set, while get_taxa is just using R's lapply over the structure. In both cases though I would have thought the order would just be that in which these elements appear in the XML file; is that not the case? Or is that not the desired behavior? Not sure if there was a good reason for get_metadata to use XPath here in the first place.
@hlapp can you give us a bit more detail as to what would be the most desired behavior for the return objects of get_metadata and get_taxa and maybe we can clean these methods up a bit?
Looks like get_characters() is the only one that doesn't return them in order:
x <- nexml_read("http://kb.phenoscape.org/api/ontotrace?taxon=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FVTO_0036217%3E&entity=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050%3E%20some%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0008897%3E")
get_characters(x)
#>                         pelvic splint anterior dentation of pectoral fin spine anterior distal serration of pectoral fin spine
#> Ictalurus pricei                    1                                        1                                               1
#> Ictalurus lupus                     1                                        1                                               1
#> Ictalurus balsanus                  1                                        0                                            <NA>
#> Ictalurus furcatus                  1                                        0                                               1
#> Ictalurus punctatus              <NA>                                        1                                               1
#> Ictalurus australis                 1                                        1                                               1
#> Ictalurus sp. (Mo 1991)             0                                     <NA>                                            <NA>
#> Ictalurus dugesii                   1                                     <NA>                                               1
#> Ictalurus mexicanus                 1                                     <NA>                                               1
get_metadata(x, level = "otu")
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036225"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061498"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061495"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036221"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036218"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036223"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0036220"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061497"
#>
#> $`dwc:taxonID`
#> [1] "http://purl.obolibrary.org/obo/VTO_0061496"
get_taxa(x)
#> [1] "Ictalurus punctatus" "Ictalurus mexicanus" "Ictalurus australis" "Ictalurus balsanus" "Ictalurus pricei" "Ictalurus furcatus" "Ictalurus lupus" "Ictalurus dugesii" "Ictalurus sp. (Mo 1991)"
So is the real issue that the rows in the matrix need to be reordered to be consistent with what the other methods return?
> @hlapp can you give us a bit more detail as to what would be the most desired behavior for the return objects of get_metadata and get_taxa and maybe we can clean these methods up a bit?
Does the link to the issue on the rphenoscape tracker help establish enough context as to where this is coming from? Essentially the use case is to create mapping tables from name to identifier(s). To apply these painlessly and with the fewest gotchas, the names in the mapping table should be in the same order as everywhere else. Does that make sense?
The alternative is to get metadata back directly linked to what they annotate.
One way or another, there needs to be a good way to establish that mapping.
@hlapp yeah, the rphenoscape example is helpful, but I'm still processing the details here.
It's not clear to me why the order in which the characters are returned in the table matters, in that we generally expect methods that operate on data.frames to be agnostic of the order of the rows. I thought NeXML had the same philosophy: in general it encodes all data explicitly in fields, rather than implicitly through structure (such as ordering or nestedness).
Do I have this wrong?
It seems like the problems with the above methods are not so much the order as the lack of an additional id column. My reading of the phenoscape issue is that we want three data.frames as the return objects. get_taxa and get_metadata are currently returning a character string and a named list, respectively, which seems non-ideal.
I think they should return data.frames, and I think it sounds like they need another column that contains the data you refer to as being represented by the ordering, but I'm not quite sure what that is. (E.g. I suppose it is what they are annotating, but be really explicit for me: is it the id of the element, or the label, or something else?)
Anyway, I completely agree that we need a good way to establish the mapping and that the current return objects are failing to preserve that information.
> It's not clear to me why the order in which the characters are returned in the table matters, in that we generally expect methods that operate on data.frames to be agnostic of the order of the rows.
Isn't it pretty common in R to use the index for subsetting? Say I wanted to subset the matrix to keep all rows for taxa that have an identifier:
data <- get_characters(nexml)
ids <- get_metadata(nexml, level="otu") # this won't work this way right now, of course
data <- data[!is.na(ids), ]
Or say you have a function pk_is_descendant(x1, x2) that returns TRUE if x2 is a descendant of x1 and FALSE otherwise:
data <- get_characters(nexml)
ids <- get_metadata(nexml, level="otu") # this won't work this way right now, of course
data <- data[pk_is_descendant("Mammalia", ids), ]
Does that make sense?
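The subsetting above only works if ids lines up row-for-row with data. A minimal sketch of realigning by name before indexing, using made-up labels and IDs and assuming (hypothetically) that the identifiers are available as a vector named by taxon label, which is not what get_metadata() returned at the time:

```r
# Toy character matrix; row labels stand in for taxon labels (hypothetical values).
data <- data.frame(char1 = c(1, 0, 1),
                   row.names = c("taxon A", "taxon B", "taxon C"))

# Hypothetical IDs keyed by label, deliberately in a different order,
# with one taxon lacking an identifier.
ids <- c("taxon C" = "http://example.org/id/3",
         "taxon A" = "http://example.org/id/1",
         "taxon B" = NA)

# Realign ids to the row order of data before indexing by position.
aligned <- ids[match(rownames(data), names(ids))]

# Keep only the rows whose taxon has an identifier.
kept <- data[!is.na(aligned), , drop = FALSE]
print(rownames(kept))
```

Indexing with !is.na(ids) directly would keep taxon A and taxon B here, silently retaining the one taxon without an identifier and dropping one that has it.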
> My reading of the phenoscape issue is that we want three data.frames as the return objects. get_taxa and get_metadata are currently returning a character string and a named list, respectively, which seems non-ideal.
> I think they should return data.frames, and I think it sounds like they need another column that contains the data you refer to as being represented by the ordering, but I'm not quite sure what that is. (E.g. I suppose it is what they are annotating, but be really explicit for me: is it the id of the element, or the label, or something else?)
The piece that is being used elsewhere for row and column labels. For example, the data matrix apparently uses the taxon labels for its row labels, and the character labels for its column labels.
> The piece that is being used elsewhere for row and column labels. For example, the data matrix apparently uses the taxon labels for its row labels, and the character labels for its column labels.
Yup. I think that was a bad choice though. The row labels should be a column, and the rows should be unlabelled. This is more consistent with database design and more robust for manipulation.
Yes, people subset in R with index vectors, but typically only when the index vector is constructed from a column of the data.frame you are subsetting (where clearly you cannot have the problem of different orderings).
Your example is very helpful, but it sounds like get_characters() should be returning a column for ids as well as a column (rather than row-labels) for taxon label, and that get_metadata(nexml, level="otu") should be returning a data.frame with id as a column (or is it taxon label that we want as the key?), and then columns for attribute and value (the rel and href of the meta element). Then you could join the metadata table and the characters table, and filter, subset, etc, intelligently (with nice dplyr functions or standard index subsetting) and not have to ever worry about ordering of rows. Does that sound right to you?
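A rough sketch of the join described above, in base R with invented data. The column names label, rel and href follow the proposal in this comment; they are assumptions about a possible return shape, not the actual output of get_characters() or get_metadata():

```r
# Characters table with the taxon label as an ordinary column (hypothetical data).
chars <- data.frame(label = c("taxon A", "taxon B", "taxon C"),
                    char1 = c(1, 0, 1),
                    stringsAsFactors = FALSE)

# Long-format metadata: one row per annotation, keyed by taxon label.
# The row order intentionally differs from chars.
meta <- data.frame(label = c("taxon C", "taxon A", "taxon B"),
                   rel   = "dwc:taxonID",
                   href  = c("http://example.org/id/3",
                             "http://example.org/id/1",
                             "http://example.org/id/2"),
                   stringsAsFactors = FALSE)

# Join on the shared key; the row order of either input no longer matters.
joined <- merge(chars, meta, by = "label")

# Filter on the annotation value rather than on row position.
wanted <- joined[joined$href %in% c("http://example.org/id/1",
                                    "http://example.org/id/3"), ]
print(wanted[, c("label", "char1", "href")])
```

The same join could be written with dplyr; base merge() is used here only to keep the sketch dependency-free.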
Hi @cboettig,
I agree that get_characters() should return a data.frame that has taxon labels as a column, instead of row labels. But perhaps the data.frame should not include a column for ids - my understanding is that ids are not always the information the users need to know when they ask for an OntoTrace matrix?
ids should be found separately in the data.frame returned by get_metadata(nexml, level="otu"), along with taxon labels (as the key) and other values, as you mentioned. And the metadata table can be joined with the ontotrace matrix on taxon labels.
@hlapp Do you agree?
Yes, I agree, that seems more natural. That said, it would be easy enough to splice out the ID column in rphenoscape (and to construct a separate data.frame with metadata mapped to taxon labels) before returning the result to the user, in case @cboettig would rather put it into the data.frame returned by get_characters().
Thanks for the advice here, very helpful. I'm still a tad leery of using labels instead of ids as keys for indexing and joining tables. I'm guessing labels can have more weird UTF-8 chars than ids, and thus cause trouble if a user has not configured locales sensibly. Isn't that why we have ids in the first place?
Are we always guaranteed to have both id and label elements available (i.e. are they both required by the schema? guess I should know that...)
No! The label attribute is optional.
@rvosa very good, that makes much more sense. I'll return ids as the key column for each of the data.frames
Which IDs? The ones in the id="" attribute? That would, I think, be a bad choice, because they are ephemeral, not expected to roundtrip, and local to the document. Or in other words, a sequential numbering would be just as good, but would not give the impression that any assumptions could be made about the ID.
Right, they are local to the document, but they'd still be better than assuming the row order with no ids at all? What is the purpose of those ids? What would you recommend we use? Using an optional element seems unwise, right?
On Sat, Oct 17, 2015, 3:55 PM Hilmar Lapp notifications@github.com wrote:
> Which IDs? The ones in the id="" attribute? That would, I think, be a bad choice, because they are ephemeral, not expected to roundtrip, and local to the document. Or in other words, a sequential numbering would be just as good, but would not give the impression that any assumptions could be made about the ID.
Also, there is no requirement that all otu attributes, even if they're there, are unique.
@hlapp do you think it would ever be problematic that the ids are ephemeral? I mean, in practice? They are unique keys for managing referential integrity within the document, anything else you should use an annotation for (example: any kind of database ID).
> They are unique keys for managing referential integrity within the document
Exactly (and @cboettig, this is the answer to your question - they are in essence local primary keys, which are never really useful to expose or export to anything else, including not from XML documents). So you might as well use a sequential numbering - it's unique for each row, and will obviously be local to a data matrix (whereas that fact might be much less obvious from the XML doc's primary keys).
So I know there's been concern with and objection to using the row order of the data matrix, but in essence we're back to that.
I must apologize for not having read the thread closely enough. If @hlapp's use case is to map names to taxon ID annotations then this sounds to me like a table with two columns: one with the label attribute - being the place where names go - and one with the value of the taxon ID annotation (in this case a URI).
The consequence is that the names column may legally have empty or non-unique values but that, to me, seems inevitable considering that names are not required, unique, primary keys in the real world.
Programmatically we therefore can't rely on them to act like primary keys (or hash keys or whatever). But I don't think that was a requirement anyway for @hlapp, right? The converse may be true though: the taxon IDs are globally unique (so that column might be treated as such), and each may have zero or more labels attached to it.
On Sun, 18 Oct 2015 at 02:05, Hilmar Lapp notifications@github.com wrote:
> They are unique keys for managing referential integrity within the document
> Exactly (and @cboettig, this is the answer to your question - they are in essence local primary keys, which are never really useful to expose or export to anything else, including not from XML documents). So you might as well use a sequential numbering - it's unique for each row, and will obviously be local to a data matrix (whereas that fact might be much less obvious from the XML doc's primary keys).
> So I know there's been concern with and objection to using the row order of the data matrix, but in essence we're back to that.
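One way to act on this, sketched with made-up labels: check whether the label column happens to be usable as a key, and fall back to sequential numbering when it is not. The variable names are illustrative only.

```r
# Hypothetical labels: one missing, one duplicated.
labels <- c("taxon A", "taxon B", NA, "taxon A")

# A label column can serve as a key only if every value is present,
# non-empty, and unique.
usable_as_key <- !anyNA(labels) &&
  all(nzchar(labels)) &&
  anyDuplicated(labels) == 0

# Otherwise fall back to a sequential row number, which is unique
# and obviously local to this table.
key <- if (usable_as_key) labels else seq_along(labels)
print(key)
```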
Just as an FYI, now that @balhoff implemented identifier annotations for character definitions (see phenoscape/phenoscape-kb-services#20), we can see that the order in which get_metadata(nex, level="char") returns results isn't the same as the order of columns in the matrix returned by get_characters() either.
So right now, get_metadata() is kind of useless for getting at those metadata. And I agree that simply fixing the order doesn't cut it - if an otu or char element lacks an annotation, then that fact isn't represented by an NA in the list returned by get_metadata().
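To illustrate the point about missing annotations: a left join from the full set of OTUs keeps the unannotated ones and fills in NA. This is only a sketch with invented ids and URIs; the two input tables are hypothetical shapes, not what RNeXML returned at this point in the thread.

```r
# All OTUs in the document (hypothetical ids and labels).
otus <- data.frame(id    = c("otu1", "otu2", "otu3"),
                   label = c("taxon A", "taxon B", "taxon C"),
                   stringsAsFactors = FALSE)

# Annotations exist for only two of them, listed in a different order.
annot <- data.frame(parent_id = c("otu3", "otu1"),
                    href      = c("http://example.org/id/3",
                                  "http://example.org/id/1"),
                    stringsAsFactors = FALSE)

# all.x = TRUE keeps every OTU; those without an annotation get NA in href.
mapping <- merge(otus, annot, by.x = "id", by.y = "parent_id", all.x = TRUE)
print(mapping)
```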
Perhaps this is a good point to arrange a conference call to move this issue forward?
Sounds good to me.
I've implemented a new version of get_metadata now on the drop-nex branch, which returns a data.frame that contains as its columns the attribute values of any meta elements at the desired level, along with the id of the parent element, e.g. this NeXML gives:
> get_metadata(nex, "otu")
Source: local data frame [959 x 5]

      id             rel                                              href         xsi.type parent_id
   (chr)           (chr)                                             (chr)           (fctr)     (chr)
 1   ma4 concept:toTaxon            http://ncbi.nlm.nih.gov/taxonomy/54135 nex:ResourceMeta     ou475
 2   ma5    concept:rank http://rs.tdwg.org/ontology/voc/TaxonRank#Species nex:ResourceMeta     ou475
 3   ma6 rdfs:subClassOf            http://ncbi.nlm.nih.gov/taxonomy/54134 nex:ResourceMeta     ou475
 4   ma8 concept:toTaxon           http://ncbi.nlm.nih.gov/taxonomy/122248 nex:ResourceMeta     ou465
 5   ma9    concept:rank http://rs.tdwg.org/ontology/voc/TaxonRank#Species nex:ResourceMeta     ou465
 6  ma10 rdfs:subClassOf           http://ncbi.nlm.nih.gov/taxonomy/122247 nex:ResourceMeta     ou465
 7  ma12 concept:toTaxon            http://ncbi.nlm.nih.gov/taxonomy/30590 nex:ResourceMeta     ou578
 8  ma13    concept:rank http://rs.tdwg.org/ontology/voc/TaxonRank#Species nex:ResourceMeta     ou578
 9  ma14 rdfs:subClassOf             http://ncbi.nlm.nih.gov/taxonomy/9499 nex:ResourceMeta     ou578
10  ma16 concept:toTaxon             http://ncbi.nlm.nih.gov/taxonomy/9502 nex:ResourceMeta     ou484
..   ...             ...                                               ...              ...       ...
No idea if this is wise or not, but it shows what I am thinking. I'd like get_taxa to return a similar data.frame, and then include the otu attribute values in get_characters. I think this permits an intelligent join for the desired tables but would be good to discuss. In any event, we should be able to do something much more useful than the current methods, which really don't help much for this or any other non-trivial use case.
> I'd like get_taxa() to return a similar data.frame, and then include the otu attribute values in get_characters().
I guess I'm curious what you mean by similar data.frame. ID, label, and? And for get_characters(), are you thinking about returning a data.frame instead of a matrix? Not being sure what data structure you have in mind, I'll just note that I'd be wary of sticking too many columns into the matrix (or data.frame) that aren't part of the character matrix.
Good questions, and please push back if I'm saying something silly; you, @xu-hong and @balhoff have a better idea than me about the actual use cases here.
For get_taxa, I'm thinking of returning the attribute values of the otu elements -- really that's just id and label (but could include about and xsi:type). For reference, this would probably also include a column with the parent id (e.g. which would identify if the otu values came from more than one otus block).
For the characters matrix, I would probably only add the value of the otu attribute of the <row> element. It seems like this is the right thing for joining with the other tables, rather than label, which need not be unique. Does that make sense?
@rvosa does the label attribute of a <row> element have to match the label attribute of the corresponding <otu> element (that is, the element whose id corresponds to the row's otu attribute)?
@hlapp, @xu-hong, and others: what do you think of this approach (now implemented on the drop-nex branch)? https://github.com/ropensci/RNeXML/blob/611de7caa9fc8335b82b29f664574b154ec09d9f/inst/examples/merge_data.md
Note that I've left get_characters just returning the labels and have done that join on labels instead of id, though I'm still not sure if that's ideal or not, particularly since label is not required. I think get_characters might be better off returning both label and otu id information to be explicit, but perhaps not.
So I think this issue is now addressed by PR #133 and the fix to #135. Please highlight any remaining problems as new issues so we don't lose track of them.
Perhaps there is an easy way in the API already - how does one get at the taxon IDs as annotated, for example, in the form of dwc:taxonID metadata? Like here in the NeXML produced by the Phenoscape API:
cc @xu-hong.