ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

SPARQL use case #73

Closed rvosa closed 10 years ago

rvosa commented 10 years ago

Now that it is so relatively painless to extract RDF and run SPARQL queries on it (incidentally: great job, supercool) I think it would be good to develop a more persuasive use case to demonstrate the power of this facility.

Here's an idea: let's say we have a tree, some trait data and some occurrences for a set of species. As usual, after all the data cleaning, we find that the species in the tree, in the trait data and the occurrences are only partially overlapping. It ought to be possible to extract the union of the taxa across these different data sources by way of a query.

What do you guys think - is that the coolest we can come up with (hopefully not?) and do we have some published data lying around that we could use to demonstrate this?

cboettig commented 10 years ago

A meaningful SPARQL example would be great. Your proposed use case does sound like a common one that many researchers could relate to. My only thought is that most R users would be more familiar with simply importing the tree and trait data, etc, and then extracting the union (the treedata() function from the geiger package being probably the most common way users handle this, though the function assumes perfectly matching species names being used on both tree and trait data).
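For comparison, the plain-R baseline mentioned here can be sketched in a few lines of base R. The species names and the three vectors are hypothetical stand-ins for the taxa found in a tree, a trait table and an occurrence set; this is not code from the package.

```r
# Hypothetical species lists from three cleaned data sources
tree_taxa  <- c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla")
trait_taxa <- c("Homo_sapiens", "Pan_troglodytes", "Pongo_abelii")
occ_taxa   <- c("Homo_sapiens", "Gorilla_gorilla", "Pongo_abelii")

sources <- list(tree = tree_taxa, traits = trait_taxa, occurrences = occ_taxa)

# Union: every taxon mentioned by at least one source
all_taxa <- Reduce(union, sources)

# Intersection: taxa present in every source
shared_taxa <- Reduce(intersect, sources)

# Presence/absence matrix showing which source lacks which taxon
presence <- sapply(sources, function(x) all_taxa %in% x)
rownames(presence) <- all_taxa
print(presence)
```

Like treedata(), this only works when the names match exactly across sources, which is the gap a query over shared taxon URIs could close.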

I was wondering if we might have an example that emphasizes the logical reasoning of SPARQL that doesn't have an immediate SQL-like analog. For instance, a query that makes use of some ontology in identifying which species listed in the target dataset are a member of the queried taxonomic class or something (e.g. see our earlier thread: https://github.com/ropensci/RNeXML/issues/20#issuecomment-29642194 ). Maybe that would be involved in the use case you already described.

Will give a thought to some good published data examples.

rvosa commented 10 years ago

Reasoning would be really great but might be hard to demonstrate - do you know of any reasoning engines that are exposed to R?

rvosa commented 10 years ago

With commit a7c8ffd I have added some example data which I believe might be interesting to demonstrate (recursive?) SPARQL queries.

The NeXML file primates.xml contains a supertree of the Primates. The otus block contains both the terminal taxa and the higher taxa (genus through order). The nodes in the tree link to these taxa, so interior nodes may also have otu attributes that correspond with taxa (provided the tree makes these taxa monophyletic).

The general idea is that we should be able to query for all the members of a higher taxon - so given the URI of the higher taxon, give me all the direct descendants that specify rdfs:subClassOf for that taxon. Secondly, it might be nice to then be able to extract the subtree for those taxa (and plot it?), or show recursive calls to traverse the taxonomy.
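A rough sketch of what the "direct descendants" query might look like when built from R. It assumes the children point at their parent with rdfs:subClassOf, as described above; the helper name children_query and the parent URI are hypothetical, and no particular RDF library is assumed for running the query.

```r
# Build a SPARQL query for the direct children of a higher taxon.
# Assumption: each child links to its parent via rdfs:subClassOf.
children_query <- function(taxon_uri) {
  sprintf(
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?child WHERE {
  ?child rdfs:subClassOf <%s> .
}",
    taxon_uri
  )
}

# Hypothetical parent URI, used only for illustration
query <- children_query("http://ncbi.nlm.nih.gov/taxonomy/9443")
cat(query, "\n")
```

The resulting string would be handed to whatever SPARQL engine is available; calling the same helper on each returned child is what makes the traversal recursive.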

Unfortunately, there appear to be some bugs in how the RDF is extracted. In particular, the namespace prefixes are not extracted correctly in the file primates_meta.xml.

What we should be getting is:

xmlns:concept="http://rs.tdwg.org/ontology/voc/TaxonConcept#"
<concept:toTaxon rdf:resource="http://ncbi.nlm.nih.gov/taxonomy/34827"/>

But instead we are getting:

xmlns:ns1="concept:"
<ns1:rank rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonRank#Species"/>

I gather that this RDF is obtained by posting the NeXML to a web service, so its output is out of our control. I would like to suggest an alternative that could build on commit e3845d6. In that commit I have added an XSL stylesheet that extracts RDF/XML from RDFa. The output it produces is valid, and we should be able to run it locally, probably with better performance. However, this means we would create a dependency on a library that can process XSL stylesheets, such as this one: http://www.omegahat.org/Sxslt/

rvosa commented 10 years ago

With commit d61a0c5 I have added an example that shows how we can query the valid RDF/XML that the XSL stylesheet produces. The example shows how you can fetch the taxon whose taxonomic rank is "Order", and return the corresponding NCBI taxon URI. Subsequently, with that URI, the example shows how to fetch its children.

A person that actually knows R (so, not me ;-)) would be able to take these examples to write a simple recursive traversal from the root to the tips. As the URIs of the subjects in this graph are constructed from the id attributes in the input NeXML it ought to be possible to get the taxa and tree nodes that correspond with these RDF subjects, e.g. to extract subtrees and plot them.

rvosa commented 10 years ago

I played around with sparql.R a bit more. It is failing, but I hope someone will be able to get the recursion to work so it generates a newick string which we then plot. Bonus points if the newick string can have the taxon names from the original NeXML.
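The recursion itself can be sketched without an RDF store. In the sketch below a small, hypothetical child/parent table stands in for the SPARQL results, and get_children is a hypothetical helper that plays the role of one query per taxon. This is not the code in sparql.R, just an illustration of the idea.

```r
# Hypothetical child -> parent table standing in for SPARQL query results
taxonomy <- data.frame(
  child  = c("Hominidae", "Homo", "Pan",
             "Homo_sapiens", "Pan_troglodytes", "Pan_paniscus"),
  parent = c("Primates", "Hominidae", "Hominidae",
             "Homo", "Pan", "Pan"),
  stringsAsFactors = FALSE
)

# Children of one taxon (the step a SPARQL query would perform)
get_children <- function(taxon) taxonomy$child[taxonomy$parent == taxon]

# Recursively serialize the subtree below `taxon` as a Newick fragment
recurse <- function(taxon, label_nodes = TRUE) {
  kids <- get_children(taxon)
  if (length(kids) == 0) return(taxon)  # a tip: just its name
  inner <- paste(
    vapply(kids, recurse, character(1), label_nodes = label_nodes),
    collapse = ","
  )
  paste0("(", inner, ")", if (label_nodes) taxon else "")
}

newick <- paste0(recurse("Primates"), ";")
cat(newick, "\n")
# (((Homo_sapiens)Homo,(Pan_troglodytes,Pan_paniscus)Pan)Hominidae)Primates;
```

Note that a taxon with a single child, such as Homo here, comes out as an unbranched interior node, and the string carries no branch lengths.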

cboettig commented 10 years ago

Very cool!! Look forward to digging in to your example when I'm back.


rvosa commented 10 years ago

As of 81da59b, the RDF/XML taxonomy is traversed by recursive SPARQL queries, whose results are serialized to a Newick string with unbranched interior nodes, no branch lengths, and (optionally) interior node labels. In other words: it's a classification tree, which can be plotted as a cladogram, as the example shows. I think this would be a pretty neat use case for the supplementary materials: it's a bit too long (72 lines) to put in the MS body. To clean this up I am going to need a little more help, still:

cboettig commented 10 years ago

Very nice. I've just updated get_rdf, and will:

  • go over the code idioms to make the example a bit more native.
  • I will also add this to the manuscript appendix (referencing appropriately from the SPARQL section).
  • Then I can move sparql.R into a demos/ directory (which is the usual place for such things in R packages, allowing them to be run interactively from the command line; inst/examples is a more generic dumping ground for things that aren't necessarily R scripts).

rvosa commented 10 years ago

Excellent! Sorry I don't know the conventions (yet), but it's fun to learn them.

cboettig commented 10 years ago

@rvosa I was just thinking about trying to make the figure generated by sparql.R a bit easier to read but am running into trouble. My first thought was to plot just the internal node names (higher taxa levels), which would mean fewer labels crowding the plot and also make it clear that the cladogram just reflected the taxonomy.

I followed the suggestion in your code about adding get_name(id) to the recurse function definition so I have a Newick tree with internal nodes labeled, but that seems to be giving me a Newick tree that I cannot parse for some reason. Maybe you can have a quick look? Thanks much!

cboettig commented 10 years ago

@rvosa For quick reference, here's the Newick file I get when trying to add the node labels; not sure why it fails to parse (either using phylobase::readNewick, which uses the nexus class library, or using phytools::read.newick): https://github.com/ropensci/RNeXML/blob/96add29b379748a6dae302c483e6bbaf25297a7e/inst/examples/sparql.newick

rvosa commented 10 years ago

The tree description is valid in principle (you can paste it into figtree, for example), but some of the newick parsers that I've played around with seem to be picky when i) there are no branch lengths; ii) there are "unbranched" interior nodes; iii) there are node labels.
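To check whether a given string has the second of these features, one option is to count the parenthesis pairs that contain no comma at their own nesting level. The function below is a hypothetical base-R helper, not part of any package, and it assumes labels contain no quoted parentheses or commas.

```r
# Count "unbranched" interior nodes: parenthesis pairs with exactly one child
count_singletons <- function(newick) {
  chars <- strsplit(newick, "")[[1]]
  commas <- integer(0)  # stack: comma count for each currently open "("
  singles <- 0L
  for (ch in chars) {
    if (ch == "(") {
      commas <- c(commas, 0L)
    } else if (ch == "," && length(commas) > 0) {
      commas[length(commas)] <- commas[length(commas)] + 1L
    } else if (ch == ")" && length(commas) > 0) {
      # no comma at this level means the node has a single child
      if (commas[length(commas)] == 0L) singles <- singles + 1L
      commas <- commas[-length(commas)]
    }
  }
  singles
}

count_singletons("(1,(2,3));")    # 0
count_singletons("((1),(2,3));")  # 1
```

A non-zero count suggests the string may trip up parsers that expect every interior node to have at least two children.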

cboettig commented 10 years ago

@fmichonneau Maybe you have some idea why I can't parse this Newick file successfully in R? e.g. with phylobase:

download.file("https://github.com/ropensci/RNeXML/raw/96add29b379748a6dae302c483e6bbaf25297a7e/inst/examples/sparql.newick", "sparql.newick", "wget")
readNewick("sparql.newick")

Gives me:

Warning:  
 A TAXA block should be read before the TREES block (but no TAXA block was found).  Taxa will be inferred from their usage in the TREES block.
at line 1, column (approximately) 5105 (file position 5104)
storing implied block: TAXA
storing read block: TREES
Error: index out of bounds
In addition: Warning message:
In FUN(X[[1L]], ...) : NAs introduced by coercion

though it seems like a valid tree (e.g. can be read into figtree)...

fmichonneau commented 10 years ago

I think this is a bug in ape. (Unfortunately, phylobase still relies on ape to parse the tree string: phylobase uses NCL to extract information about the taxa, branch lengths, labels, etc., but relies on ape to convert the parentheses and commas into an R object.) Apparently, ape doesn't support edge labels on terminal edges. To have edge labels on terminal edges, taxa need to be in parentheses by themselves, like so: (Avahi_laniger)Avahi,(... However, ape fails on this form:

ape::read.tree(text="(1,(2,3));")

gives


Phylogenetic tree with 3 tips and 2 internal nodes.

Tip labels:
[1] "1" "2" "3"

Rooted; no branch lengths.

But

ape::read.tree(text="((1),(2,3));")

gives

Error in if (sum(obj[[i]]$edge[, 1] == ROOT) == 1 && dim(obj[[i]]$edge)[1] >  : 
  missing value where TRUE/FALSE needed

This works with the phytools parser:

 phytools::read.newick(text="((1),(2,3));")

but the string from the example doesn't work (R hangs).

I reported the ape bug to Emmanuel.

cboettig commented 10 years ago

Okay, thanks for taking a look. Yeah, I'd given phytools a try too and I ping'd Liam about the issue. Keep me posted if you figure anything out from Emmanuel, but nothing mission critical here.

cboettig commented 10 years ago

Okay, with Liam's bugfix http://blog.phytools.org/2014/07/new-version-of-readnewick-that-can-read.html we can read the tree in and just plot internal node labels to avoid over-crowding the figure (see https://github.com/ropensci/RNeXML/blob/devel/manuscripts/supplement.Rmd#L330)
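A minimal sketch of plotting only the internal node labels, assuming the ape package is installed. The Newick string is a hypothetical toy tree, and this is not the code from the supplement.

```r
# Hypothetical toy tree with labelled interior nodes and no branch lengths
newick <- "((A,B)Genus1,(C,D)Genus2)Family;"

if (requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::read.tree(text = newick)
  # Hide the tip labels so only the higher taxa are shown
  plot(tree, show.tip.label = FALSE)
  ape::nodelabels(tree$node.label)
}
```

Dropping the tip labels keeps the figure readable and makes it clear that the cladogram reflects the taxonomy only.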

I think we have a nice sparql use case now. We could possibly use a bit more text around this example, but I'll wait for others to weigh in.