ropensci / unconf14

Repo to brainstorm ideas (unconference style) for the rOpenSci hackathon.

Ensuring Interoperability #20

Open karthik opened 10 years ago

karthik commented 10 years ago

Interoperability between rOpenSci data sources: we come back to this theme frequently, but have yet to work it out. Our efforts match each data provider's model wherever possible, but it would be great to allow researchers to seamlessly work with data from anywhere (e.g. climate data, time series data, species occurrence data). Ideas along the lines of dplyr's philosophy would be most welcome.

Moving from #18

karthik commented 10 years ago

From @mfenner

As the developer behind one of the APIs that rOpenSci is using (alm), I would be interested in an interoperability discussion. There is the client-tools side of it, i.e. how you write the R package, but there is also the API side, i.e. what best practices we can recommend (and that I would be happy to incorporate). Even a very short list of the latter would be a good start, and could be based on some of the major pain points you have when writing R libraries to talk to these APIs.

karthik commented 10 years ago

I'm happy to work together on this. Over the past year several data providers have been kind enough to ask us for feedback. It would be great for us to write up a set of guidelines or best practices we can share with these folks.

jeroen commented 10 years ago

Most APIs do this naturally, but when data are provided without a schema it is important for type safety that the data are structurally consistent. This is explained briefly in the context of JSON in section 3 of this paper.
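
To make that concrete, a minimal sketch (assuming the jsonlite package; the records are made up) of how structural inconsistency breaks type safety once JSON is simplified into R:

```r
library(jsonlite)

# Structurally consistent records: every object has the same fields with the
# same types, so simplification yields a data frame with stable column types.
consistent <- '[{"name": "Pinus", "count": 2}, {"name": "Quercus", "count": 5}]'
str(fromJSON(consistent))
# 'data.frame': 2 obs. of 2 variables; count is an integer column

# One record switches the type of "count", so the whole column silently
# degrades to character and downstream numeric code breaks.
inconsistent <- '[{"name": "Pinus", "count": 2}, {"name": "Quercus", "count": "five"}]'
str(fromJSON(inconsistent))
```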

cboettig commented 10 years ago

@jeroenooms Thanks for the link, that's quite a nice treatment of the issues here.

I agree that this is pretty straightforward when data providers use a consistent schema, though we've already seen changes to schemas in the APIs we work with, which have required manual updates on our end. With SOAP-based systems one could potentially re-generate the mapping to R data structures automatically. In principle we can automatically generate and re-generate R classes from an XML schema declaration, though the XMLSchema package doesn't seem to handle many of the larger scientific schemas we've tried recently. So even in these cases there are still challenges.

I think this issue arises more clearly in interoperability between our own packages. Even when each package follows a consistent data model / schema, those schemas differ across packages. Imagine you want to write a function that searches for data from a particular author, or data on a particular species, across the suite of rOpenSci packages. That query is possible right now in dozens of our packages, but it is constructed differently in each, in a way you cannot predict without looking at the package manual. Something we can overcome, I think, but it requires more than just a consistent schema within each package.
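
For illustration only, a hypothetical sketch of the kind of consistency I mean: one generic whose user-facing call is identical across providers, with each backend free to build its own provider-specific request. None of these functions or classes exist in our packages.

```r
# Invented names throughout -- a sketch of the idea, not an API.
ro_search <- function(provider, author = NULL, species = NULL, ...) {
  UseMethod("ro_search")
}

# Backend A expects a single query string
ro_search.provider_a <- function(provider, author = NULL, species = NULL, ...) {
  list(q = paste0("author:", author, " AND species:", species))
}

# Backend B expects separate named parameters
ro_search.provider_b <- function(provider, author = NULL, species = NULL, ...) {
  list(creator = author, taxon = species)
}

# The call the user learns is the same either way:
ro_search(structure(list(), class = "provider_a"), author = "Darwin", species = "Pinus")
ro_search(structure(list(), class = "provider_b"), author = "Darwin", species = "Pinus")
```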

eduardszoecs commented 10 years ago

With SOAP-based systems one could potentially re-generate the mapping to R data structures automatically.

Note that SOAP support in R doesn't work very well :( See for example issue https://github.com/ropensci/taxize/issues/213 from taxize.

sckott commented 10 years ago

I have seen a few recent packages that apparently depend on SSOAP, e.g., http://cran.r-project.org/web/packages/pubmed.mineR/index.html - so at least it has worked for them

eduardszoecs commented 10 years ago

Haven't checked, but this also seems to be OS-specific (only working on Windows).

mfenner commented 10 years ago

I don't like SOAP at all, far too complex. I'm happy everyone is moving to REST. Something to look at is API Blueprint.

sckott commented 10 years ago

Completely agree that REST is easier to use. Some APIs we'd like to create R packages for, though, only have a SOAP service.

cboettig commented 10 years ago

@EDiLD @mfenner @sckott Didn't mean to derail the discussion with a mention of SOAP. After all, we don't really get to pick what data providers will do, and the key issues are the conceptual ones, not details of syntax (e.g. we want data providers to use a consistent and described schema for the data; it's a rather secondary concern whether it's JSON or XML, REST, SOAP, XMLRPC, or whatever).

I think this issue should focus on interoperability between data providers and thus between R packages that access them (like @sckott has done with taxize, or we're now doing with the spocc package etc), which isn't really addressed simply by stating each data provider has a RESTful API.

karthik commented 10 years ago

@cboettig I agree that the larger issues go beyond SOAP versus REST, but I'm still not entirely clear on what you actually mean by interoperability between data providers and how that facilitates what we are trying to do (perhaps this discussion is best had in person since we're only two weeks out). At least in the case of spocc we have an opportunity to implement such an idea as a proof of concept, since all the data providers are amenable to standardizing their schemas (most already are).

cpsievert commented 10 years ago

I don't have much experience with SOAP, REST, etc., but I think some principles that I've built into XML2R might be helpful for this issue. The first principle is that any XML structure can be represented as a (flat) list of "observations", where the list names track each observation's location in the XML hierarchy. If other "structures" like JSON, SOAP, REST, etc. fit this principle, we could build a series of methods that return a list of "observations" (see the sketch at the end of this comment). From there, one could choose from a set of manipulation verbs that operate on those observations. For example, XML2R has an add_key function which preserves parent-to-child relationships before collapsing observations into a list of tables.

I guess this idea might not lead to interoperability for the end user, but it could at least provide consistent semantics for developers :)

EDIT: Thanks to @cboettig for pointing out that XML2R is not a great paradigm if one does not intend to return/store tabular structures...
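
A rough sketch of that first principle using the XML package directly (not XML2R's actual implementation): flatten nodes into a named list of attribute vectors, with each name recording the node's place in the hierarchy.

```r
library(XML)

doc <- xmlParse('<players>
                   <player name="A"><hit bases="2"/></player>
                   <player name="B"><hit bases="1"/><hit bases="4"/></player>
                 </players>', asText = TRUE)

# Build a path such as "players//player//hit" by walking up the parents
node_path <- function(n) {
  path <- xmlName(n)
  p <- xmlParent(n)
  while (!is.null(p) && inherits(p, "XMLInternalElementNode")) {
    path <- paste(xmlName(p), path, sep = "//")
    p <- xmlParent(p)
  }
  path
}

nodes <- getNodeSet(doc, "//player | //hit")
obs <- lapply(nodes, xmlAttrs)        # each observation is a named vector of attributes
names(obs) <- sapply(nodes, node_path)
obs
```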

cboettig commented 10 years ago

The language a structure is encoded in is not really the relevant issue here; I'm sorry if I introduced that tangent. It does very little for "interoperability" to translate between XML, JSON, and R representations of the same data structure/data model if that data structure differs from the one another provider uses.

One simple example comes from the CrossRef vs DataCite schemas: two nice RESTful APIs that serve schema-valid data in XML or JSON, but expressed using different data structures! If I write a function for CrossRef that takes a DOI and converts it to R's native bibentry object, I cannot expect that function to just work with citation information coming from DataCite, even though most users are probably not aware of the difference between a CrossRef DOI and a DataCite DOI. Such a function is thus not "interoperable".

By mapping each of the data structures onto R's bibentry object, we make the data formats "interoperable" with the suite of existing R tools for handling bibentry objects; but coding that conversion seems to me like a rather manual process that I don't think can be automated by a tool like XML2R, as it involves some fuzzy decisions: the bibentry data structure does not map 1:1 onto the CrossRef data structure, etc. And that is the real trick for interoperability -- it's not that we simply need standard formats (that's just syntax we can automate our way around); it's the lack of common data models that is really the challenge here. In principle this is the problem that semantic reasoning a la OWL etc. is designed for, but I think in this context a more direct set of standards would be a simpler and more robust solution.
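
For concreteness, a minimal sketch of that manual mapping step. The provider records below are invented stand-ins (not the real CrossRef or DataCite field names); the point is that each source needs its own hand-written conversion onto R's bibentry class, with fuzzy choices along the way.

```r
# Invented example records standing in for parsed API responses
crossref_like <- list(title = "A paper", author = "A. Author",
                      journal = "Some Journal", year = "2013",
                      doi = "10.9999/example.1")
datacite_like <- list(titles = "A dataset", creators = "A. Author",
                      publisher = "Some Repository", publicationYear = "2013",
                      identifier = "10.9999/example.2")

as_bibentry_crossref <- function(x) {
  bibentry(bibtype = "Article", title = x$title, author = x$author,
           journal = x$journal, year = x$year, doi = x$doi)
}

as_bibentry_datacite <- function(x) {
  # fuzzy decision: a dataset has no "journal", so fall back to "Misc"
  bibentry(bibtype = "Misc", title = x$titles, author = x$creators,
           year = x$publicationYear, publisher = x$publisher,
           doi = x$identifier)
}

as_bibentry_crossref(crossref_like)
as_bibentry_datacite(datacite_like)
```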

gavinsimpson commented 10 years ago

Another aspect of interoperability is the rOpenSci interface to those data sources. Rather than users learning the intricacies of n packages, there is scope to generalise some of the functionality across all packages or groups of packages (as has been done in taxize or spocc). This applies to the interface the package provides to the user at the R end, but also to how the data are stored/structured when returned to the user.

Are there common areas of code that could be packaged separately, say as an ropensci package upon which the others could depend/import? Or key extractor/summary functions that could be turned into simple methods, with classes for the different data sources provided by the respective packages? I'm thinking of things like summarising returned records (how many, etc.), API key handling, and such like (sketched below). Then, as the range of rOpenSci packages increases, package authors don't need to reinvent the wheel each time or duplicate code.

Not sure if this discussion/issue is the right place for this particular aspect, but I thought I'd raise it here rather than keep schtum.
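
To make that concrete, a hypothetical sketch (all names invented) of two such shared pieces: an API-key helper and a common print method for returned records, living in one package the others import.

```r
`%||%` <- function(a, b) if (is.null(a)) b else a

# Look for a key in the call first, then in an environment variable
# such as FOO_KEY for service "foo"
ro_api_key <- function(service, key = NULL) {
  key <- key %||% Sys.getenv(paste0(toupper(service), "_KEY"), unset = "")
  if (!nzchar(key)) stop("No API key found for ", service, call. = FALSE)
  key
}

# A shared summary of returned records, so every package prints the same way
print.ro_records <- function(x, ...) {
  cat("Source:", x$source, "\n")
  cat("Total results on the server:", x$found, "\n")
  cat("Number of results retrieved:", nrow(x$data), "\n")
  invisible(x)
}
```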

sckott commented 10 years ago

That's right on the mark @gavinsimpson, thanks for your thoughts. Agreed: consistency in both A) the "UI" (the function interface users see) and B) the data structures returned to the user.

We have played around with a universal package that we could depend on in all our packages, see https://github.com/ropensci/ropensciToolkit, but as you can imagine this is a hard thing to agree on. It would be great to get insight from you and others who are smarter than us about how to do this right.

emhart commented 10 years ago

We've definitely talked about this; we have another issue about consolidation as well (see issue #6). But right now we enforce interoperability ourselves, e.g. we translate GBIF and iNat and ecoengine results into a common format. I'm not sure there is a better way to do this other than us just going through each source and developing a common S3 class.
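
Something along these lines (a toy sketch with made-up field names, not our actual code): one constructor for the shared class, plus a small normaliser per provider mapping its field names onto the agreed columns.

```r
# Constructor for the shared class
as_occ <- function(df, source, found = nrow(df)) {
  structure(list(source = source, found = found, data = df),
            class = "ro_records")
}

# One normaliser per provider, onto the agreed columns (name, longitude, latitude)
from_gbif_like <- function(x) {
  as_occ(data.frame(name = x$scientificName,
                    longitude = x$decimalLongitude,
                    latitude = x$decimalLatitude), source = "gbif")
}
from_inat_like <- function(x) {
  as_occ(data.frame(name = x$taxon_name,
                    longitude = x$longitude,
                    latitude = x$latitude), source = "inat")
}

from_inat_like(list(taxon_name = "Pinus ponderosa",
                    longitude = -121.5, latitude = 45.3))
```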

cboettig commented 10 years ago

:+1: to @gavinsimpson's points; I think that's the real key here. Sometimes there is an obvious R data structure to map onto (e.g. a bibentry object, or mapping phylogenetic data to an ape::phylo, or spatial data to the various spatial structures already supported in R), though even then these objects often lack all the metadata we'd want. Then there's the metadata about the API communication, which we could be much more consistent about; perhaps just consistently returning httr response class objects would be sufficient. Too often our packages aren't returning that metadata.
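
As a sketch of that last point (the wrapper name is invented; httr and jsonlite assumed): keep the raw httr response object alongside the parsed data, so the request metadata is always available to the user.

```r
library(httr)
library(jsonlite)

ro_get <- function(url, ...) {
  resp <- GET(url, ...)
  stop_for_status(resp)
  list(
    data     = fromJSON(content(resp, as = "text")),
    response = resp   # status code, headers, timings stay available downstream
  )
}
```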

The real challenge is in the data structures themselves. This problem extends far beyond us, and there are two standard ways to tackle it: vertical integration of all the 'lowest-common-denominator' data, such as species names, geographical coordinates, any relevant citation data, and so forth; or the metadata-driven approach sensu Jones et al. We've talked a bit about this before, and I think, as with bibentry and httr response objects above, it's best we don't reinvent the wheel for a common data structure but instead build on existing metadata standards such as EML, NeXML, NBII, BDP, ISO 19139, and RDF semantic representations. In a sense this approach is much easier for integrating a few dozen large repositories than for integrating the arbitrary formats of individual researchers.

gavinsimpson commented 10 years ago

@emhart Whether you want a formal S3 class or just a constructor as.foo() function to facilitate this, having it implemented and used across two or more packages would be beneficial. You'll never stuff all data types into a single, all-encompassing class, but the rOpenSci project is sufficiently established now, I would have thought, to do a little retrospective and see what could be made common across all packages, and what could be done for, say, a group of related ones.

Knowing that an rOpenSci package always returned, say, a list with a $response element that was an R object of a particular class would be a good start towards interoperability at the R level, for example.

Thanks @sckott - Will take a look at that to familiarise myself with what you've looked at already.

emhart commented 10 years ago

@gavinsimpson I agree, and as you pointed out we're part way there with spocc and taxize, insofar as we took @cboettig's vertical integration approach with all the lowest-common-denominator pieces of information. We've also worked a bit on converting from our data structures to more widely used formats. For instance, in both rWBclimate and spocc we have functions that convert our objects into classes from the sp package. This allows us to have some interoperability between those two sources. But most of this has been haphazard at best, and we've really just operated ad hoc. I think if nothing else we could outline some best practices here. This particular aspect might fit nicely with issue #23 that @hadley raised.
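
For example, a minimal sketch (assuming the sp package; column names are made up) of that kind of conversion: once two packages return plain data frames with agreed coordinate columns, one helper turns either into an sp object.

```r
library(sp)

occ_to_sp <- function(df, lon = "longitude", lat = "latitude") {
  # promote the data frame to a SpatialPointsDataFrame using the named columns
  coordinates(df) <- as.formula(paste("~", lon, "+", lat))
  proj4string(df) <- CRS("+proj=longlat +datum=WGS84")
  df
}

pts <- occ_to_sp(data.frame(name = "Pinus ponderosa",
                            longitude = -121.5, latitude = 45.3,
                            stringsAsFactors = FALSE))
class(pts)  # SpatialPointsDataFrame
```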

@cboettig I think the metadata-driven approach falls into the "nice idea" category, but I just don't see an easy way for us to implement it. Even for something like writing EML for spocc returns, we can define data table headers, but given the way the standard is structured the meaning of those headers is still arbitrarily defined by us, as opposed to strictly enforced by a standard like DwC. RDF semantic expressions sound awesome, but are there any APIs that even return them? I think our problems with this approach just reflect the shortcomings of the larger community. Yes, it'd be awesome to use it, but until ecology figures its shit out on this front it's hard for us to do much. I'll add a caveat and say that in specific arenas these things are a bit more figured out -- I think DwC is a perfect example. There's no reason we couldn't convert spocc S3 classes to valid DwC archives. Maybe this could be done for NeXML too (I don't know enough to say). But to me these are edge cases, so we may want to try the vertical integration approach since it is easiest for us (and still hard to do).

cboettig commented 10 years ago

@emhart Yes, I think you've hit the nail on the head. A metadata standard doesn't let us do the integration until we use common definitions instead of free-form text -- ideally definitions that we can express in formal semantics. But I think this is in fact the ideal use case in which to do so, because after all it will be our packages writing the header definitions, so all we have to do is write them consistently (see the sketch below). This case is much easier than having a user hand you EML with whatever arbitrary description they want. It is logically identical to defining an S3 class or any other data structure in R, where one relies on consistent use of names. And as you say, ideally we should do this using established ontologies where possible (e.g. to say that a given column contains species names), though as long as we are consistent internally we will at least be interoperable internally (and could even provide our own semantic definitions for our column classes eventually). Doing this semantically is easier in NeXML, but I think it is practical for EML as well.

I think we should pursue both strategies -- the vertical integration has obvious and immediate value -- but I don't think the metadata approach is out of reach for us.
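
One way to picture "write the header definitions consistently" (an illustrative sketch, not a worked-out vocabulary): a single shared lookup from our internal column names to established term definitions, here Darwin Core term URIs, that every package consults when writing metadata.

```r
ro_terms <- c(
  name      = "http://rs.tdwg.org/dwc/terms/scientificName",
  latitude  = "http://rs.tdwg.org/dwc/terms/decimalLatitude",
  longitude = "http://rs.tdwg.org/dwc/terms/decimalLongitude"
)

# Column descriptions written from the shared lookup, never free-form text
describe_columns <- function(df) {
  data.frame(column     = names(df),
             definition = unname(ro_terms[names(df)]),
             stringsAsFactors = FALSE)
}

describe_columns(data.frame(name = "Pinus", latitude = 45.3, longitude = -121.5))
```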

karthik commented 10 years ago

Another aspect of interoperability is the rOpenSci interface to those data sources. Rather than users learning the intricacies of n packages, there is scope to generalise some of the functionality across all packages or groups of packages (as has been done in taxize or spocc). This applies to the interface the package provides to the user at the R end, but also to how the data are stored/structured when returned to the user.

Great point @gavinsimpson -- that was the original idea we discussed more than 2 years ago. It was a little too idealistic to have just one ropensci package (that was back when we had like 4 packages, heh) where we could provide a single portal with the individual packages plugged into the back end. It would certainly help reduce the cognitive load on users. The sheer number of packages (and more in the pipeline) can just be overwhelming.

But one thing that's become apparent is that we need fewer packages, each encompassing a small theme. We already have, for example, spocc, which provides access to half a dozen other packages. Even with this we've had some growing pains, but we are learning.

I'm hoping to push similarly for one that is a data pipeline to various repositories, starting with the ones we already work with like figshare, dvn, and soon dataone and zenodo.

The users can simply change the destination and not worry about the internal API. I think this would be a worthwhile effort to accomplish in 2014.

Similarly, all the literature and metadata packages (rplos, RMendeley, ArXiv, etc) could be grouped into one.

But maybe we can use this opportunity to come up with best practices and guidelines before diving in further and having to rework a bunch of plumbing?

I'm thinking things like summarising returned records (how many etc), API key handling, and such like. So as the range of ropensci packages increases, package authors don't need to reinvent the wheel each time or duplicate code.

Exactly my thought too. That was really the motivation for including as much information in the return object as possible, to allow others to retrieve the same data from the API:

> pinus <- ecoengine::ee_observations(scientific_name = "Pinus")
> pinus
Total results on the server: 48333
Args:
country = United States
scientific_name = Pinus
georeferenced = FALSE
page_size = 25
page = 1
Type: observations
Number of results retrieved: 25

and that's what's available from spocc. Glad to hear you support the same approach.

cboettig commented 10 years ago

This idea is now listed on the project page with some rough outline possibilities. https://github.com/ropensci/hackathon/wiki/Projects