scienceai / RJSONLD

Export results of standard analytics to JSON-LD format

data too? #2

Open sckott opened 10 years ago

sckott commented 10 years ago

Curious if you plan on supporting not just the analysis results, but the data as well? Seems right now like this supports only analysis results. Maybe I'm missing the data part, or perhaps including support for data is opening up a big bag since you could be dealing with GB/TB of data?

tiffbogich commented 10 years ago

@sckott we support code and data (either by hosting it on our registry or pointing to it when it's hosted on another registry, like genbank, for example). Our stack is in node so streaming is good and we can handle data big and small. We store data on S3 and the metadata/indexing on couchdb (via cloudant) for now. Sorry if that wasn't clear! We will definitely have to think about whether we really want to be hosting really large data, very good point :) Ideally we just store the metadata for this and point to it wherever it lives

sckott commented 10 years ago

thanks @tiffbogich - Not sure where data is included though. For example, in the example call

RJSONLD.export(lm(iris$Petal.Length~iris$Sepal.Length), path = "irisLM.jsonld")

The output jsonld file doesn't include the iris dataset. Is there an option to include it?

JDureau commented 10 years ago

Hi Scott,

Thanks for your question. To complete what Tiff answered, in the context of RJSONLD more specifically: this package targets objects that are created and live in R and have no standard way to be exported and shared on the web, like analysis results. It can generate a JSON-LD object out of the analysis results because all of the semantics are already there; we just change the format and make things explicit/standard in cases where R relies on implicit/ad-hoc descriptions (contrasts in ANOVAs, for example).

Data can generally be exported as a CSV, which more generic tools can handle (ldpm, for example). Also, a dataset per se lacks semantic information on what it represents, so RJSONLD would not do much more than RJSONIO. To solve that, ldpm has a wizard that asks the user for general meta-information. To make that process easier, we're working on a graphical interface too.
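
As a minimal sketch of the CSV route mentioned above, base R alone is enough to write a data frame out in a form that generic tools can read. The file location here is only an illustration (a temporary directory), not something ldpm or RJSONLD requires.

```r
# Write the built-in iris data frame to a CSV file so that generic,
# language-agnostic tools can pick it up. The path is illustrative.
csv_path <- file.path(tempdir(), "iris.csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read the file back to check that the table kept its shape.
iris_back <- read.csv(csv_path)
dim(iris_back)
```

Note that the CSV carries only column names and values; what the columns mean still has to be described separately.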

JDureau commented 10 years ago

Regarding your second question, the JSON-LD does not contain the data. The way we see this is that the irisLM.jsonld file would have an isBasedOnUrl mention pointing at the iris data. For this, you need the data to have a URL.

I could add an option to pass such URLs in the call to RJSONLD, actually. You simply need to give your data a URL, which is exactly what we're trying to do with ldpm and our website.
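
To make the idea concrete, here is a rough sketch of what a document carrying an isBasedOnUrl reference could look like. It uses the jsonlite package rather than RJSONLD itself; the dataset URL, the `@context` and `@type` values, and the overall document shape are all assumptions for illustration, and the real RJSONLD output has its own structure.

```r
library(jsonlite)

# Hypothetical URL for the dataset; replace with wherever the data really lives.
data_url <- "https://example.org/datasets/iris.csv"

# Stand-in for an exported analysis document (NOT the actual RJSONLD output).
# Only the isBasedOnUrl field is the point of this sketch.
doc <- list(
  `@context` = "https://schema.org",
  `@type` = "CreativeWork",
  name = "Linear model of Petal.Length on Sepal.Length",
  isBasedOnUrl = data_url
)

# Serialize to JSON; auto_unbox keeps single values as scalars, not arrays.
json <- toJSON(doc, auto_unbox = TRUE, pretty = TRUE)
cat(json)

# Parse it back to check that the reference survives the round trip.
parsed <- fromJSON(json)
```

The file then stays small regardless of how large the dataset is, since it only points at the data.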

sckott commented 10 years ago

For metadata: Any plans to handle arbitrary objects (seems that mostly statistical modeling output objects are handled now)? For example, if a user has a data.frame that holds 10 columns and 100 rows, could RJSONLD allow a user to easily specify metadata for each column and for the dataset as a whole to go into the jsonld output (along with the URL for the dataset itself, as you mentioned)? I guess at least the schema you refer to on your readme deals specifically with statistics though, so perhaps metadata for datasets is out of scope?

On data: Cool, sounds good to reference with isBasedOnUrl
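
To illustrate the kind of per-column metadata asked about above, here is a rough sketch in plain R with jsonlite. The helper `describe_dataset`, its field names, and the example URL are all hypothetical; none of this is part of RJSONLD or tied to any particular vocabulary.

```r
library(jsonlite)

# Hypothetical helper (not an RJSONLD function): bundle dataset-level and
# column-level metadata for a data frame into a nested list.
describe_dataset <- function(df, name, url, descriptions) {
  # Require exactly one description per column.
  stopifnot(setequal(names(descriptions), names(df)))
  columns <- lapply(names(df), function(col) {
    list(
      name = col,
      type = class(df[[col]])[1],
      description = descriptions[[col]]
    )
  })
  list(name = name, url = url, rows = nrow(df), columns = columns)
}

meta <- describe_dataset(
  iris[, c("Sepal.Length", "Species")],
  name = "iris (subset)",
  url = "https://example.org/datasets/iris.csv",  # hypothetical URL
  descriptions = list(
    Sepal.Length = "Sepal length in cm",
    Species = "Iris species"
  )
)

cat(toJSON(meta, auto_unbox = TRUE, pretty = TRUE))
```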

JDureau commented 10 years ago

For the moment, data frames / tabular data can be handled with a two-step process:

The focus in RJSONLD has so far been on generic objects that meet the following two criteria:

If you have ideas of such objects, suggestions (or pull requests) are welcome!

sckott commented 10 years ago

hmm, geoJSON (for spatial data in json format) comes to mind, though I guess rgdal package has writeOGR(...) to write out .geojson files