ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

Random access/queriable serialization format for R objects #37

Open gmbecker opened 9 years ago

gmbecker commented 9 years ago

Serialized R objects are everywhere, from the files cluttering our workspaces to the data shipped with packages. Currently, however, such objects are "all or nothing": to get any piece of a saved object, or even to determine which objects are saved in a particular rda/RData file, we have to load the whole thing into memory.

It would be nice to have a serialization format amenable to inspection and "random" (in the access sense) subset retrieval.

Packages such as bigmemory offer something like this for matrices, but I'm talking about a general solution which could act as a swap-in replacement for save().

Self-describing data formats such as Avro https://avro.apache.org/ and some form of external indexing akin to tabix are two approaches that seem promising.
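To make the external-indexing idea concrete, here is a minimal sketch in base R. It is only an illustration under stated assumptions, not a proposed format: `save_indexed()`, `list_indexed()`, `load_indexed()` and the `.idx` sidecar file are hypothetical names invented for this example. Each object is serialized on its own and appended to one file, while a small index records its name, byte offset and length, so the contents can be listed and a single object read back without touching the rest.

```r
# Hypothetical helpers for illustration; not part of base R or any package.

# Serialize each element of a named list separately into `path`, and write
# a small index (name, byte offset, byte length) to `<path>.idx`.
save_indexed <- function(objects, path) {
  stopifnot(is.list(objects), !is.null(names(objects)))
  con <- file(path, "wb")
  on.exit(close(con))
  index <- data.frame(name = names(objects), offset = NA_real_,
                      length = NA_real_, stringsAsFactors = FALSE)
  offset <- 0
  for (i in seq_along(objects)) {
    blob <- serialize(objects[[i]], connection = NULL)  # raw vector
    writeBin(blob, con)
    index$offset[i] <- offset
    index$length[i] <- length(blob)
    offset <- offset + length(blob)
  }
  saveRDS(index, paste0(path, ".idx"))
  invisible(index)
}

# List the stored object names by reading only the index.
list_indexed <- function(path) readRDS(paste0(path, ".idx"))$name

# Read one object: seek to its offset and unserialize just those bytes.
load_indexed <- function(name, path) {
  index <- readRDS(paste0(path, ".idx"))
  row <- index[index$name == name, ]
  if (nrow(row) != 1L) stop("no unique entry named '", name, "'")
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = row$offset, origin = "start")
  unserialize(readBin(con, what = "raw", n = row$length))
}

# Usage: store three objects, then inspect and fetch one of them.
path <- tempfile(fileext = ".bin")
save_indexed(list(small = 1:3, big = rnorm(1e5), df = head(mtcars)), path)
list_indexed(path)           # names only; the data file is not read
load_indexed("small", path)  # reads only the bytes belonging to `small`
```

Because every object is serialized independently, anything shared between objects (an environment referenced by two closures, for instance) would be duplicated rather than preserved, and the sketch gives no access to pieces within a single object.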

rdpeng commented 9 years ago

I like this idea too and it has been in the back of my mind for a while. I seem to remember there being an effort to implement it a long time ago, but I think there was a hiccup with respect to preserving references shared across objects.

dani-lbnl commented 9 years ago

We need it badly, hope it gets selected!

hafen commented 9 years ago

I'm very interested in this as well. I suppose there is a question about what type of data you are thinking of. If you are thinking large data frames and you want random access to rows, that's one thing.

What I'm particularly interested in is storing large lists of arbitrary objects as key-value pairs and having random access by key, with a serverless solution. R is lacking sqlite-equivalent support for this case. There was a BerkeleyDB R package, which seems to be abandoned, that would achieve this, but there are a lot of other similar technologies that it would be great if R could support, such as

richfitz commented 7 years ago

For anyone following this ooold thread: I have started work on indexing rds here

jeroen commented 7 years ago

The corpus package by @patperry has an interesting implementation of memory mapping strings within json objects. It's one of the best working examples of accessing data within an object on disk without loading the entire thing into memory.

richfitz commented 7 years ago

My motivation for this is actually another memory-mapping thing: thor, which is a memory-mapped key-value store 😀