ropensci / unconf14

Repo to brainstorm ideas (unconference style) for the rOpenSci hackathon.

Build an R interface to Dat #2

Open karthik opened 10 years ago

karthik commented 10 years ago

Dat enables sharing and streaming of large datasets in a workflow similar to Git's. Using Dat's API, we can build an R interface that streams large data into and out of R's memory.

Repo

max-mapper commented 10 years ago

I'm excited to hack on this! I'm finishing up the first big release of Dat now, so the timing here is really good for me. I've been wondering how best to deal with streaming data and was able to cobble together this example (https://github.com/maxogden/dat/blob/master/examples/transform.r), but I'm looking forward to a deeper dive into the complexities of dealing with larger-than-memory datasets in R.

I think it would be great if we could build something that can load an entire Dat repo, just a portion of it, or even a moving window over it, and then either export a new, reduced dataset or transform the original.
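
To make the "portion of a repo" idea concrete, here is a minimal base-R sketch of chunked processing. It assumes the data has already been exported to a local CSV file and does not call Dat's API at all; `process_in_chunks` and its arguments are hypothetical names used only for illustration.

```r
# Hypothetical helper: apply `fun` to a CSV file one chunk at a time,
# so that only `chunk_size` rows are held in memory at once.
process_in_chunks <- function(path, chunk_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Keep the header line so every chunk parses with the same column names.
  header <- readLines(con, n = 1)
  results <- list()

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break  # end of file reached

    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    results[[length(results) + 1]] <- fun(chunk)
  }
  results
}

# Usage: reduce a 10-row file to one partial sum per chunk of 4 rows.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10), tmp, row.names = FALSE)
partial_sums <- process_in_chunks(tmp, 4, function(d) sum(d$x))
total <- sum(unlist(partial_sums))
```

Each call to `fun` sees an ordinary data frame, so the same function could either export a reduced summary, as here, or write a transformed chunk back out. A true moving window would additionally need to carry rows over between chunks, which this sketch does not attempt.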

cscheid commented 10 years ago

Dat looks great.

At some point I was hacking on elephant, an R object that behaved like an environment, but where every value stored in it would be transparently backed up to git. The idea was to have easy, versioned checkpointing of data throughout a script. I wonder if using Dat would make more sense.
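
For anyone unfamiliar with the pattern, here is a rough sketch of an environment-like object that checkpoints every assignment. This is not elephant's actual code: it is an illustration under the assumption that writing each value to an `.rds` file in a directory is enough, and it leaves out the git (or Dat) commit step entirely. The names `new_checkpoint_store` and the `checkpoint_store` class are hypothetical.

```r
# Hypothetical constructor: values live in an environment and are also
# serialized to `dir` on every assignment.
new_checkpoint_store <- function(dir) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  structure(list(env = new.env(), dir = dir), class = "checkpoint_store")
}

# Assignment: store the value in memory and write it to disk.
# A versioned backend would commit the file here; this sketch does not.
`$<-.checkpoint_store` <- function(x, name, value) {
  fields <- unclass(x)
  assign(name, value, envir = fields$env)
  saveRDS(value, file.path(fields$dir, paste0(name, ".rds")))
  x
}

# Lookup: read the value back from the in-memory environment.
`$.checkpoint_store` <- function(x, name) {
  get(name, envir = unclass(x)$env)
}

# Usage: the assignment below also creates scores.rds in the store directory.
store_dir <- file.path(tempdir(), "checkpoint-demo")
store <- new_checkpoint_store(store_dir)
store$scores <- c(0.1, 0.5, 0.9)
```

Because the files are plain serialized R objects in a directory, the directory itself could be placed under version control by whatever tool fits, which is where swapping git for Dat would come in.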