Open karthik opened 10 years ago
I'm excited to hack on this! I'm finishing up the first big release of Dat now, so the timing is great for me. I've been wondering how best to handle streaming data and was able to cobble together this example https://github.com/maxogden/dat/blob/master/examples/transform.r but I'm looking forward to a deeper dive into the complexities of dealing with larger-than-memory datasets in R.
I think it would be great if we could build something that can load an entire Dat repo, just a portion of one, or even a moving window over one, and then either export a new reduced dataset or transform the original.
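The moving-window idea could be sketched roughly like this in base R. Everything here is an assumption for illustration: it treats the repo as a stream of one-record-per-line text arriving on a connection, and `window_apply` is a hypothetical helper, not part of dat.

```r
# Sketch: apply a function to a moving window over a record stream.
# Assumes one record per line on the connection `con`; only `width`
# records are ever held in memory at once.
window_apply <- function(con, width, step, f) {
  buf <- character(0)
  repeat {
    lines <- readLines(con, n = step)
    if (length(lines) == 0) break          # end of stream
    buf <- c(buf, lines)
    if (length(buf) >= width) {
      f(buf[(length(buf) - width + 1):length(buf)])  # latest window
      buf <- tail(buf, width)              # drop records that slid out
    }
  }
}
```

A caller would open a connection to whatever dat exposes (stdin, a socket, a file) and pass a summarising function as `f`, e.g. one that parses each record and prints a rolling mean.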
Dat looks great.
At some point I was hacking on elephant, an R object that behaves like an environment but where every value stored in it is transparently backed by git. The idea was to have easy, versioned checkpointing of data throughout a script. I wonder if using dat would make more sense.
Dat enables sharing and streaming of large data in a workflow similar to how Git operates. Using dat's API, we can build an R interface that streams large data in and out of R's memory.
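In the meantime, the streaming half can already be approximated with a plain pipe, along the lines of the transform.r example linked above. This is only a sketch under assumptions: it supposes dat pipes one record per line over stdin/stdout, and the `toupper()` transform is just a placeholder.

```r
# Sketch: stream records through R one chunk at a time, so at most
# `chunk_size` lines are held in memory. Assumes one record per line
# on stdin (as in the transform.r example); replace toupper() with a
# real per-record transform.
stdin_con <- file("stdin", open = "r")
chunk_size <- 500
repeat {
  lines <- readLines(stdin_con, n = chunk_size)
  if (length(lines) == 0) break   # end of stream
  writeLines(toupper(lines))      # transform and emit each chunk
}
close(stdin_con)
```

It would be run inside a shell pipeline, with dat producing the input stream and consuming the output (the exact dat subcommands to use on either side of the pipe would depend on the dat version).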