techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

support for multi-file data sets #145

Closed behrica closed 3 years ago

behrica commented 4 years ago

I was reading about the multi-file data set support in the R package of arrow, and found the idea super interesting:

https://arrow.apache.org/docs/r/articles/dataset.html

It seems to me that something similar could be done in tech.ml.dataset as well, at least up to the point of selecting columns and rows out of such a multi-file dataset, ideally using existing functions.

This is another dimension of working with large data on disk, this time in the form of multiple files.

cnuernber commented 4 years ago

https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html#var-stream-.3Edataset-seq-inplace

This is supported via the arrow mmap pathway. From there you can create a grouped dataset out of the dataset sequence and use tablecloth with it. Tablecloth supports treating multiple datasets as one dataset.
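A minimal sketch of that mmap pathway, assuming the API from the linked docs (the file path is a placeholder):

```clojure
;; Sketch of the arrow mmap pathway described above.
;; stream->dataset-seq-inplace is the function from the linked
;; tech.v3.libs.arrow docs; "data/big-file.arrow" is a placeholder.
(require '[tech.v3.libs.arrow :as arrow]
         '[tech.v3.dataset :as ds])

;; Each record batch in the arrow stream becomes one dataset in the
;; returned sequence; the column data stays mmapped on disk rather
;; than being copied onto the JVM heap.
(def ds-seq
  (arrow/stream->dataset-seq-inplace "data/big-file.arrow"))

;; Ordinary dataset functions then work per batch.
(map ds/row-count ds-seq)
```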

behrica commented 4 years ago

Yes, I can see that. The "group-by" feature of tablecloth does that.

So I just needed a simple helper function which walks over a directory structure of arrow files and creates the required group-by data structures. The different datasets would themselves be backed by mmapped files, so they would not be in memory.

So this would allow operating on bigger-than-RAM collections of files, very similar to the example from the R arrow package above.
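A hedged sketch of such a helper. It assumes tablecloth's grouped-dataset representation (a dataset with :name/:group-id/:data columns plus :grouped? metadata), which is an implementation detail and may differ across versions:

```clojure
(require '[clojure.java.io :as io]
         '[tech.v3.libs.arrow :as arrow]
         '[tablecloth.api :as tc])

(defn arrow-dir->grouped-ds
  "Walk `dir` for .arrow files and present them as one tablecloth
  grouped dataset.  Each group stays backed by an mmapped file, so
  the column data never has to fit in heap memory at once."
  [dir]
  (let [dss (->> (file-seq (io/file dir))
                 (filter #(.endsWith (.getName ^java.io.File %) ".arrow"))
                 (mapcat #(arrow/stream->dataset-seq-inplace
                           (.getPath ^java.io.File %)))
                 vec)]
    ;; Assumption: a tablecloth grouped dataset is a plain dataset of
    ;; groups with :grouped? set in its metadata.
    (-> (tc/dataset {:name     (range (count dss))
                     :group-id (range (count dss))
                     :data     dss})
        (vary-meta assoc :grouped? true))))
```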

behrica commented 4 years ago

I was experimenting a bit with larger arrow files, but then I hit #141.

But I could confirm that the fix for #141 also fixed my original problem with working with several large arrow files and reading them via the arrow "inplace" methods.

I will see if it is ok to work with tablecloth 5.0, which should have the fix for #141.

behrica commented 4 years ago

I can confirm that this can be solved at the tablecloth level. I have some code which constructs the correct "grouped data frame" by iterating over arrow files on disk which are mmapped.

This means we can work with really large collections of arrow files on disk quickly, and with very little heap usage.

I am closing this here for the time being and might open an issue in tablecloth.

behrica commented 4 years ago

@cnuernber I have a proof of concept of the idea, which seems promising.

I started to implement the (minimally needed) dataset protocols, based on a seq of arrow file names. It mostly delegates to other methods.

This is for the moment independent of tablecloth, so it belongs more here. But my goal is to have it working with tablecloth as well (which should follow automatically if I indeed implement the needed protocols).
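For illustration only, the delegation idea can be sketched without the real protocol machinery. `select-columns` is an actual tech.v3.dataset function; the wrapper name below is hypothetical:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

(defn multi-file-select-columns
  "Hypothetical helper: delegate column selection to the per-file
  datasets, returning a lazy seq so only one mmapped file's batches
  are realized at a time."
  [file-names colnames]
  (for [f file-names
        d (arrow/stream->dataset-seq-inplace f)]
    (ds/select-columns d colnames)))
```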

I would propose to close https://github.com/scicloj/tablecloth/issues/6 and continue discussing here.

Alternatively, I could open another GitHub project, in case you don't feel it belongs in tech.ml.dataset. I only have (and likely will only have) dependencies on tech.v3.xxx.

In my opinion it could even fit here, as you write in the introduction to tech.ml.dataset:

"Datasets are currently in-memory columnwise databases and we support parsing from file or input-stream. "

cnuernber commented 3 years ago

This is really a duplicate of earlier issues. I am not interested in working on this at this time. Datasets that are represented by a sequence of data are, I think, best left for another project.