Closed behrica closed 3 years ago
This is supported via the arrow mmap pathway. From there you can create a grouped dataset out of the dataset sequence and use tablecloth with it. Tablecloth supports treating multiple datasets as one dataset.
Yes, I can see that. The "group-by" feature of tablecloth does that.
So I just needed a simple helper function which walks over a directory structure of arrow files and creates the required group-by data structures. The individual datasets would themselves be backed by mmapped files, so they would not be in memory.
So this would allow operating on bigger-than-RAM collections of files, very similar to the example from the R arrow package above.
I was experimenting a bit with this with larger arrow files, but then I hit #141.
But I could confirm that the fix for #141 also fixed my original problem of working with several large arrow files and reading them via the arrow "inplace" methods.
I will see if it works ok with tablecloth 5.0, which should have the fix for #141.
I can confirm that this could be solved at the tablecloth level. I have some code which constructs the correct "grouped data frame" by iterating over arrow files on disk which are mmapped.
This means we can work with really large "collections of arrow files" on disk quickly and with very little heap usage.
I will close it here for the time being and might open an issue in tablecloth.
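A minimal sketch of the helper described above, under some assumptions: it uses tech.v3.libs.arrow's mmap-backed reading (here `stream->dataset` with `{:open-type :mmap}`; the exact function names vary between library versions, older ones expose `read-stream-dataset-inplace`) and tablecloth's grouped-dataset representation (a plain dataset with `:name`/`:group-id`/`:data` columns and `{:grouped? true}` metadata). The function name `arrow-dir->grouped-ds` and the directory layout are hypothetical.

```clojure
(require '[tech.v3.libs.arrow :as arrow]
         '[tablecloth.api :as tc]
         '[clojure.java.io :as io])

(defn arrow-dir->grouped-ds
  "Walk `dir` for .arrow files, open each as an mmapped dataset,
  and wrap them as a tablecloth grouped dataset backed by those files."
  [dir]
  (let [files (->> (file-seq (io/file dir))
                   (filter #(.endsWith (.getName ^java.io.File %) ".arrow")))
        ;; Assumption: mmap-backed open; data stays on disk, not on the heap.
        dss   (map #(arrow/stream->dataset (.getPath ^java.io.File %)
                                           {:open-type :mmap})
                   files)]
    ;; A tablecloth grouped dataset is itself a dataset with
    ;; :name/:group-id/:data columns plus {:grouped? true} metadata.
    (-> (tc/dataset {:name     (mapv #(.getName ^java.io.File %) files)
                     :group-id (vec (range (count files)))
                     :data     (vec dss)})
        (vary-meta assoc :grouped? true))))
```

From there the whole collection can be treated as one logical dataset, e.g. `(-> (arrow-dir->grouped-ds "data/") (tc/select-columns [:a]) tc/ungroup)`, though `ungroup` would of course realize rows into memory.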
@cnuernber I have a proof of concept of the idea, which seems promising.
I started to implement the (minimally needed) dataset protocols based on a seq of arrow file names. It mostly delegates to other methods.
This is for the moment independent of tablecloth, so it belongs more here. But my goal is to have it working with tablecloth as well (which should follow, if I indeed implement the needed protocols).
I would propose to close https://github.com/scicloj/tablecloth/issues/6 and continue discussing here.
Alternatively I could open another GitHub project, in case you don't feel it belongs in tech.ml.dataset. I only have (and likely will only have) dependencies on tech.v3.xxx.
In my opinion it could even fit here, as you write in the introduction to tech.ml.dataset:
"Datasets are currently in-memory columnwise databases and we support parsing from file or input-stream. "
This is really a duplicate of earlier issues. I am not interested in working on this at this time. Datasets that are represented by a sequence of data are, I think, best left for another project.
I was reading about the multi-file data set support in the R package of arrow, and found the idea super interesting:
https://arrow.apache.org/docs/r/articles/dataset.html
It seems to me that something similar could be done in tech.ml.dataset as well, at least up to the point of selecting columns and rows out of such a multi-file dataset, ideally using existing functions.
This is another dimension of working with large data on disk, this time in the form of multiple files.