Some thoughts about data tree design

c42f commented 4 years ago

Hi Shashi it was nice to chat about this!

I had some thoughts about the design and how it relates to what I've been thinking about

I feel like the data structure you have here isn't about files (though "file" is an evocative name); rather it's a general, hierarchically path-indexed and in-memory tree data structure. So perhaps DataTree would be a more descriptive name. I almost called one of my own types this in my prototype! But in the end I settled on FileTree because the thing I've written so far is a lazy view which is explicitly backed by the filesystem rather than being reflected in memory.
In general it should be possible to lazily or eagerly initialize an in-memory "DataTree" from any of various data storage backends, not limited to the filesystem. I think this is already on your mind :)
I think what I want for DataSets.jl is a general lazy interface for hierarchically-indexed trees of data which storage/deserialization backend code can implement. The idea for DataSets.jl is that that one can name such data storage locations declaratively and open them, yielding a Julia type which can read them lazily. Then we would have a whole family S3Tree, ZipTree etc etc with the same basic interface.
A lazy loading interface would be similar in the spirit to how Tables.jl provides an interface which tabular data sources can satisfy. I feel like the fundamental parts of this interface are pretty basic — iteration over children, selecting a child by name, traversing into a child tree node, possibly attributes/metadata access per node (considering how HDF5 attributes, filesystem stat info could be represented)
The idea of having data parallel operations on the tree able to use Dagger.jl seems cool as a high level API for particular workloads where the tree structure happens to match the desired partitioning of work. I think this is a common case, but not always what you want. In any case, processing is definitely out of scope for DataSets.jl!

In general, I think we're building something related but largely complimentary: in DataSets.jl I'm focusing on how one lazily reads the data index and data "from disk" — or other static location. I want to declaratively define such data locations and systematically turn that config into Julia objects the user can work with in their program. Have a system to move such data between storage backends etc etc. (Of course, DataSets.jl isn't restricted to trees. In principle the same ideas apply to the many tabular data formats, and data we'd often consider as a "single file"; eg large images or other multidimensional arrays.)

shashi commented 4 years ago

So perhaps DataTree would be a more descriptive name.

That's a good suggestion! Yes, it's only called file trees because it uses paths. There's a DataTrees.jl package already haha :sweat_smile:

tree structure happens to match the desired partitioning of work

You mean the files are already pretty equally distributed in size (and work)? Yeah that's true in edge cases like one big CSV file. Dagger can handle irregularity in workload and avoid starving workers, so it's not so bad.

But I see your iterators idea! That's cool! I'm excited to see what you make.

Okay yes, DataSets.jl idea is clearer to me now! Thanks for that. I really feel it would be nice to have generically typed indices.

c42f commented 4 years ago

There's a DataTrees.jl package already haha

Oh :grimacing:. But then again, it seems to have only four commits and not be registered which I guess means it's essentially abandoned. So the name may be available after all :)

rofinn commented 3 years ago

Looks like I've been inadvertently doing a lot of the same things as FileTrees.jl, just completely outside the context of filesystems. In AxisSets.jl, I'm storing an associative of paths (Tuple{Vararg{Symbol}}) to AxisKeys.KeyedArrays. I even do a reduced glob pattern matching.

I think if the FileTree type was extracted into a slightly more general TreeDict type, then I could just replace a lot of that logic with the more general structure, and just focus on dimension alignments. I've been considering implementing that type as an extension of Dictionaries.AbstractDictionary, so I can implement a set of indices that are optimized for prefix and pattern matching searches.

shashi commented 3 years ago

I've been considering implementing that type as an extension of Dictionaries.AbstractDictionary, so I can implement a set of indices that are optimized for prefix and pattern matching searches.

That sounds good. What would be the keys and values if FileTree was to be made a AbstractDictionary?

c42f commented 3 years ago

The way I've been thinking about this in DataSets.jl:

keys are path components (file/directory names of immediate children)
values are the immediate children (as in the AbstractTrees interface) - nodes in the tree - the data itself.
Iteration (and broadcast!) iterates values as in the Dictionaries.jl interface.

These rules mean that such trees are not exactly like AbstractDict because that iterates key-value pairs. However I believe value-iteration is just a lot better for data-driven work than iterating key-value pairs by default and Dictionaries.jl gets this right.

In addition, you can add paths to the mix

relative paths are sequences of keys which index depthwise into the tree
getindex with paths is iterated getindex with the path segments

shashi / FileTrees.jl

Some thoughts about data tree design #9