shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl

Some thoughts about data tree design #9

Open c42f opened 4 years ago

c42f commented 4 years ago

Hi Shashi, it was nice to chat about this!

I had some thoughts about the design and how it relates to what I've been thinking about.

In general, I think we're building something related but largely complementary: in DataSets.jl I'm focusing on how one lazily reads the data index and data "from disk", or from another static location. I want to declaratively define such data locations and systematically turn that config into Julia objects the user can work with in their program, and to have a system for moving such data between storage backends, and so on. (Of course, DataSets.jl isn't restricted to trees. In principle the same ideas apply to the many tabular data formats, and to data we'd often consider a "single file", e.g. large images or other multidimensional arrays.)

shashi commented 4 years ago

> So perhaps DataTree would be a more descriptive name.

That's a good suggestion! Yes, it's only called file trees because it uses paths. There's a DataTrees.jl package already haha :sweat_smile:

> tree structure happens to match the desired partitioning of work

You mean the files are already pretty equally distributed in size (and work)? Yeah that's true in edge cases like one big CSV file. Dagger can handle irregularity in workload and avoid starving workers, so it's not so bad.

But I see your iterators idea! That's cool! I'm excited to see what you make.

Okay yes, the DataSets.jl idea is clearer to me now! Thanks for that. I really feel it would be nice to have generically typed indices.

c42f commented 4 years ago

> There's a DataTrees.jl package already haha

Oh :grimacing:. But then again, it seems to have only four commits and isn't registered, which I guess means it's essentially abandoned. So the name may be available after all :)

rofinn commented 3 years ago

Looks like I've been inadvertently doing a lot of the same things as FileTrees.jl, just completely outside the context of filesystems. In AxisSets.jl, I'm storing an associative mapping from paths (Tuple{Vararg{Symbol}}) to AxisKeys.KeyedArrays. I even do a reduced form of glob pattern matching.

I think if the FileTree type was extracted into a slightly more general TreeDict type, then I could just replace a lot of that logic with the more general structure, and just focus on dimension alignments. I've been considering implementing that type as an extension of Dictionaries.AbstractDictionary, so I can implement a set of indices that are optimized for prefix and pattern matching searches.
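For concreteness, the kind of structure described above might look roughly like the following. This is only a minimal sketch using Base Julia: `TreePath`, `WILDCARD`, `hasprefix`, `matches`, and `select` are hypothetical names made up for illustration, not the API of FileTrees.jl, AxisSets.jl, or Dictionaries.jl, and plain vectors stand in for the KeyedArrays. It also does a linear scan over the keys rather than using an index optimized for prefix or pattern searches.

```julia
# Hypothetical sketch: none of these names come from FileTrees.jl or AxisSets.jl.
const TreePath = Tuple{Vararg{Symbol}}

# Assumed convention: this symbol matches any single path component.
const WILDCARD = Symbol("*")

# True when `prefix` equals the leading components of `path`.
hasprefix(path::TreePath, prefix::TreePath) =
    length(prefix) <= length(path) && path[1:length(prefix)] == prefix

# Reduced glob matching: same length, each component equal or a wildcard.
matches(path::TreePath, pattern::TreePath) =
    length(path) == length(pattern) &&
    all(p == WILDCARD || p == c for (c, p) in zip(path, pattern))

# Keep only the entries whose path satisfies the predicate `f`.
select(f, tree::AbstractDict) = filter(kv -> f(first(kv)), tree)

# Plain vectors stand in for the array values here.
tree = Dict{TreePath,Vector{Float64}}(
    (:train, :temp) => [1.0, 2.0],
    (:train, :load) => [3.0],
    (:test, :temp)  => [4.0],
)

train = select(p -> hasprefix(p, (:train,)), tree)         # prefix search
temps = select(p -> matches(p, (WILDCARD, :temp)), tree)   # pattern search
```

A dedicated index type could replace the linear `filter` with something like a trie over path components, which is presumably where a custom set of indices would pay off.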

shashi commented 3 years ago

> I've been considering implementing that type as an extension of Dictionaries.AbstractDictionary, so I can implement a set of indices that are optimized for prefix and pattern matching searches.

That sounds good. What would be the keys and values if FileTree were to be made an AbstractDictionary?

c42f commented 3 years ago

The way I've been thinking about this in DataSets.jl:

These rules mean that such trees are not exactly like AbstractDict, because AbstractDict iterates key-value pairs. However, I believe value-iteration is just a lot better for data-driven work than iterating key-value pairs by default, and Dictionaries.jl gets this right.
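To illustrate the difference with a small sketch that uses only Base (it is not DataSets.jl code, and the file names and sizes are made up): a Base `Dict` yields key-value pairs when iterated, so data-oriented code has to unpack them or ask for `values` explicitly, whereas a Dictionaries.jl dictionary yields the values directly, the way an array does.

```julia
# Made-up data: file names mapped to row counts.
d = Dict("a.csv" => 10, "b.csv" => 20)

# Base.AbstractDict iterates key-value pairs, so a reduction over the data
# has to unpack each pair and discard the key.
total_from_pairs = sum(v for (k, v) in d)

# Iterating the values directly keeps the key handling out of the way.
# This is the behaviour that value-iterating containers give by default.
total_from_values = sum(values(d))

# When the keys are needed, they are requested explicitly.
files = sort(collect(keys(d)))
```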

In addition, you can add paths to the mix