shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl

Lazy default children #25

Open DrChainsaw opened 4 years ago

DrChainsaw commented 4 years ago

About my story from the discourse thread, I was a bit bored so I cooked up a solution which seems to work for my use case.

I don't know if this is clean and robust enough to be worthwhile to add, but here it is in case anyone finds it useful.

To summarize the use case: sometimes files contain multiple items which are useful to view as separate Files in the filetree, but it is not possible to know in advance exactly which items they contain. One example of this is text logs where each line is printed from some part of a program and different parts print different things, something like this:

> cat some.log
thisorthat_func:234 par1=324, par2=happy,...
thisorthat_func:234 par1=12, par2=sad,...
someother_func:78 x=34.4, z=11,...
thisorthat_func:234 par1=32, par2=sad,...
...

One might want to see this as

some.log/
├─ thisorthat_func234 DataFrame(par1, par2,...)
└─ someother_func78 DataFrame(x, z,...)
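
For illustration, grouping such log lines by their source could be sketched in plain Julia as below. This is only a sketch under assumptions: the helper names `parseline` and `grouplines` are hypothetical, the line format is assumed to be exactly `func:line key=value, ...`, and all values are kept as strings instead of being collected into DataFrames.

```julia
# Hypothetical helper: split one log line into its source name and key => value fields.
# Assumes the format "func:line key=value, key=value".
function parseline(line::AbstractString)
    source, rest = split(line, ' '; limit=2)
    fields = Pair{Symbol,String}[]
    for item in split(rest, ','; keepempty=false)
        kv = split(strip(item), '='; limit=2)
        length(kv) == 2 || continue  # skip anything that is not key=value
        push!(fields, Symbol(kv[1]) => String(kv[2]))
    end
    return replace(source, ":" => ""), fields
end

# Hypothetical helper: collect the parsed lines per source, one entry per would-be child.
function grouplines(lines)
    groups = Dict{String,Vector{Vector{Pair{Symbol,String}}}}()
    for line in lines
        source, fields = parseline(line)
        push!(get!(groups, source, Vector{Pair{Symbol,String}}[]), fields)
    end
    return groups
end

lines = [
    "thisorthat_func:234 par1=324, par2=happy",
    "thisorthat_func:234 par1=12, par2=sad",
    "someother_func:78 x=34.4, z=11",
]
groups = grouplines(lines)
```

Each key of `groups` corresponds to one child in the tree above, and which keys exist is only known after the whole file has been read.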

It is often known roughly what could be in there, so it makes some sense to have default children. Here I use a dict which maps child names to their values for easier grokking, but this should be possible to abstract to a generic function. It is also hardwired to assume laziness for the sake of brevity.

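using FileTrees

# Curried form so it can be passed to map: defaultchildren(names)(file)
defaultchildren(names, val=s -> NoValue()) = f -> defaultchildren(f, names, val)
function defaultchildren(f::Union{File,FileTree}, names, val)
    v = f[]          # the lazy value (a Thunk) which produces the Dict
    v.cache = true   # intended to avoid recomputing the Dict for every child
    maketree(name(f) => defaultchild.(names, Ref(v), val))
end
# Each child lazily looks up its own key and falls back to val(n) if the key is missing
defaultchild(n, v, val) = (name = n, value = FileTrees.maybe_lazy(d -> get(d, Symbol(n), val(n)))(v))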

Here is an example:

julia> ft = maketree("1" => [(name="2", value=lazy(() -> Dict(:a => 1, :b => 2, :c => 3))())])
1/
└─ 2 (Thunk(#21, ()))

julia> map(defaultchildren(["a", "b", "c"]), ft; dirs=false)
1/
└─ 2/
   ├─ a (Thunk(#11, (Thunk(#21, ...),)))
   ├─ b (Thunk(#11, (Thunk(#21, ...),)))
   └─ c (Thunk(#11, (Thunk(#21, ...),)))

julia> ftd = map(defaultchildren(["a", "b", "c"]), ft; dirs=false)
1/
└─ 2/
   ├─ a (Thunk(#11, (Thunk(#21, ...),)))
   ├─ b (Thunk(#11, (Thunk(#21, ...),)))
   └─ c (Thunk(#11, (Thunk(#21, ...),)))

julia> reducevalues(+, ftd[r"(a|c)"]) |> exec
4

So far so good, but the big drawback is that if there are values produced by the creation operation which we have not guessed are there, they will never be seen:

julia> ftd = map(defaultchildren(["a", "b"]), ft; dirs=false)
1/
└─ 2/
   ├─ a (Thunk(#11, (Thunk(#21, ...),)))
   └─ b (Thunk(#11, (Thunk(#21, ...),)))

julia> reducevalues(+, ftd[r"(a|c)"]) |> exec
1

Another issue is that one might not want to have the default children, only the ones which actually materialized.

Both of these are addressed by the following extension:

# Slight redefinition of defaultchildren:
function defaultchildren(f::Union{File,FileTree}, names, val) 
    v = f[]
    v.cache = true
    maketree((name=name(f), value=lazy(fixmeup)(f)) => defaultchild.(names, Ref(v), val))
end

# This should obviously have a better name
struct FixMeUp{T,F}
    v::T
    rmchild::F
end
fixmeup(f::File, rmchild=f -> f isa File && f[] isa NoValue) = FixMeUp(f[], rmchild)
function FileTrees.FileTree(parent::Union{FileTree,Nothing}, myname::String, children::Vector{T}, value::FixMeUp) where T
    d = exec(value.v)

    # Create children for keys in the dict which did not have default children
    newchildren = [File(nothing, string(dk), dv) for (dk, dv) in d if string(dk) ∉ name.(children)]

    # Remove (typically default) children which we don't want (e.g. have NoValue) 
    cf = filter(!value.rmchild, vcat(newchildren, children))
    return FileTree(parent, myname, cf)
end
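
The merge step in that constructor can also be tried in isolation. The sketch below uses only plain Julia values and no FileTrees types: `mergechildren` is a hypothetical name, pairs stand in for Files, and `nothing` stands in for NoValue.

```julia
# Hypothetical stand-in for the constructor logic above, using plain Julia values.
# `defaults` plays the role of the default children and `d` the materialized Dict.
function mergechildren(defaults, d)
    known = first.(defaults)
    # Create children for keys in the dict which did not have default children
    newchildren = [string(k) => v for (k, v) in d if string(k) ∉ known]
    # Remove children whose value never materialized (nothing stands in for NoValue)
    return filter(c -> !isnothing(last(c)), vcat(newchildren, defaults))
end

d = Dict(:a => 1, :b => 2, :c => 3)
# Look up each guessed name; "y" and "z" are not in d and fall back to nothing
defaults = [n => get(d, Symbol(n), nothing) for n in ["a", "b", "y", "z"]]
merged = mergechildren(defaults, d)
```

Here `merged` holds "c" followed by "a" and "b", while the "y" and "z" placeholders are dropped.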

Now what happens? Check it out:

julia> ftd = map(defaultchildren(["a", "b", "y", "z"]), ft; dirs=false)
1/
└─ 2/ (Thunk(fixmeup, (File(1\2),)))
   ├─ a (Thunk(#11, (Thunk(#29, ...),)))
   ├─ b (Thunk(#11, (Thunk(#29, ...),)))
   ├─ y (Thunk(#11, (Thunk(#29, ...),)))
   └─ z (Thunk(#11, (Thunk(#29, ...),)))

# Default is to remove all Files with NoValue
julia> ftd |> exec
1/
└─ 2/
   ├─ c (Int64)
   ├─ a (Int64)
   └─ b (Int64)

# Ok, it is impossible to know whether c will match or not before exec
julia> reducevalues(+, ftd[r"(a|b|c)"]) |> exec
3

# So one has to remember to do this
julia> reducevalues(+, ftd[r"(a|b|c)"] |> exec)
6

# A bit the same with lazy mappings, but given that one knows which default values they put in this should not be surprising
# FixMeUp could be added to things to ignore in mapvalues
julia> ftm = mapvalues(x -> x isa FixMeUp ? x : 10x, ftd)
1/
└─ 2/ (Thunk(#35, (Thunk(fixmeup, ...),)))
   ├─ a (Thunk(#35, (Thunk(#11, ...),)))
   ├─ b (Thunk(#35, (Thunk(#11, ...),)))
   ├─ y (Thunk(#35, (Thunk(#11, ...),)))
   └─ z (Thunk(#35, (Thunk(#11, ...),)))

# Note: c was not multiplied 
julia> reducevalues(+, ftm[r"(a|b|c)"] |> exec)
33

shashi commented 4 years ago

> but the big drawback is that if there are values produced by the creation operation which we have not guessed are there, they will never be seen

This is wholly up to how the data is, right? Or does the FileTrees API somehow exacerbate it?


# Ok, it is impossible to know whether c will match or not before exec
julia> reducevalues(+, ftd[r"(a|b|c)"]) |> exec
3

# So one has to remember to do this
julia> reducevalues(+, ftd[r"(a|b|c)"] |> exec)
6

I agree this is bad!

I'm still working through all the content in this issue haha! Thanks a lot. We should probably open smaller issues for each item we can improve.

DrChainsaw commented 4 years ago

Sorry for dumping a wall of text in the tracker. I hope it wasn't too painful to read :)

> This is wholly up to how the data is, right? Or does the FileTrees API somehow exacerbate it?

I would say it is a characteristic of the first naive defaultchildren implementation, which assumes that only the given children will be created. It was just meant as motivation for why that hacky FixMeUp struct was there in the second part. The FileTrees API does not do anything to make it worse.

In case it was unclear: I'm perfectly happy to have this as a one-off solution in my code. It may very well be one of those things where it is easier for people to create their own solution than it is for them to understand how to use a generic one. Having unknown children in the tree is most likely gonna end up clashing with laziness in a lot of ugly ways.