shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl
Other
89 stars 6 forks source link

direct filter! api for `FileTree`? #43

Open Moelf opened 3 years ago

Moelf commented 3 years ago

IMHO it's an anti-pattern to do:

tree[r"regex"]

and currently, you will need to do something close to:

tree = filter(x->!contains(x.name,r"\$|%"), tree);

we should probably allow filter!(pred, tree) to work directly?

(if not not to directly apply predicate on tree.children is a debatable question)

DrChainsaw commented 3 years ago

IMHO it's an anti-pattern to do

Yikes, I use regexes all the time when using FileTrees. Is it possible to get an ELI5 version of this issue?

When I see filter! I think of a mutating version of filter, but the second example looks like some kind of negative filter (i.e ! means not instead of mutates). Thats something I have wanted too (everything which does not match a regex) so perhaps I'm just projecting here :)

Moelf commented 3 years ago

anti-pattern because getindex(array, RegEx) does not make sense, filter(predicate, array) is the universal syntax for this semantics in Julia

DrChainsaw commented 3 years ago

Ah, now I understand.

I guess it is just to enable the convenient bracket syntax, similar to what e.g. DataFrames and Dicts do. Not saying you don't have a point, but FileTrees are not arrays so maybe they don't need to follow array conventions.

Datseris commented 2 years ago

Couldn't agree more with the original statement that this is not the intuitive way. Persoanlly, I would even like to have a filter option at tree creation. I.e., allow something like

taxi_dir = FileTree("taxi-data", f::Function = x -> true) # by default no filtering

or a keyword filter = x -> true that only puts files that survive the filter in the tree.

Datseris commented 2 years ago

In fact, I didn't even understand that the way to create a tree of "only .x files" was by making the full tree and then doing tree[glob"*/*.x"]. I only understood this after I found this issue.

DrChainsaw commented 2 years ago

I have also missed the predicate when creating the tree. Apart from the filtering, one can also return nothing when loading:

ft = load(FileTree("taxi-data")) do f
     endswith(name(f), ".x") && return nothing
     # code to load .x files
end

Unless original author has any objections, I'll accept a PR for the FileTree predicate (make it first argument to support do-syntax).

I'm also somewhat sympathetic to the getindex issue even though I like the current way just for the sake of convenience. Maybe this is whataboutism, but DataFrames support the same convention:

julia> df = DataFrame(a = 1:3, aa = 2:4, b = 11:13)
3×3 DataFrame
 Row │ a      aa     b     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      2     11
   2 │     2      3     12
   3 │     3      4     13

julia> df[:, r"a"]
3×2 DataFrame
 Row │ a      aa    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
   2 │     2      3
   3 │     3      4

If the main gripe is the negation, maybe something like tree[Not(r"bla")] or tree[!, r"bla"]), but it does not look very appealing. Negative regexes are also possible to construct afaik (at least for simple match patterns), but it is a bit messy.

Afaik, filter already works on FileTrees, but it gives you the whole node to allow for filtering on values. Not sure we would like to limit it so you only see the names/paths.

julia> t = maketree("root" => [(name="a", value=2), (name="b", value=3)])
root/
├─ a (Int64)
└─ b (Int64)

julia> filter(Returns(true), t)
root/
├─ a (Int64)
└─ b (Int64)

julia> filter(contains("root") ∘ path , t; dirs=false)
root/
├─ a (Int64)
└─ b (Int64)

julia> filter(contains("a") ∘ path , t; dirs=false)
root/
└─ a (Int64)

julia> filter(!contains("a") ∘ path , t; dirs=false)
root/
└─ b (Int64)

Something seems off with the docstring though. It seems to merge the docstrings for map and filter. Furthermore it seems to be incorrect as it has the same semantics as filter for other collections (true means keep). I'll fix that when time permits.

 help?> filter
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  

  map(f, tree::FileTree; walk=FileTrees.postwalk, dirs=true)

  apply f to every node in the tree. To only visit File nodes, pass dirs=false.

  walk can be either FileTrees.postwalk or FileTrees.postwalk.

  filter(f, tree::FileTree; walk=FileTrees.postwalk, dirs=true)

  remove every node x from tree where f(x) is true. f(x) must return a boolean value.

  ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  
shashi commented 2 years ago

In fact, I didn't even understand that the way to create a tree of "only .x files" was by making the full tree and then doing tree[glob"/.x"]. I only understood this after I found this issue.

We should probably have FileTree(glob"*/*.x") work.

I do not consider the API to be "finished". Would love to accept any PRs with improvements! I tend to avoid having too many options if the same effect can be achieved through existing features.

DrChainsaw commented 2 years ago

We should probably have FileTree(glob"/.x") work.

This is a great idea!

I suppose the equivalent for regex won't make sense though, so maybe we also want the function version?