Open hhaensel opened 3 months ago
Thank you for the PR. Unfortunately, #19 is about adding support for **
to glob, not to fnmatch, so the PR here will not address that any more than the last time this implementation was attempted: https://github.com/vtjnash/Glob.jl/pull/21#issuecomment-456904098
If you want to fix #19, you need to write an updated algorithm for the glob readdir functions in this file that can successfully traverse extra directory levels when encountering a **
, and then filter out any duplicates from the resulting list
Oh, I hadn't seen that implementation of @oxinabox. My implementation is indeed almost identical.
One way of supporting globstar syntax for glob()
would be to define it completely differently:
function globstar(g::Union{AbstractString,Glob.GlobMatch}, prefix::AbstractString = "";
relative::Union{Bool, Nothing} = nothing,
topdown::Bool = true,
follow_symlinks::Bool = true,
onerror::Union{Function, Nothing} = nothing
)
g = Glob.GlobMatch(g)
relative === nothing && (relative = isempty(prefix))
isempty(prefix) && (prefix = pwd())
fn = FilenameMatch(join([fn isa AbstractString ? fn : fn.pattern for fn in g.pattern], "/"), PATHNAME)
matches = String[]
for (root, dirs, files) in walkdir(prefix; topdown, follow_symlinks, onerror)
for file in files
file = joinpath(root, file)
relfile = relpath(file, prefix)
relpattern = Sys.iswindows() ? replace(relfile, '\\' => '/') : relfile
occursin(fn, relpattern) && push!(matches, relative ? relfile : file)
end
end
matches
end
With this definition you get
julia> globstar(glob"**/*.jl")
9-element Vector{String}:
"atest.jl"
"btest.jl"
"ctest.jl"
"noskip\\test1.jl"
"noskip\\test2.jl"
"noskip\\test3.jl"
"noskip\\hh\\test3.jl"
"skip\\test.jl"
"skip\\test2.jl"
julia> globstar(glob"**/*.jl", ".")
9-element Vector{String}:
".\\atest.jl"
".\\btest.jl"
".\\ctest.jl"
".\\noskip\\test1.jl"
".\\noskip\\test2.jl"
".\\noskip\\test3.jl"
".\\noskip\\hh\\test3.jl"
".\\skip\\test.jl"
".\\skip\\test2.jl"
julia> globstar(glob"**/*.jl", pwd())
9-element Vector{String}:
"C:\\Users\\hh\\test\\atest.jl"
"C:\\Users\\hh\\test\\btest.jl"
"C:\\Users\\hh\\test\\ctest.jl"
"C:\\Users\\hh\\test\\noskip\\test1.jl"
"C:\\Users\\hh\\test\\noskip\\test2.jl"
"C:\\Users\\hh\\test\\noskip\\test3.jl"
"C:\\Users\\hh\\test\\noskip\\hh\\test3.jl"
"C:\\Users\\hh\\test\\skip\\test.jl"
"C:\\Users\\hh\\test\\skip\\test2.jl"
Edit: fixed rootdir
Edit 2: refactored to support prefix
and other options for walkdir.
The option relative
is also interesting in cases where you want relative paths with respect to a rootdir, e.g. for additional filtering.
P.S.: the above implementation only works with the current PR in place.
Yes, that seems somewhat right. It is less flexible than the current implementation (which avoids making any assumptions about the contents of what makes up a path and permits arbitrary matches in the sequence including regexes), but otherwise somewhat accurate to it
One could define this function for handling string args and otherwise test whether the pattern vector contains "**" and doesn't contain regexes and only then call this function, otherwise keep the old version in place for compatibility. Tell me how to proceed.
There's a severe bug in the approach of my occursin()
. Needs some rethinking.
Your code seems actually pretty close to right, since you do need to first generate a list of all candidates and then filter out the ones that don't match. To maintain the rest of the matching ability, you just need to replace occursin(fn, relpattern)
with a call to splitpath(fn)
and then write a matcher function to see if each path vector matches with all of the tail patterns, and if not, then to chop off one prefix from the path vector and try matching the tail pattern again, until either the path vector is exhausted or a match succeeds
It seemed very close, but it actually wasn't, because latest with multiple globstars the function would always fail.
So I reverted the changes and instead wrapped the main part of the function in a while loop which restarts the matching process with a new directory. Mutliple levels of globstars are supported by using an array of globstarmatches.
I also added directory matching and a fileonly mode.
Finally, I added a bunch of tests and hope that all edge cases are contained. Don't hesitate to add more if it seems useful.
There is one point I observed when matching the old way:
julia> glob(["a", r".", "c"])
4-element Vector{String}:
"a\\.b\\c"
"a\\.c\\c"
"a\\b\\c"
"a\\c\\c"
So it seems that one would have to set start and end markers in order to only match a single character for the middle directory.
I think I got everything sorted out by now.
Concerning the partial regex match, I think the best way is to adapt the docs and replace r"."
by r"^.$"
Some final thought perhaps:
@vtjnash I continued thinking about my statement about multiple globstars. I thought that I had found an example that would not pass without my array queue but I couldn't repeat it. Mathematically I now think, it is not necessary to retry with a new position of the first wildcard as soon as the second one is reached. So not queuing saves a lot of computing. Therefore, I went back to simple indexing as you did with star and all tests pass.
During that scan I found one error in my code with globstar_period
which should be solved now.
I just want to apologize that I have barely any time for development for the next couple weeks, but I should be fully back in July. Originally my expectation was that if someone wanted fancier options, they could construct the object manually, instead of using the macro to split it.
@vtjnash understood, thanks for the feedback.
I'm still not completely satisfied with the current status for various reasons
globstar matching I
We could easily support gitignore-patterns, but we either need another flag to switch on/off globstar matching or we always turn on globstar and break with the IEEE standard
"**/"
, but it would be correct to rather detect "/**/"
'!'
, should we support that syntax?walkdir()
which always lists all files in all sub directories even when it is clear that the pattern does not match. A better implementation would iterate the tree explicitly and not travel through non-matching subdirs.If ever you find the time for thinking about these points, just drop a note here
Any news on this one?
Yeah, I am finally able to get back to this. Thank you for your patience and persistence!
I think we should break this down into a couple different PRs for easier discussion and review:
fnmatch
glob
!
as an additional featureWould you mind making a new PR with just the fnmatch improvements? We can keep this open until everything is merged, but it would help me with thinking through reviewing that it handles **
correctly. Secondly though, I see there are a couple of different, distinct interpretations of what globstar means. We will need to document what exactly we intend for this repository to implement. The zsh code is MIT licensed, so we can copy this text directly:
https://man.freebsd.org/cgi/man.cgi?query=zshexpn&sektion=1
Looks like the patterns to detect are **/
, /**/
, /**
, and **
alone, or more concisely, it can be treated as being equivalent to r"(^/)**(/$)"
for pattern detection. I appreciate that zsh
then treats **/
as matching zero or more directories (instead of one or more as bash does), so I think that is worth copying from them.
some accept globstar but don't restrict '*'
This sounds like the expected behavior of using fnmatch without the FNM_PATHNAME flag. Is there an additional distinction?
defaulting to PATHNAME
I don't entirely know why I did that. Sounds like a bug actually, as I meant for it to generate separate fnmatch calls for each directory, such that each component of the glob was split on /
but would match any filename it found in there, even if the filenames themselves could end up being arbitrary strings (e.g. if the content being matched was from something filesystem-like, but not actually a posix filesystem)
That sounds like a good proposal. I will come up with a globstar PR for fnmatch asap.
This PR addresses #19