vtjnash / Glob.jl

Posix-compliant file name pattern matching
Other
125 stars 18 forks source link

Hh globstar #38

Open hhaensel opened 3 months ago

hhaensel commented 3 months ago

This PR addresses #19

vtjnash commented 3 months ago

Thank you for the PR. Unfortunately, #19 is about adding support for ** to glob, not to fnmatch, so the PR here will not address that any more than the last time this implementation was attempted: https://github.com/vtjnash/Glob.jl/pull/21#issuecomment-456904098

If you want to fix #19, you need to write an updated algorithm for the glob readdir functions in this file that can successfully traverse extra directory levels when encountering a **, and then filter out any duplicates from the resulting list

hhaensel commented 3 months ago

Oh, I hadn't seen that implementation of @oxinabox. My implementation is indeed almost identical.

One way of supporting globstar syntax for glob() would be to define it completely differently:

function globstar(g::Union{AbstractString,Glob.GlobMatch}, prefix::AbstractString = "";
    relative::Union{Bool, Nothing} = nothing,
    topdown::Bool = true,
    follow_symlinks::Bool = true,
    onerror::Union{Function, Nothing} = nothing
)
    g = Glob.GlobMatch(g)
    relative === nothing && (relative = isempty(prefix))
    isempty(prefix) && (prefix = pwd())

    fn = FilenameMatch(join([fn isa AbstractString ? fn : fn.pattern for fn in g.pattern], "/"), PATHNAME)
    matches = String[]
    for (root, dirs, files) in walkdir(prefix; topdown, follow_symlinks, onerror)
        for file in files
            file = joinpath(root, file)
            relfile = relpath(file, prefix)
            relpattern = Sys.iswindows() ? replace(relfile, '\\' => '/') : relfile

            occursin(fn, relpattern) && push!(matches, relative ? relfile : file)
        end
    end
    matches    
end

With this definition you get

julia> globstar(glob"**/*.jl")
9-element Vector{String}:
 "atest.jl"
 "btest.jl"
 "ctest.jl"
 "noskip\\test1.jl"
 "noskip\\test2.jl"
 "noskip\\test3.jl"
 "noskip\\hh\\test3.jl"
 "skip\\test.jl"
 "skip\\test2.jl"

julia> globstar(glob"**/*.jl", ".")
9-element Vector{String}:
 ".\\atest.jl"
 ".\\btest.jl"
 ".\\ctest.jl"
 ".\\noskip\\test1.jl"
 ".\\noskip\\test2.jl"
 ".\\noskip\\test3.jl"
 ".\\noskip\\hh\\test3.jl"
 ".\\skip\\test.jl"
 ".\\skip\\test2.jl"

julia> globstar(glob"**/*.jl", pwd())
9-element Vector{String}:
 "C:\\Users\\hh\\test\\atest.jl"
 "C:\\Users\\hh\\test\\btest.jl"
 "C:\\Users\\hh\\test\\ctest.jl"
 "C:\\Users\\hh\\test\\noskip\\test1.jl"
 "C:\\Users\\hh\\test\\noskip\\test2.jl"
 "C:\\Users\\hh\\test\\noskip\\test3.jl"
 "C:\\Users\\hh\\test\\noskip\\hh\\test3.jl"
 "C:\\Users\\hh\\test\\skip\\test.jl"
 "C:\\Users\\hh\\test\\skip\\test2.jl"

Edit: fixed rootdir Edit 2: refactored to support prefix and other options for walkdir.

hhaensel commented 3 months ago

The option relative is also interesting in cases where you want relative paths with respect to a rootdir, e.g. for additional filtering.

hhaensel commented 3 months ago

P.S.: the above implementation only works with the current PR in place.

vtjnash commented 3 months ago

Yes, that seems somewhat right. It is less flexible than the current implementation (which avoids making any assumptions about the contents of what makes up a path and permits arbitrary matches in the sequence including regexes), but otherwise somewhat accurate to it

coveralls commented 3 months ago

Coverage Status

coverage: 93.559% (+1.9%) from 91.634% when pulling ba9dc2b7b614a685d52c769f75ea22cc2cf4ea84 on hhaensel:hh-globstar into 5956425511c9a30a2ccceb1861c21748005b85de on vtjnash:master.

hhaensel commented 3 months ago

One could define this function for handling string args and otherwise test whether the pattern vector contains "**" and doesn't contain regexes and only then call this function, otherwise keep the old version in place for compatibility. Tell me how to proceed.

hhaensel commented 3 months ago

There's a severe bug in the approach of my occursin(). Needs some rethinking.

vtjnash commented 3 months ago

Your code seems actually pretty close to right, since you do need to first generate a list of all candidates and then filter out the ones that don't match. To maintain the rest of the matching ability, you just need to replace occursin(fn, relpattern) with a call to splitpath(fn) and then write a matcher function to see if each path vector matches with all of the tail patterns, and if not, then to chop off one prefix from the path vector and try matching the tail pattern again, until either the path vector is exhausted or a match succeeds

hhaensel commented 3 months ago

It seemed very close, but it actually wasn't, because latest with multiple globstars the function would always fail.

So I reverted the changes and instead wrapped the main part of the function in a while loop which restarts the matching process with a new directory. Mutliple levels of globstars are supported by using an array of globstarmatches.

I also added directory matching and a fileonly mode.

Finally, I added a bunch of tests and hope that all edge cases are contained. Don't hesitate to add more if it seems useful.

There is one point I observed when matching the old way:

julia> glob(["a", r".", "c"])
4-element Vector{String}:
 "a\\.b\\c"
 "a\\.c\\c"
 "a\\b\\c"
 "a\\c\\c"

So it seems that one would have to set start and end markers in order to only match a single character for the middle directory.

coveralls commented 3 months ago

Coverage Status

coverage: 92.877% (+1.2%) from 91.634% when pulling fa30a052c1b8040866b47f79465e76935f72dc4e on hhaensel:hh-globstar into 5956425511c9a30a2ccceb1861c21748005b85de on vtjnash:master.

hhaensel commented 3 months ago

I think I got everything sorted out by now. Concerning the partial regex match, I think the best way is to adapt the docs and replace r"." by r"^.$"

hhaensel commented 3 months ago

Some final thought perhaps:

hhaensel commented 3 months ago

@vtjnash I continued thinking about my statement about multiple globstars. I thought that I had found an example that would not pass without my array queue but I couldn't repeat it. Mathematically I now think, it is not necessary to retry with a new position of the first wildcard as soon as the second one is reached. So not queuing saves a lot of computing. Therefore, I went back to simple indexing as you did with star and all tests pass. During that scan I found one error in my code with globstar_period which should be solved now.

coveralls commented 3 months ago

Coverage Status

coverage: 93.277% (+1.6%) from 91.634% when pulling 783e3d9efb4b1bb21eef8a335b85a21e8fc56c6b on hhaensel:hh-globstar into 5956425511c9a30a2ccceb1861c21748005b85de on vtjnash:master.

vtjnash commented 3 months ago

I just want to apologize that I have barely any time for development for the next couple weeks, but I should be fully back in July. Originally my expectation was that if someone wanted fancier options, they could construct the object manually, instead of using the macro to split it.

hhaensel commented 3 months ago

@vtjnash understood, thanks for the feedback.

I'm still not completely satisfied with the current status for various reasons

If ever you find the time for thinking about these points, just drop a note here

0x0f0f0f commented 1 month ago

Any news on this one?

vtjnash commented 1 month ago

Yeah, I am finally able to get back to this. Thank you for your patience and persistence!

I think we should break this down into a couple different PRs for easier discussion and review:

  1. adding globstar-like matching to fnmatch
  2. implementing globstar in glob
  3. considering adding extensions such as ! as an additional feature

Would you mind making a new PR with just the fnmatch improvements? We can keep this open until everything is merged, but it would help me with thinking through reviewing that it handles ** correctly. Secondly though, I see there are a couple of different, distinct interpretations of what globstar means. We will need to document what exactly we intend for this repository to implement. The zsh code is MIT licensed, so we can copy this text directly:

https://man.freebsd.org/cgi/man.cgi?query=zshexpn&sektion=1

Looks like the patterns to detect are **/, /**/, /**, and ** alone, or more concisely, it can be treated as being equivalent to r"(^/)**(/$)" for pattern detection. I appreciate that zsh then treats **/ as matching zero or more directories (instead of one or more as bash does), so I think that is worth copying from them.

some accept globstar but don't restrict '*'

This sounds like the expected behavior of using fnmatch without the FNM_PATHNAME flag. Is there an additional distinction?

defaulting to PATHNAME

I don't entirely know why I did that. Sounds like a bug actually, as I meant for it to generate separate fnmatch calls for each directory, such that each component of the glob was split on / but would match any filename it found in there, even if the filenames themselves could end up being arbitrary strings (e.g. if the content being matched was from something filesystem-like, but not actually a posix filesystem)

hhaensel commented 1 month ago

That sounds like a good proposal. I will come up with a globstar PR for fnmatch asap.