Allow traversal of symbolic links with find*

jwiegley commented 12 years ago

I'm writing a utility that needs to descend into all directories under a directory given by the user. It shouldn't matter whether these are symbolic links or real directory. However, with Shelly's findFold, I can see no way to descend into symbolic links to directories.

gregwebs commented 12 years ago

Yeah, symlinks are intentionally filtered out right now. One problem is that they can lead to circular (never-ending) finds. The symlink filtering can probably be made just a normal directory filter. But I want this filter on by default. Can you suggest an API to support both use cases?

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

Yeah, symlinks are intentionally filtered out right now. One problem is that they can lead to circular (never-ending) finds. The symlink filtering can probably be made just a normal directory filter. But I want this filter on by default. Can you suggest an API to support both use cases?

How about something along the lines of:

setFollowSymlinks :: Bool -> Sh ()

And then I could do this:

foo :: Sh [FilePath]
foo = shelly $ do
  setFollowSymlinks True
  find (fromText "/tmp")

I'm not sure if that's the best way, but it avoids having to create a new family of finding functions, or having to pass an option down to them all.

John

gregwebs commented 12 years ago

I wouldn't want to add such a setting into the core of Shelly. But this is a good idea: we could define some setting for the finder that can be build up in a monad, might look like:

shelly $ do
  finder "/tmp" $ do
    followSymlinks True

This feels right for Shelly, but I am not sure if this is better than the record approach

  finder "/tmp" def {followSymlinks = True}

Will have to think some more...

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

I wouldn't want to add such a setting into the core of Shelly. But this is a good idea: we could define some setting for the finder that can be build up in a monad, might look like:

shelly $ do finder "/tmp" $ do followSymlinks True

FWIW, I think this is a really nice choice. I'd like to then have other methods, such as:

excludePaths :: [FilePaths] -> ShFind ()

To provide a set of paths not to be traversed. You would be able to create a fairly rich vocabulary for setting up the "search state", and this would then allow you to reduce the number of "find" functions. You wouldn't need findDirFilter if you had a "dirFilter" function in the monad.

John

gregwebs commented 12 years ago

It doesn't look like I am going to be able to get to this for a few weeks. Let me know if you take a crack at it. You can make Shelly work how you want now by just removing the line that checks for the symlink: https://github.com/yesodweb/Shelly.hs/blob/master/Shelly/Find.hs#L69

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

It doesn't look like I am going to be able to get to this for a few weeks. Let me know if you take a crack at it. You can make Shelly work how you want now by just removing the line that checks for the symlink: https://github.com/yesodweb /Shelly.hs/blob/master/Shelly/Find.hs#L69

I'm afraid my Haskell skills are just at the beginning stages. I started to give it a shot, but fell flat on how to properly construct a sub-Monad (?). I'll just be patient. :)

Thanks, John

gregwebs commented 12 years ago

I think you want a Writer monad (use runWriter), but I think the monad is just going to build up a record, so a record-based interface is probably where I would start.

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

I think you want a Writer monad (use runWriter), but I think the monad is just going to build up a record, so a record-based interface is probably where I would start.

Here's an idea: Create a DSL using the Free monad, to describe what it is you want to find, and it's context. Shelly could traverse this AST to determine how many stat calls are required, the most optimal properties of the search, etc., and then do the impure applications in the most efficient way.

findWhen :: ShPred Bool -> FilePath -> [FilePath]

findWhen's first argument would be a function that takes a FilePath, and returns a new "ShPred Bool" monad. I could then do something like this:

match :: (FilePath -> Bool) -> ShPred Bool

findWhen (match $ (== "foo") . filename) "/tmp"

Note that the Sh monad doesn't come into play in the call to match, since ShPred simply describes the desired search.

In addition to matching, ShPred would support declarative statements giving properties for the search:

findWhen (\path -> do followSymlinks
                      crossFilesystems
                      consider path
                      match $ (== "foo") . filename) "/tmp"

(I can imagine followSymlinksWhen, crossFilesystemWhen, etc).

If I test for something to be a regular file, Shelly would know it doesn't have to pass directories to the next matcher. Of course, there'd need to be a pure version of test_f in ShPred:

findWhen (\path -> do consider path
                      match test_f
                      match $ (== "foo") . filename) "/tmp"

How does that sound?

John

gregwebs commented 12 years ago

I like the idea of re-using the filter to set options. test_f, etc can't be made pure unless find is redesigned to hold the necessary file stat information with each file.

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

I like the idea of re-using the filter to set options. test_f, etc can't be made pure unless find is redesigned to hold the necessary file stat information with each file.

I was thinking that the "filter" function passed to find would be a pure value, in the ShPred free monad. The find function would then have an two-pass "interpreter" for this monad: Pass one would gather information about what impure actions need to be taken (how many stat calls, whether to descend into a symlink, etc), and then the second pass would actually perform those actions to achieve the test.

So the "test_d" in the predicate isn't a function that gets called, it just builds up the predicate's AST. It's the ShPred interpreter that will do a stat and then test for directory. I don't the impure version of test_d would even be called in this scenario.

John

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

I like the idea of re-using the filter to set options. test_f, etc can't be made pure unless find is redesigned to hold the necessary file stat information with each file.

Hi Greg,

I've created a version of Find.hs that uses Arrows for the predicate function instead of monadic functions in Sh.

http://ftp.newartisans.com/pub/FindArrow.hs

The FindArrow type in that file observes and maintains meta-state, such as:

Whether Stat needs to be called
Whether a directory's entries needs to be read
Whether symbolic links should be traversed
Whether filesystem boundaries should be crossed

Here is the type declaration:

data FindArrow a b =
  FindA {
    callsStat        :: Bool,
    readsMembers     :: Bool,
    traverseSymlinks :: Bool,
    crossFilesystems :: Bool,
    runFindArrow     :: a -> b
  }

Meta-state is changed by using functions cognizant of the arrow. Right now I've only implemented one to show proof of concept:

isDirectoryA :: FindArrow FilePath (Sh Bool)
isDirectoryA = FindA True False False False $ test_d

We'd likely use a combinator for this pattern:

match = FindA True False False False

Here's an example of a predicate which calls stat for each filesystem entry:

statCallingPred :: FindArrow FilePath (Sh Bool)
statCallingPred = proc path -> do
  isDir <- match test_d -< path
  returnA -< isDir

Here's a predicate that doesn't need stat to be called at all:

statFreePred :: FindArrow FilePath (Sh Bool)
statFreePred = proc path -> do
  returnA -< return $ filename path == ".git"

Again, for typical predicates, like comparing against filename parts, a family of combinator functions would be preferable to this kind of boilerplate.

Finally, here is the sample code that uses them:

main :: IO ()
main = do
  print $ callsStat statCallingPred
  print $ callsStat statFreePred
  files <- shelly $ findWhenA statFreePred "/Users/johnw/Projects"
  print files

The result:

*FindState> main
True
False
[FilePath "/Users/johnw/Projects/emacs/.git",FilePath ...]

If this is of interest to you, I'm sure we can find better ways of fleshing it out and optimizing it.

John

gregwebs commented 12 years ago

This looks great! One concern I have is that shelly is being used by new users. Arrows are actually something that I have barely used (I think only in Hakyll). Applicative on the other hand is much more common place. So if it is possible to configure with Monads + Applicative or otherwise something more beginner friendly that would be preferable. I am a bit snowed in now, but will be able to look in depth at least by this weekend.

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

This looks great! One concern I have is that shelly is being used by new users. Arrows are actually something that I have barely used (I think only in Hakyll). Applicative on the other hand is much more common place. So if it is possible to configure with Monads + Applicative or otherwise something more beginner friendly that would be preferable. I am a bit snowed in now, but will be able to look in depth at least by this weekend.

Hmm, it would seem there is no good way with Monads to maintain the sort of meta-state we need, for doing the sort of optimizations we want to do. This exact type of application is one of the motivating use cases for Arrows, and has been treated in some detail here:

http://ertes.de/new/tutorials/arrows.html

See his discussion on "static meta-information". With arrows, we can know from the arrow value before executing it whether stat is required; but with Monads, we can't know that stat is needed until effects have been computed at least once (if, say, we were using the State monad, and had defined "match" to get/put a "needsStat" boolean in the current state).

I do agree that forcing arrow notation may be a bit heavy-handed, but in fact it does provide the perfect abstraction for a pure predicate function that is able to operate in an impure context. For the casual user I think it's a pretty minimal invasion with the right helper functions. But for the power user, it opens up whole new vistas of expressiveness.

Here's a predicate that matches on directories that are both readable and executable:

findWhenA (match test_d >>> match test_x >>> match test_r) "~"

This would be impossible to express in point-free style using monads and applicative. Here is the best I could come up with:

findWhenA (\path -> do
                 (\x y z -> x && y && z) 
             <$> test_d path
             <*> test_x path
             <*> test_r path) "~"

But this is not as easily composable. I have to encode logic in the inner lambda, and tie it to a specific number of operations. With arrows, I can build up a library of combinatorial predicates and compose them at will:

isDotGit :: FindArrow FilePath (Sh Bool)
isDotGit = proc path -> do
  returnA -< filename path == ".git"

isReadableDirectory :: FindArrow FilePath (Sh Bool)
isReadableDirectory = match test_d >>> match test_r

findwhenA (isReadableDirectory >>> isDotGit) "~"

Plus, I still get the optimization benefits: I know before performing any filesystem I/O whether "isDotGit >>> isReadableDirectory" is going to require a call to "stat". And we can add other meta-inspections in future without any changes to the existing interface.

John

gregwebs commented 12 years ago

Thanks for thinking about the alternatives. One more thing to look at would be filemanip which uses a monadic DSL. Initially I thought it quite heavy-handed to be points-free without mentioning the filename, but I think it is helping perform some of the optimizations around avoiding file stats. What I was hoping to have was a find where one could use the normal functions like filename and extname, rather than ones that duplicate them for the purposes of the DSL. We can actually create typeclass instances for functions that require parameters, so it may be possible to use the normal functions as find predicate combinators and even stay points-free.

jwiegley commented 12 years ago

Ah, I see, creating a new family of operators to allow points-free expressions in the Monad. Is it really optimizing calls to stat?

Also, why not just using filemanip, then, rather than duplicate that functionality in Shelly?

jwiegley commented 12 years ago

On further reflection, there is no reason why the filemanip-style operators could not be implemented on top of arrows. This would hide arrows for all of those use cases, and yet provide that avenue for those who wish to explore it. For example:

type ShFind = FindArrow FilePath (Sh Bool)

(&&?) :: ShFind -> ShFind -> ShFind

Now you can join two predicate arrows with a short-circuiting (&&). And likewise for ==? and the other types of operators.

jwiegley commented 12 years ago

Greg,

I've been thinking more about how to optimally use monads instead of arrows, and I think I've come full circle back to using a Free monad.

The problem of optimizing pattern matches against structured data is one that is pretty well solved. The Clang compiler team just created their ASTMatcher DSL, for finding subsets within a C++ AST and extracting desired values from it. The XML world has XPath. Here's how I'd look for all files named .git/config lacking mode 0600 in my home directory, assuming the filesystem contents had been read into an XML schema:

/dir[name == "Users"]/dir[name == "johnw"]//dir[name == ".git"]
  /file[name == "config" && mode != "0600"]

In both this and the ASTMatcher case, the predicate doesn't do the matching directly. If that were the case, every single filesystem entry would have to be visited, and the predicate called for each. Rather, this approach yields a data structure which the traversal algorithm consults as it goes along, to direct and prune the candidate set. In the XPath case, recursive traversal only begins under /Users/johnw, and no files are considered until we're under a ".git" directory. Further, stat() is never used -- only readdir() on directory entries of directories. The only optimization potential lacking in this XPath query is an indication that I'm not interested in recursing under the ".git" directories themselves for other ".git" directories.

Now imagine the following DSL in the Free Monad:

do isDirectory
   name == "Users"
   descend
   isDirectory
   name == "johnw"
   traverseSymlinks
   recurse
   isDirectory ||? isSymlink
   name == ".git"
   truncate
   descend
   isRegularFile ||? isSymlink
   name == "config"
   mode /= 0600
   end

This would assemble a series of instructions to be interpreted by findWhen as it considers what to do at each step along the traversal. The "truncate" instruction informs it to abort any recursion currently in progress (at that level). (We could even have another instruction which uses continuations to abort back to the context that started the recursion, pruning that entire directory tree all the way back to /Users/johnw). The "traverseSymlinks" instruction tells it that beyond that point in the search, it's OK to recurse through directory symlinks.

The advantages of this approach are that it's monadic, and allows for not only powerful predicates, but contextually aware predicates, something that gets very awkward to do with GNU find on the command-line (but which XPath does quite nicely).

The downside is the extremely undiomatic nature of this code compared to, say, filemanip. If we assume an implicit "recurse" at the top of every predicate, then looking for all ".git" directories in my home directory (without recursing into any .git) would be:

findWhen (do traverseSymlinks; name == ".git"; truncate)
         "/Users/johnw"

What do you think of this direction? It also keeps the predicate function entirely pure. The Sh monad plays no part there now. And we can build alternate interpreters: for example, to print a debug trace of how the predicate is being used during the traversal, to find out if it's really being optimal, or why it's not finding something we know to be there.

John

gregwebs commented 12 years ago

This looks very interesting also. The reason I wrote my own finds was to have something that fit in with shelly without having the user learn a new family of functions for the finding but instead just use existing shelly functions. I am wondering if shelly could execute the data structure you have in mind that describes the find traversal. The actual DSL above and/or the arrow version would be in a separate package that I can point users to. Perhaps shelly-extra if you don't want to maintain your own package. I am trying to keep shelly itself small & consistent. It might be ok for its finder to be inadequate for power users if I just direct them on how to use a better package.

jwiegley commented 12 years ago

Greg Weber notifications@github.com writes:

This looks very interesting also. The reason I wrote my own finds was to have something that fit in with shelly without having the user learn a new family of functions for the finding but instead just use existing shelly functions. I am wondering if shelly could execute the data structure you have in mind that describes the find traversal. The actual DSL above and/or the arrow version would be in a separate package that I can point users to. Perhaps shelly-extra if you don't want to maintain your own package. I am trying to keep shelly itself small & consistent. It might be ok for its finder to be inadequate for power users if I just direct them on how to use a better package.

Actually, I've been meaning to write an abstract library for reading, searching, writing and comparing arbitrary hierarchies for some time (like, reading a filesystem into a database, and then comparing a later version against it to see how the files have changed). For such a library I need a querying facility that can do pruning and behave optimally.

So I think that if I solve this problem in the general, I can create a finding library from it that will then kill two birds, plus give me some facilities I've been wanting in other areas.

John

gregwebs commented 11 years ago

I like the free monad version (perhaps free applicative might be more appropriate, but monads make for nicer DSLs) and I think it could be included in Shelly by default.

gregwebs / Shelly.hs

Allow traversal of symbolic links with find* #21