jprichardson / node-klaw

A Node.js file system walker with a Readable stream interface. Extracted from fs-extra.
MIT License
317 stars 41 forks source link

Not traversing subdirectories if some filter functions applied #17

Closed manidlou closed 1 year ago

manidlou commented 7 years ago

I was playing with #11 to see if I can find a solution, then I noticed another issue if filter function applied. This is different than #11 that the root directory is part of the result array although filter function is applied. This issue is about not traversing subdirectories at all if some filter functions applied, such as the following example.

Imagine we have

tmp
  |_dir1
  |   |_foo.md
  |_bar.md
  |_baz.md

and we want to get only .md files. If we use it like

var filterFunc = function (item) {
  return path.extname(item) === '.md'
}
var items = []
klaw('tmp', {filter: filterFunc}).on('data', function (item){
  items.push(item.path)
}).on('end', function () {
  console.dir(items)
})

the result array is ['tmp', 'bar.md', 'baz.md'].

So, I checked the code again and based on what I understood it happens because when a filter function is passed, all contents of the root directory pass through the filter function and since return path.extname(item) === '.md' fails for all subdirectories , so none of them will be read. Therefore, only items in the root directory itself are returned. Please correct me if I am wrong.

Edit

However, if we run the same example using pipe for filtering, everything is just fine. Apparently, the problem only arises when filter function is used.

var filter = through2.obj(function (item, enc, next) {
  if (path.extname(item.path) === '.md') this.push(item)
  next()
})

var items = []
klaw('tmp')
  .pipe(filter)
  .on('data', function (item) {
    items.push(item.path)
  })
  .on('end', function () {
    console.dir(items) // => ['bar.md', 'baz.md', 'dir1/foo.md']
  })
manidlou commented 7 years ago

In general, wouldn't be better to remove/deprecate the use of filter function and instead point users to use piping for filtering?

It seems like using filter function in a stream interface can be problematic. Please see my comment above. By removing/deprecating the filter function, users still be able to achieve the filter functionality by using either through2 or node native streams with correct results.

What are your thoughts on this?

amadsen commented 7 years ago

The only reason to use the filter function instead of piping through a filter stream would be to avoid calling lstat on paths you don't care about. In many cases this will be a premature optimization, but it can be useful.

This is actually also the reason the filter function causes not traversing subdirectories - they have been filtered out by name before checking to see if they are a directory.

I would argue that this is expected behavior and document it specifically. For most use cases (like the one described in this issue) a filter stream is what you actually want. For some specific use optimizations where early termination (especially skipping whole directory trees) is important a piped filter stream will not work and the filter function is exactly what is needed.

RyanZim commented 1 year ago

Closing as expected behavior, PR welcome to clarify docs here.