soldair / node-walkdir

Walk a directory tree emitting events based on the contents. API compatible with node-findit. Walk a tree of any depth. Fast! Handles permission errors. Stoppable. Windows support. Pull requests are awesome. Watchers are appreciated.
MIT License
130 stars, 22 forks

Hard linked files are sometimes missed #29

Open lbmiller opened 8 years ago

lbmiller commented 8 years ago

When two directories contain files with the same name that are hard links to each other, the second one is not reported. For example, the tzdata package installs many copies of the same files into /usr/share/zoneinfo, so files such as /usr/share/zoneinfo/GMT0 and /usr/share/zoneinfo/Etc/GMT0 (among many other examples) are hard links.

Both GMT0 and Etc/GMT0 should be reported when traversing /usr/share/zoneinfo.
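A minimal sketch for reproducing the layout without installing tzdata. The temp directory prefix and the helper name `makeHardLinkedTree` are made up for illustration; only standard `fs`, `os`, and `path` calls are used.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Build a throwaway tree that mimics the zoneinfo layout:
// GMT0 and Etc/GMT0 are two names for the same inode.
function makeHardLinkedTree() {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'walkdir-hardlink-'));
  fs.mkdirSync(path.join(root, 'Etc'));
  const first = path.join(root, 'GMT0');
  const second = path.join(root, 'Etc', 'GMT0');
  fs.writeFileSync(first, 'tzdata placeholder');
  fs.linkSync(first, second); // hard link, not a symlink
  return { root, first, second };
}

const tree = makeHardLinkedTree();
const a = fs.statSync(tree.first);
const b = fs.statSync(tree.second);

// On a filesystem that supports hard links, both names share one inode
// and the link count covers both names.
console.log('same inode:', a.ino === b.ino, 'nlink:', a.nlink, b.nlink);
```

Walking `tree.root` with walkdir should then show whether both paths are reported.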

lbmiller commented 8 years ago

The file's basename and inode are tracked in a hash, and a later file is not reported if it has the same inode and basename. I believe that map is there to help when following a symlink whose target directory would also be traversed on its own. In this case the match is a false positive. One solution (though not perfect) is to ignore the match in the hash if the target file is actually a hard link. This narrows one corner case, but still leaves other, more obscure corner cases.
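To make the false positive concrete, here is a simplified model of the check described above. It is not walkdir's actual code: `makeSeenFilter`, the key format, and the inode number are assumptions for illustration.

```javascript
const path = require('path');

// Simplified model: remember every (inode, basename) pair already reported
// and refuse to report the same pair twice.
function makeSeenFilter() {
  const seen = new Set();
  return function shouldReport(filePath, stat) {
    const key = stat.ino + ':' + path.posix.basename(filePath);
    if (seen.has(key)) return false; // treated as "already seen"
    seen.add(key);
    return true;
  };
}

const shouldReport = makeSeenFilter();
const hardLinked = { ino: 1234 }; // made-up inode shared by both names

// Two different paths, same inode, same basename: the second one is dropped.
console.log(shouldReport('/usr/share/zoneinfo/GMT0', hardLinked));
console.log(shouldReport('/usr/share/zoneinfo/Etc/GMT0', hardLinked));
```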

I am providing two possible (imperfect) solutions for your review.

  1. Skip the duplicate check when the file is a real hard link (nlink > 1), so that both names are reported: https://github.com/lbmiller/node-walkdir/tree/hardlinks
  2. The same, but disabled by default; opt in with the {report_hard_links:true} option: https://github.com/lbmiller/node-walkdir/tree/hardlinks-opt
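A rough sketch of the idea behind both branches, written against a simplified model rather than the code in the linked branches. The function name `makeHardLinkAwareFilter` and the key format are hypothetical; `report_hard_links` is the option name from the second proposal, and `nlink` is the standard link count on `fs.Stats`.

```javascript
const path = require('path');

// Hypothetical filter: same (inode, basename) bookkeeping as before, but a
// match is ignored when the file has more than one hard link.
function makeHardLinkAwareFilter(options) {
  const reportHardLinks = Boolean(options && options.report_hard_links);
  const seen = new Set();
  return function shouldReport(filePath, stat) {
    const key = stat.ino + ':' + path.posix.basename(filePath);
    const duplicate = seen.has(key);
    seen.add(key);
    if (!duplicate) return true;
    // Duplicate key: report anyway only for real hard links, and only if
    // the caller opted in.
    return reportHardLinks && stat.nlink > 1;
  };
}

const optIn = makeHardLinkAwareFilter({ report_hard_links: true });
const linked = { ino: 1234, nlink: 2 }; // made-up stat values
console.log(optIn('/usr/share/zoneinfo/GMT0', linked));
console.log(optIn('/usr/share/zoneinfo/Etc/GMT0', linked));
```

Making the option always on gives the behavior of the first proposal. The known gap remains: a hard-linked file that is also reached a second time through a symlinked directory would be reported twice.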

I suspect a more complete solution would require keeping track of symlinks encountered during the scan and then resolving and reporting any dangling symlink targets after all other files are reported.

soldair commented 8 years ago

I'm not sure I understand how to use nlink the way you propose, but I'll check it out. Thanks for the examples.

The reason we keep the hash is to prevent infinite recursion when following symlinks to directories. The current behavior is too broad, in that we never report a file we have seen before.

We should probably change the behavior to never list a directory we have listed before and bump the major version.
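One way that directory-only rule could look, as a standalone sketch rather than a patch to walkdir. `walkSync` is a hypothetical helper, and keying directories on `dev` plus `ino` is an assumption that would not hold on platforms where `ino` is empty.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical walker: follows symlinks, reports every file path, and only
// de-duplicates directories (keyed on device + inode).
function walkSync(root) {
  const listedDirs = new Set();
  const files = [];

  (function visit(dir) {
    const dirStat = fs.statSync(dir); // statSync follows symlinks
    const key = dirStat.dev + ':' + dirStat.ino;
    if (listedDirs.has(key)) return; // already listed: breaks symlink loops
    listedDirs.add(key);

    for (const name of fs.readdirSync(dir)) {
      const full = path.join(dir, name);
      let stat;
      try {
        stat = fs.statSync(full);
      } catch (err) {
        continue; // dangling symlink or permission error: skipped here
      }
      if (stat.isDirectory()) visit(full);
      else files.push(full); // files are never de-duplicated
    }
  })(root);

  return files;
}
```

With this rule, hard-linked files such as GMT0 and Etc/GMT0 are both reported, while a symlink that points back at an ancestor directory is not descended into a second time.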

The basename thing seems like a bug disaster waiting to happen on Windows, where ino is empty.

MarkDuckworth commented 7 years ago

For what it's worth, I'm seeing similar behavior walking node_modules on a Windows 8.1 machine. I'm not sure whether hard links are at play in my case, but we see files with the same name occasionally skipped. As a simple example, consider a node_modules folder that includes node_modules/a_module/index.js and node_modules/b_module/index.js: one of these files may be skipped.
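One possible explanation, offered as a guess rather than a diagnosis: if `ino` is empty on Windows, as suggested earlier in the thread, an (inode, basename) key degenerates to the basename alone. The helper `seenKey` and the value `0` for `ino` are assumptions for illustration.

```javascript
const path = require('path');

// Hypothetical key function modelled on the (inode, basename) check.
function seenKey(filePath, stat) {
  return String(stat.ino) + ':' + path.posix.basename(filePath);
}

// Assumed stat values for a platform that does not report inodes.
const noInode = { ino: 0 };

const keyA = seenKey('node_modules/a_module/index.js', noInode);
const keyB = seenKey('node_modules/b_module/index.js', noInode);

// Unrelated files collide because only the basename distinguishes them.
console.log(keyA, keyB, keyA === keyB);
```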

lbmiller commented 7 years ago

@MarkDuckworth I'm curious whether either of my proposed fixes would work in your situation. Are you able to try them? Also, in your case, are the contents of the two files the same?