leo-arch / clifm

The shell-like, command line terminal file manager: simple, fast, extensible, and lightweight as hell.
https://github.com/leo-arch/clifm/wiki
GNU General Public License v2.0

p and pp could show the number of files #273

Closed muellerto closed 2 months ago

muellerto commented 3 months ago

Is your feature request related to a problem? Please describe.
The p and pp commands could show the number of contained files when the operand is, or points to, a directory.

Describe the solution you'd like
The p command should count only the direct children, while pp should be fully recursive, using the same algorithm as for determining file sizes. Only one additional row is needed in the output:

files: 3736

or clearer:

subdirs/files: 385/27362

OK, it could be even more detailed if symbolic links are also taken into account:

subdirs/files: 385/27362, symlinked subdirs/files: 12/5

One question is whether to follow the symlinks. But you already faced the same question when determining the overall size of the contained files. I wouldn't implement something new here; you can give the same answer for counting.

Describe alternatives you've considered
I saw the stats command showing a number of files, but this seems to always apply to the current directory and is not recursive.

One of the easiest alternatives is to open fzf with no parameters in such a directory; this shows the number of files very fast. You can also call find | bat to get a number (bat shows line numbers by default, so the last line number is the count).

Additional context
It's not that I count my files all day. But sometimes you need to know whether an additional file has appeared, so you need exact numbers.

leo-arch commented 3 months ago

Hey @muellerto! I like the idea.

In its most basic form (number of files, only 1 level of recursion, without distinguishing between files and dirs), it can be done straightaway: we already have this info in the files counter.

Distinguishing between files and dirs (still 1 level of recursion) already requires extra computation (open dir, check file type, count).

Full recursion, well, that's quite a bit harder. Of course we can rely on some external tool (we do this for recursive size after all), but I try to avoid this as much as possible to minimize dependencies. On top of that, you always have the symlinks problem: what if, for example, a symlink points to a parent dir of the current dir?

Let me see what I can do.

muellerto commented 3 months ago

In its most basic form (number of files, only 1 level of recursion, without distinguishing between files and dirs), it can be done straightaway: we already have this info in the files counter.

Distinguishing between files and dirs (still 1 level of recursion) already requires extra computation (open dir, check file type, count).

I thought a simple stat() call would be sufficient. stat() gives you the size, the timestamps, and the file type of any directory entry without opening it.

Full recursion, well, that's quite a bit harder. Of course we can rely on some external tool (we do this for recursive size after all), but I try to avoid this as much as possible to minimize dependencies.

I haven't looked into your code yet. But writing a recursive algorithm based on stat() calls isn't rocket science. stat() probably gives you all you need. (What you don't get from stat() is where a symlink points to.) There's no need for an external tool here. I checked find, but it seems there's no -count option, so you need a second helper. A way that at least gives a plain number is find | wc --lines.

On top of that, you always have the symlinks problem: what if, for example, a symlink points to a parent dir of the current dir?

Yeah, but that's always a non-trivial problem. I could imagine that a lot of tools (find, ls -R, cp -R ...) also run into trouble in such a situation. stat() (or rather lstat(), if you don't want symlinks followed) tells you whether a directory entry is a subdirectory, an ordinary file, or a symlink. Counting these three types in three separate integers would be a good start; see the sketch below.
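
A rough sketch of what I mean (hypothetical code, not taken from clifm's sources):

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

struct counts { long dirs, files, links; };

/* Recursively count subdirectories, ordinary files, and symlinks in
 * three separate integers. lstat(2) is used, so symlinks are counted
 * as symlinks instead of being followed. */
static void count_tree(const char *dir, struct counts *c)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return; /* unreadable dir: its contents stay uncounted */

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);

        struct stat s;
        if (lstat(path, &s) == -1)
            continue; /* entry of unknown type */

        if (S_ISLNK(s.st_mode)) {
            c->links++;          /* counted, never followed */
        } else if (S_ISDIR(s.st_mode)) {
            c->dirs++;
            count_tree(path, c); /* recurse into real directories only */
        } else {
            c->files++;
        }
    }
    closedir(d);
}

int main(int argc, char **argv)
{
    struct counts c = {0};
    count_tree(argc > 1 ? argv[1] : ".", &c);
    printf("subdirs/files: %ld/%ld, symlinks: %ld\n",
           c.dirs, c.files, c.links);
    return 0;
}
```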

BTW: you could even put this into threads. This is where fzf gets its speed from. If the start directory has five subdirectories you could start five threads in parallel for this.

leo-arch commented 3 months ago

I thought a simple stat() call would be sufficient

It is, but that's already extra computation. As it stands now, we have the number of files in directories, but not their types (nor any other file-specific info). To get this, we need to open the corresponding directory and run stat(2) over each entry to get the file type, among other things.

writing a recursive algorithm based on stat() calls isn't rocket science

True. Indeed, we already have an algorithm to recurse into directories (xdu), but it is only used as a basic du(1) alternative to get total dir sizes. Not much is required to make it count files (including file types and other info). But, again, this is extra computation.

Counting these three types in three separate integers would be a good start.

Agreed, provided you're not following symlinks. But following them is what pp is supposed to do, so the symlinks problem cannot be overlooked that easily.

you could even put this into threads.

Granted. This would be a lot more efficient. But even before getting into threads, we need to solve the above issues. First, we write the basic algorithm, and only then we split it into threads.

As you can see, while not rocket science, true, it's far from trivial (especially when it comes to recursion). But, again, it's a nice feature to have.

leo-arch commented 3 months ago

In case you're interested, the code for the p/pp command is here:

https://github.com/leo-arch/clifm/blob/8b6af1f04f692ca5effb2d04395f4cb592af3653/src/properties.c#L1695

The number of files in each directory is stored in the filesn field of the global file_info struct, and is retrieved while processing the files to be listed on the screen (list_dir() in listing.c) via the count_dir() function, in aux.c.

leo-arch commented 3 months ago

First implementation. A few notes:

  1. Files are counted only via pp (p is not recursive and therefore should not descend into directories at all).
  2. The symlinks problem is a non-problem: if a symlink points to a parent directory of the directory being analyzed, we don't care (it falls outside our current scope) and the link is counted as a simple file; and if it points somewhere under the directory being analyzed, that content will be analyzed anyway, because our function is recursive, so the symlink is again counted as a simple file.
  3. The code is fast enough and takes fewer than 100 LOCs (including blank lines). Nice!

For the time being (recursion is quite tricky), the code is hidden behind a compile-time macro (CLIFM_DIR_INFO). To enable it, compile as follows:

```
git clone ...
cd clifm
gcc -DCLIFM_DIR_INFO -o clifm src/*.c -O2 -lreadline -lcap -lacl -lmagic
./clifm
```

Give it a try and let me know what you think.

muellerto commented 3 months ago

Give it a try and let me know what you think.

Yeah! I slept and you worked.

Looks good. There seem to be two algorithms for getting the overall file size and for getting the numbers. The first one seems to be generally slower (Update: on Linux it's much faster); counting the files is blazing fast. But this could also be a consequence of having all the info in the cache after the first run. The influence of the cache always matters in timing questions.

The Windows Explorer (I never use it for anything, but it's always a reference implementation because everyone has it) mostly shows a few more files. If clifm has 72945, the Explorer says 72952. But in the tree I also have some links, and I don't know how these are currently handled. The number of directories is always equal.

I think you need an option for dereferencing symlinks or not. pp can be used to see whether a medium has enough space for a bunch of files. But then the information must be calculated with respect to how much will actually be copied. When you copy with cp -iRp (p is --no-dereference) you must also have the chance to run pp, in both algorithms, with some kind of --no-dereference option. Yes, symlinks always make it complicated. But they are so useful ...

muellerto commented 3 months ago

An illustration for my last paragraph:

[screenshot]

The two directories should have the same content, but the original 18_X has symlinks in it, while the copy 18_X_COPY has dereferenced symlinks (ordinary directories and files), so the number of files is much bigger. The copy was made with cpCmd=5, which is rsync -alP. All understandable, but here we have an example where pp and c do not match: the size pp calculates is 50 MB smaller than what c actually needs.

leo-arch commented 3 months ago

Hi @muellerto!

There seem to be two algorithms for getting the overall file size and for getting the numbers.

True. We use du(1) to get dir sizes (since we can consider this tool as the most reliable way to do the job). But, since du does not count files, we need to use a built-in algorithm (quite fast, as you can see) for this job.

If clifm has 72945 the Explorer says 72952

Not sure how Explorer is doing the count exactly, but I'll keep an eye on it. Currently doing some tests with Thunar.

The two directories should have the same content, but the original 18_X has symlinks in it, while the copy 18_X_COPY has dereferenced symlinks (ordinary directories and files), so the number of files is much bigger.

As far as I can see, this is actually expected: symlinks (in your original directory) should be taken as ordinary files, while, in the second directory, since these symlinks are dereferenced, directories are followed and files in each of them counted (which explains the size/count difference). In favor of this approach we have the du(1) output: the first directory is smaller than the second one precisely because of this. Needless to say, you can do your own tests running du on each of those directories and taking note of the results.

Btw, this is the algorithm we're using to count files. Dead simple, short, and efficient: according to my tests, it's even better than Thunar's (which, unlike our implementation, does not count directories for which we have no read access, though the directory is actually there and should definitely be taken into account). https://github.com/leo-arch/clifm/blob/e7f723c9ae4c80ec769c27db32fa392dc82cba90/src/properties.c#L1698

Please take a look at the code and, if possible, point out where it can be improved. I think its logic is correct: walk the directory tree, counting dirs (even if we cannot access them) and files separately. If stat(2) fails for some entry (in which case we cannot know the file's type), we count it as an ordinary file (instead of ignoring it), simply because an ordinary file works as a catch-all for all file types. Do note that an exclamation mark at the left of the counters means that some directory couldn't be read, and that the count might therefore not be accurate.
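
In schematic C, the rules just described come down to something like this (a simplified paraphrase, not the linked code):

```c
#include <dirent.h>
#include <stdbool.h>
#include <sys/stat.h>

struct dir_counts { long dirs, files; bool incomplete; };

/* How a single entry is counted: directories are counted even when
 * unreadable; anything whose type we cannot determine is counted as
 * an ordinary file (the catch-all type) rather than skipped. */
static void count_entry(const char *path, struct dir_counts *c)
{
    struct stat s;
    if (lstat(path, &s) == -1) {
        c->files++;  /* lstat(2) failed: type unknown, count as file */
        return;
    }

    if (!S_ISDIR(s.st_mode)) {
        c->files++;
        return;
    }

    c->dirs++;       /* counted even if we can't read it... */
    DIR *d = opendir(path);
    if (d == NULL)
        c->incomplete = true; /* ...but flag the count with '!' */
    else
        closedir(d); /* (the real walker recurses here instead) */
}
```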

I'll keep testing this feature. It would be really nice to have some faithful reference, like an old/reliable program doing exactly this against which we can compare our results. Still looking for it.

muellerto commented 3 months ago

There seem to be two algorithms for getting the overall file size and for getting the numbers.

True. We use du(1) to get dir sizes (since we can consider this tool as the most reliable way to do the job). But, since du does not count files, we need to use a built-in algorithm (quite fast, as you can see) for this job.

Yes, but when you have two algorithms you always have the chance of different behavior. It would be easy now to integrate the summing up of the file sizes into the same stat()-based algorithm which also counts these files, and you could then make this recursive algorithm behave as the user needs it. When the user wants dereferenced symlinks you could provide this, if not then not; the results would always be consistent. And, as I said, the user could indeed need both, depending on how he wants to run other commands like copying or moving.

If clifm has 72945 the Explorer says 72952

Not sure how Explorer is doing the count exactly, but I'll keep an eye on it. Currently doing some tests with Thunar.

I'm sure this also depends on these symlinks. I checked now and saw that a symlink to a directory is neither counted by the Explorer nor dereferenced; it's just not there. A symlink to a file is counted, don't ask me why. This is not to be understood as a strict reference, it's just the way the M$ guys do it. I think it's rather important to have a behavior which is transparent, explainable, reliable and useful.


Please take a look at the code and, if possible, point out where can it be improved.

I will do so.

I'll keep testing this feature. It would be really nice to have some faithful reference, like an old/reliable program doing exactly this against which we can compare our results. Still looking for it.

The problem is that you will probably not find a majority among the implementations which could be declared as a reference. There are always differences.

Once I got a bug report about my application showing totally!!! wrong!!! file times. So I checked what the Explorer displays, because that's always what they call "right", and sure enough, the Explorer had different file times. But the dir command of cmd.exe had yet other times. And the file manager I used at the time (Multi Commander) showed still other file times. They all did mysterious conditional calculations including or excluding DST. I learned that my file times were all correct UTC, but one or two hours wrong for the user. So - which one is right?

You can probably investigate how others do it. But since you provide copy and move operations (as every file manager does), it would be useful if pp kept the behavior of the configured cpCmd and mvCmd in mind. When pp counts 347 files and you then copy them somewhere, you probably want to find 347 files when copying is completed. The same goes for file sizes: when pp says the user has 59.8GB, you probably expect it to fit on a 64GB memory stick, but this depends on how the copy operation is done. (OK, when the user uses something else to make the copy, he must expect the counts and sizes to differ.)

leo-arch commented 3 months ago

I think it's rather important to have a behavior which is transparent, explainable, reliable and useful.

Agreed. Let's start with this.

In a directory named DIR we have the following files:

dir1/ (empty dir)
dir2/ (no access dir - owner/group root)
dir3/ (no access dir - owner/group root)
file (regular file)
file2 (regular file)
parentlnk (symlink to a parent directory of DIR)
dirlink (symlink to dir1/)
link (symlink to file)
link2 (symlink to link)
brlink (broken symlink)

What's the most straightforward/natural way to count subdirs/files in this scenario?

| App | Result | Observation |
|---|---|---|
| Thunar | 2 subdirs / 7 files | Only accessible dirs are counted (dir2 and dir3 are ignored). Also, DIR itself is counted, which is quite counterintuitive. |
| Dolphin/Krusader | 5 subdirs / 5 files | dirlink is counted as a directory, even if it points to dir1, which was already counted. Also, parentlnk is counted as a directory as well, even if it points outside the scope of DIR. This is because, unlike Thunar, Dolphin uses stat(2) (instead of lstat(2)), so that all symlinks are dereferenced. |
| tree(1) | 6 subdirs / 5 files | Same as Dolphin/Krusader, but it adds DIR to the directories count. |
| Space-fm | 2 subdirs / 9 files | 2 dirs: DIR and dir1. dir2 and dir3 counted as files (?) |
| Double Commander | 4 subdirs / 2 files | 4 dirs: DIR plus dir1-dir3. 2 files: file and file2. Symlinks totally ignored. |
| Qtfm | 3 subdirs / 3 files | 3 subdirs: dir1-dir3. 3 files (what the heck!). Symlinks seem to be totally ignored. |
| Pantheon-files | 3 subdirs / 7 files | |
| Clifm | 3 subdirs / 7 files | Like Thunar, we use lstat, but unlike Thunar, we do not count DIR (though, like Dolphin, we do count dir2 and dir3). |

Totally mind-blown: 7 different results (across 9 tested implementations) to count only 10 files! There's not even agreement on the total number of files!

When pp counts 347 files and you then copy them somewhere, you probably want to find 347 files when copying is completed.

Sounds quite reasonable. But let's consider this point later.

To make the test yourself, please use the latest version (fixed a little bug in the new files counter).

leo-arch commented 3 months ago

It would be easy now to integrate the summing up of the file sizes into the same stat()-based algorithm

I will, provided our size algorithm behaves as expected (i.e., like du), and once we get the right implementation for the files counter.

muellerto commented 3 months ago

Totally mind-blown: 7 different results (across 9 tested implementations) to count only 10 files! There's not even agreement on the total number of files!

That's what I mean :) And they all say they do it right.

On Monday I can provide data from the Explorer, the Windows (DOS) dir and tree commands, MultiCommander, and WinSCP. And we still have eza and erdtree (also on Linux). eza at least has a --total-size parameter for determining the total size of a subdirectory. Then you have not 7 but 13 different implementations :)

I would generally exclude this parentlnk. If you find such a thing you should report it; it's a structural error in the file system, and it's not clear what purpose it serves. There's a good chance the user doesn't even know about it. Note: this situation could also exist entirely under DIR: a subdirectory containing a symlink to one of its parents.

leo-arch commented 3 months ago

On Monday I can provide data from the Explorer, the Windows (DOS) dir and tree commands, MultiCommander, and WinSCP

That would be great.

And we still have eza and erdtree

eza does not provide a files counter. Here's erdtree's output:

3 directories, 2 files, 5 links

The result is basically the same as that of Pantheon-files/Clifm, except that it uses a separate field for counting symlinks. I think this is a nice approach.

leo-arch commented 3 months ago

Second implementation: show total, dirs, files, and links. Example:

Items:     10 (3 directories, 2 files, 5 links)

Total agreement with erd --hidden --no-ignore.

muellerto commented 3 months ago

Found a difference; see the lines with the red dots:

[screenshot]

The above output is from erd without any options on the same directory. erd counts one directory and eight files fewer. And I don't even have symlinked directories here; they are all real.

The sizes are also interesting: erd counts more bytes in fewer files. How come? Ah, wait: erd counts 4k blocks; the overall sum is divisible by 4096.

leo-arch commented 3 months ago

The above output is from erd without any options on the same directory

This is expected. According to the image, clifm is counting 1 extra directory (and those extra 8 files, I bet, are coming from this directory). The thing is that erd, without any options, does not show/count hidden files and respects .gitignore files (while clifm does count hidden files and does not respect .gitignore files).

This is why I said: Total agreement with 'erd --hidden --no-ignore'.

A good question is whether we should count hidden files even if the user has ShowHiddenFiles set to false. Should pp, when counting files in directories, respect this option? I'm inclined to think that it shouldn't, but I'm open to considering options.

erd counts more bytes in fewer files. How come? Ah, wait: erd counts 4k blocks

Yes, by default erd calculates physical sizes (i.e., blocks consumed on disk) instead of logical (apparent) sizes. Run it with --disk-usage=logical to make it calculate apparent sizes instead.
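
The difference comes down to two fields of struct stat: st_size (the logical/apparent size) versus st_blocks (512-byte blocks actually allocated on disk). A minimal illustration (hypothetical code):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat s;
    if (argc < 2 || lstat(argv[1], &s) == -1)
        return 1;

    /* Logical (apparent) size: what the file claims to contain. */
    printf("apparent: %lld bytes\n", (long long)s.st_size);

    /* Physical size: 512-byte blocks actually allocated. Smaller for
     * sparse files; usually larger due to 4k block granularity. */
    printf("physical: %lld bytes\n", (long long)s.st_blocks * 512);
    return 0;
}
```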

muellerto commented 3 months ago

Yes! When I give erd these options the result is indeed equal. (This also explains a much bigger difference I had in another directory, pffff.) OK, looks good, and it's so fast. I'll check this tomorrow on my Windows machine, which has not only an SSD but also a real hard disk.

leo-arch commented 3 months ago

Btw, CLIFM_DIR_INFO isn't required anymore. Since 1.18.1 the new features (including MIME type info in p/pp) are now part of the regular build.

muellerto commented 3 months ago

I checked counting on Windows. I can say the following:

  1. A general problem is that it's not so easy to revoke the access rights from dir2 and dir3 after creating them, because I'm an Administrator (for reasons ...) and can always do what I want with them. So I had to create a second local user for that.
  2. I saw that one application I used to create parentlnk refused it with an error message (good!). Another application did it, so I indeed had this construction.
  3. Most applications all use the same algorithm, provided by a DLL belonging to the Explorer, so they all show the same results in the same dialog window.
  4. The links are not mentioned there. Symlinks are still a stepchild under Windows: the Explorer shows them but to this day does not allow creating them; you need third-party applications for that. The links are counted as files. I guess dirlink is counted as a directory.
  5. The built-in dir command of cmd.exe and the tree command of Windows both dereference parentlnk, which never terminates. They don't get any results.

Explorer, MultiCommander, the 7z file manager, WinSCP: 3 dirs / 5 files

Clifm shows the following:

[screenshot]

There seems to be a display problem (uninitialized variable?). I saw that this is related to the directories with no access. When the user has access to them it looks good.

leo-arch commented 3 months ago

Thanks for your tests @muellerto.

Explorer, MultiCommander, the 7z file manager, WinSCP: 3 dirs / 5 files

It's not clear what files these apps are including and what they're excluding. Anyway, it doesn't seem to be an intuitive way to count files at all.

There seems to be a display problem (uninitialized variable?)

Close, but not exactly. It seems that ConEmu (no issue with the Cygwin Terminal) does not like some escape codes. Not a big deal. Let me see how to fix it.

muellerto commented 3 months ago

There seems to be a display problem (uninitialized variable?)

Close, but not exactly. It seems that ConEmu (no issue with the Cygwin Terminal) does not like some escape codes. Not a big deal. Let me see how to fix it.

I have used mintty for a long time now, since ConEmu still has these color problems. mintty is very robust but allows only one window.

I filed an issue for ConEmu, but the developer says all the others are doing it wrong, ConEmu behaves as a real terminal must, and he will not change it.

Another candidate, and a very serious one because of its prevalence, is wt (the Windows Terminal; this is not the black cmd.exe window, wt must be installed separately). wt also renders your color schemes very well (but its rounded corners give me painful eye cancer ...)

leo-arch commented 3 months ago

I have used mintty for a long time now

Don't worry, I'll fix it. EDIT: Done.

I filed an issue for ConEmu, but the developer says all the others are doing it wrong

In that case they shouldn't advertise their terminal as xterm (via TERM), because then I have no way to tell whether we're running on a terminal doing things the "right" way (ConEmu) or the "wrong" way.

round corners

It's all the rage now, the latest and greatest. So sexy, so useless.

muellerto commented 3 months ago

I felt like a hero and did a pp on /c/Windows:

[screenshot]

This took about 7 minutes on an SSD, and du indeed produced a stack dump very soon.

The Explorer says the overall size is 32.4 GB (34873527413 bytes). This is less than I thought; I have much bigger directories (but with fewer files). The Explorer also says I have 189130 files and 51292 directories there, so the sum would be 240422, which means clifm counted 165 more.

erd --hidden --no-ignore says 34604781724 bytes in 50895 directories and 189128 files, 1 link. This means 604 fewer directories than clifm but 372 more files. I'm not sure how to interpret this.

Is the red ! intended?

As I said, I wouldn't fret too long about the differences. What's important is that clifm itself produces comparable results when the user runs it today and again tomorrow. The fact that another application counts differently is just something the user must live with.

leo-arch commented 3 months ago

This took about 7 minutes on an SSD

Kinda expected.

du indeed produced a stack dump very soon

This, indeed, wasn't expected. It would be nice to know where/why it crashed, to report the issue upstream. But it's unrelated to us.

I wouldn't fret too long about the differences

Me neither. However, I do keep an eye on them, because these differences might point to something we're not doing, or doing wrong. For the time being, I think our code is quite simple and transparent, that is, reliable. However, there might still be some edge cases we're not handling appropriately. We need more testing to spot exactly where these differences are coming from. I'll test big directories and, if a difference is found, continue with subdirectories, 'til the source of the difference is found. Then, if the root of the difference points to something we're not handling well, we need to correct it.

Is the red ! intended?

Totally. It means that, while walking down the directory tree, we found some unreadable (mostly because of permissions) subdirectories, and that, therefore, the count might not be accurate. This might be (at least one) source of difference: some implementations do not count these unreadable subdirs at all, others count them as files. We count them as what they are: directories.

leo-arch commented 3 months ago

Examined my entire hard disk, and found a difference (the only one I could find):

When it comes to a symlink which cannot be read (both ls(1) and stat(1) fail with Permission denied), for example /proc/1/cwd, erd counts the file as a file, while clifm counts it as a link (which I think is more natural).

leo-arch commented 3 months ago

Found another (I'd say pretty reliable) way of testing our results: find(1).

| Cmd | Description |
|---|---|
| find DIR -type d \| wc -l | Count directories in DIR |
| find DIR -type b,c,f,p,s \| wc -l | Count files (block, character, regular, FIFO/pipe, socket) in DIR |
| find DIR -type l \| wc -l | Count symlinks in DIR |

Running this on both Linux and Windows, and comparing the results to those provided by clifm, gives incredibly close results: if there's a difference, it comes down to one or two files most of the time.

NOTE: Bear in mind that find counts DIR, while clifm doesn't.
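
And if you want yet another independent cross-check in C, POSIX nftw(3) can do the same walk in a few lines; a sketch (FTW_PHYS means symlinks are not followed):

```c
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static long dirs, files, links;

/* Called once per entry; typeflag already encodes the file type. */
static int count(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    (void)path; (void)sb; (void)ftwbuf;
    if (typeflag == FTW_D || typeflag == FTW_DNR)
        dirs++;  /* FTW_DNR: unreadable directory, count it anyway */
    else if (typeflag == FTW_SL || typeflag == FTW_SLN)
        links++;
    else
        files++; /* FTW_F, or FTW_NS (stat failed): count as a file */
    return 0;    /* keep walking */
}

int main(int argc, char **argv)
{
    if (argc < 2 || nftw(argv[1], count, 64, FTW_PHYS) == -1)
        return 1;
    /* Note: like find, this counts DIR itself. */
    printf("%ld directories, %ld files, %ld links\n", dirs, files, links);
    return 0;
}
```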

muellerto commented 3 months ago

Checked that du stack dump. The error log is: du.txt

The first four lines are: du: das Verzeichnis 'Windows/CSC' kann nicht gelesen werden: Permission denied. This means: du: the directory 'Windows/CSC' cannot be read: Permission denied.

There seem to be directories which even an Administrator cannot read because of missing permissions. It's not so easy under Windows. When I run pp directly on this /c/Windows/CSC it doesn't crash; it just gives !0 bytes in !0 files. That's why I think the crash has nothing to do with this directory.

leo-arch commented 3 months ago

So, the crash happens with du? Clifm itself is not crashing, is it?

muellerto commented 3 months ago

Yes, it's du alone. I mean the MinGW port of du I have here. It normally works very well; this is the first time I've seen such a crash.


Your find alternatives lead to the following results on my /c/Windows:

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 51302 | 51502 | +200 |
| find DIR -type b,c,f,p,s \| wc -l | 189138 | 188937 | -201 |
| find DIR -type l \| wc -l | 1 | 1 | 0 |

Ha, can it be that you count 200 files as directories, or that find counts the 200 directories plus the one link as files???

leo-arch commented 3 months ago

Your assumption is quite possible. When we find a directory/file on which lstat(2) failed, we count it as a file (despite the fact that it may actually be a directory; we don't know, because lstat(2) failed). If there are 200 files on which lstat failed, this might explain the difference.

The whole Windows permissions thing is like another world: it just doesn't respect Unix-style permissions. I see system directories under /c colored as unreadable dirs (because you're not the owner, you are not in the directory's group, and the file does not grant permissions to others), and nonetheless you can actually read them and list their contents (while this doesn't hold for other directories, for which you don't have permissions either).

Can you replicate this difference with other directories?

EDIT: I'm not running clifm as admin (I don't even know how to do this on Windows).

muellerto commented 3 months ago

Can you replicate this difference with other directories?

One of my big working copies (including some symlinks) is as follows:

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 4542 | 4541 | -1 |
| find DIR -type b,c,f,p,s \| wc -l | 68071 | 69071 | 0 |
| find DIR -type l \| wc -l | 12 | 12 | 0 |

Another one:

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 5363 | 5362 | -1 |
| find DIR -type b,c,f,p,s \| wc -l | 75170 | 75170 | 0 |
| find DIR -type l \| wc -l | 8 | 8 | 0 |

/c/Program Files (du crashes again, find reports several "Permission denied"):

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 8340 | 8357 | +17 |
| find DIR -type b,c,f,p,s \| wc -l | 73004 | 72986 | -18 |
| find DIR -type l \| wc -l | 15 | 15 | 0 |

and this is almost an entire partition (350GB, no errors at all):

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 59515 | 59514 | -1 |
| find DIR -type b,c,f,p,s \| wc -l | 312880 | 312880 | 0 |
| find DIR -type l \| wc -l | 3 | 3 | 0 |

We see two things:

  1. There's this mysterious -1 difference in the directories count.
  2. There's probably something not good when it comes to errors because of missing permissions.

Note: I did all tests starting from the same point - I always used the direct parent of a directory and ran find or pp on the same ELN.

leo-arch commented 3 months ago

We're close, very close.

this mysterious -1 difference in the directories count

Absolutely. Not a problem.

there's probably something not good when it comes to errors because of missing permissions

I need to reproduce this issue, but I'm not sure how. On my Windows machine pp refuses to count files on system directories because I have no permissions (running with a regular account). I need to run as admin (don't know how yet) and reproduce the thing.

muellerto commented 3 months ago

The access problems happen on my machine also on Programs, Programs (x86) and ProgramData. The Windows directory is just very special because it also contains a Temp directory and some open log files.

In my case it's complicated. My machine is in a Windows Domain, and I log on as a Domain user. This is not a local user on my machine but one which is known and authenticated by the Domain Controller. I can also use this user to log on to all other machines in the Domain (about 100). And then there's a rule, I don't know where, that all Domain users are automatically in the local Administrators group on every machine. So in the end I am not the local Administrator user, but I am in that local Administrators group.

You are probably a local user on your machine. Perhaps you can also give yourself this group membership (Administrators). I don't know whether this is possible.

There's also another topic to think about: you may need to run everything elevated. This means your shell (cmd.exe or whatever) runs in the context of an Administrator (similar to sudo). In the Start menu each application has a right-click context menu, with a menu item called "Run as Administrator"; see here for details. When your shell runs as Administrator, all child processes run as Administrator too. Try this; you have many more rights then. (Don't think about security concepts and such things ...)

I guess the problematic directories are ones where the owner is the local Administrator user, who revoked some access privileges for others. Such directories are created by the installation process when Windows is put onto the disk.

leo-arch commented 3 months ago

The access problems happen on my machine also on Programs, Programs (x86) and ProgramData

Couldn't replicate this. I've tried with these, but also with big directories like Windows and Users: nothing.

Here's what I've found so far:

  1. pp's results match exactly those of find most of the time (running with elevated permissions or not makes no difference). When a difference pops up, it is somehow related to temporary directories, whose content seems to be process-dependent, that is, it changes from process to process, so find outputs different results than pp. However, I'm still not 100% sure (I'd need to know the exact content of these directories while running find, on the one side, and while running pp, on the other).
  2. Since permissions on Windows do not follow the Unix style, I've disabled, for Windows only, the permissions check done by pp (which otherwise refuses to count files/sizes in directories to which we actually have access).

leo-arch commented 3 months ago

Your find alternatives lead to the following results on my /c/Windows:

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 51302 | 51502 | +200 |
| find DIR -type b,c,f,p,s \| wc -l | 189138 | 188937 | -201 |
| find DIR -type l \| wc -l | 1 | 1 | 0 |

Ha, can it be that you count 200 files as directories, or that find counts the 200 directories plus the one link as files???

Considering that the total number of files is the same for both find and clifm, the question is: why is clifm taking some files as directories, while they're taken by find as files?

I've been thinking about this a bit more. Clifm reports exactly 200 directories more, and 200 files fewer, than find (the extra file counted by find might be due to clifm not counting DIR). Now, the only way for clifm to count a file as a directory is:

  1. lstat(2) succeeded on the file.
  2. lstat(2), via the S_ISDIR() macro, identifies the file as a directory (no matter whether we have read permissions for it or not).

Since this procedure relies entirely on lstat, which is a syscall, we can be 99% sure that the file is actually a directory.

You can take a look at the code yourself: https://github.com/leo-arch/clifm/blob/4d1ee2aebc724b781d9220307815b5d86ed7ee11/src/properties.c#L1711

What's happening here, then? Though this is only theoretical speculation (for I cannot reproduce this state), this is what I think: find is counting directories for which we have no read permissions (opendir(3) fails on the file) as files, at least under some very specific circumstances (which I'm not aware of).

Alternatives I've considered:

  1. lstat(2) is lying: maybe on Windows, but it's not really likely.
  2. Clifm is counting some directories twice: in this case, the files counter shouldn't be affected at all, but it is.

Maybe I should take a look at find's source code to see what's actually going on.

muellerto commented 3 months ago

I have yet another guess: empty directories. Maybe find counts empty directories as simple files. I don't know. But 200 empty directories under /c/Windows/ ??? Perhaps you can find out how many directories clifm sees that have no children except . and ..

And then there could also be another thing: a directory entry could be counted twice, i.e., both as a file and as a directory. Perhaps find counts some very special directories additionally as files, so we get 200 more files.

This could probably be analyzed using the find output (without wc) and debug output from clifm: sort both alphabetically and you have something to compare, where you can see some more directories here and some more files there.

The problem is: I'm away until next Tuesday. I don't even have a laptop with me, because where I'm going there's no network at all. So you must fight on your own :))

leo-arch commented 3 months ago

Don't worry. Have a good time!

leo-arch commented 3 months ago

New/faster files counter algorithm for pp

It's now 4.4 times faster than the old one! If the previous one was blazing fast, as you said, now you won't even see it counting. It can count files in my entire SSD in 1.16 secs (previously it took 5.15 secs).

Why is it that fast? Because we don't need to run stat(2) at all, which, as a system call, is expensive.

Note: This new algorithm will be used only if the system populates the d_type field of the dirent struct with the appropriate value (this is platform-dependent). If not, the old algorithm will be used instead.

In case you ask: no, this does not change the way we count files; it's just faster, so whatever needs to be fixed, if anything, is still there.
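
Schematically, the trick is just this (a sketch, not the actual clifm code): on many filesystems readdir(3) already reports the entry's type in d_type, so lstat(2) is needed only as a fallback for DT_UNKNOWN.

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Count one directory level, calling lstat(2) only when the
 * filesystem does not fill in d_type. */
int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(dir);
    if (d == NULL)
        return 1;

    long dirs = 0, files = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        unsigned char type = DT_UNKNOWN;
#ifdef _DIRENT_HAVE_D_TYPE
        type = e->d_type; /* free: no system call involved */
#endif
        if (type == DT_UNKNOWN) { /* fallback: ask lstat(2) */
            char path[PATH_MAX];
            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
            struct stat s;
            if (lstat(path, &s) != -1 && S_ISDIR(s.st_mode))
                type = DT_DIR;
        }

        if (type == DT_DIR)
            dirs++;
        else
            files++;
    }
    closedir(d);

    printf("%ld dirs, %ld files\n", dirs, files);
    return 0;
}
```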

leo-arch commented 3 months ago

A bit more light on this issue

I noticed different file counts when running pp and find on my root directory: find always reports more files/dirs than pp. I traced the difference down to /proc. And here's the point:

  1. /proc is a virtual file system which, among other things, contains information (in the form of a bunch of files, approximately 250) about each running process (for instance, the directory /proc/1/ holds information about the process whose PID is 1 - usually your init system).
  2. When you run find ... | wc ..., at least two new processes are spawned, which means that at least two new directories with approximately 250 files each are created (and automatically removed once these processes finish). Of course, this also applies to erdtree and any other program used to count files.
  3. But this does not happen with pp, for we're not spawning any new process here.
  4. As a result, the count of files is always different, but not wrong. Indeed, both are correct (though I tend to think that find's report is a bit biased, since we, as users, don't care about files created/destroyed on the fly, i.e., files that are just not there before and after running find: from the user's perspective, these files were never there, and counting them therefore doesn't feel quite right).

This applies to Linux (and some Unix-variants). Not sure about Windows, but it's quite possible (at least something similar).

Does this explain the whole thing? No. But it's useful to bear in mind: expecting identical results is a false expectation; indeed, the truth is exactly the opposite (at least in this case, of course).

muellerto commented 3 months ago

It's now 4.4 times faster than the old one! If the previous one was blazing fast, as you said, now you won't even see it counting. It can count files in my entire SSD in 1.16 secs (previously it took 5.15 secs).

Why is it that fast? Because we don't need to run stat(2) at all, which, as a system call, is expensive.

Be careful with the cache. When you want ordinary, comparable results, you must run your algorithm without running du before. When du fails on Windows, the following pp needs much more time; it's not blazing fast then.

muellerto commented 3 months ago

Windows doesn't have virtual file systems at all, especially no /proc. MinGW doesn't provide one either. (I don't know about Cygwin.) But Windows has a lot of temporary files, most of them in the user profiles under /c/Users and in the global Temp directory under /c/Windows.

Windows also has (this is a difference from Linux) a lot of open files: all currently executing files (.exe, .dll, .cpl ...) are exclusively opened while the program is running. This means some parts of the OS are always open.

And Windows has a lot of strange permissions and overlapping policies. You often can't explain why something is not accessible; you then have to start real research into it.

The MinGW and Cygwin runtime libraries do their best to implement proper POSIX functions. BTW: Microsoft has its own, possibly different, implementations of all these functions, which you would get if you used the Microsoft compiler. We could try this. I have all that, I mean Microsoft Visual Studio 2022; I work with it all day. The Community edition is free.

So, now I'm away. Happy Easter.

leo-arch commented 3 months ago

Be careful with the cache... When du fails on Windows, the following pp needs much more time; it's not blazing fast then.

True. Even if I skip the file size code (du is not executed at all), the first time we count files takes much more than subsequent calls. However, this is not related to our code, but to some OS/filesystem-specific performance operation, which, as such, is not under our control. In fact, the same thing happens to find and other tools.

leo-arch commented 3 months ago

Windows doesn't have virtual file systems at all, especially no /proc.

However,

Windows also has (this is a difference from Linux) a lot of open files: all currently executing files (.exe, .dll, .cpl ...) are exclusively opened while the program is running.

This means, as in the /proc example, that the number of files depends on the number of running processes: running pp and find (or any other file-counting tool) will not produce the same results, simply because the filesystem has been changed on the fly by the kernel based on the currently running processes.

And Windows has a lot of strange permissions and overlapping policies. You often can't explain why something is not accessible

The good old MS style.

Microsoft has its own, possibly different, implementations of all these functions, which you would get if you used the Microsoft compiler. We could try this.

Ok.

muellerto commented 3 months ago

I checked your dir_info() function now. I added a printf() for the case that opendir() fails. And that's interesting. I get the following output:

```
cant open Windows/appcompat/appraiser
cant open Windows/appcompat/Backup
cant open Windows/appcompat/Programs
cant open Windows/CSC
cant open Windows/diagerr.xml
cant open Windows/diagwrn.xml
cant open Windows/LiveKernelReports
cant open Windows/Microsoft.NET/Framework/v3.0/Windows Communication Foundation/SMSvcHost.exe.config
cant open Windows/Microsoft.NET/Framework64/v3.0/Windows Communication Foundation/SMSvcHost.exe.config
cant open Windows/ModemLogs
cant open Windows/Prefetch
cant open Windows/Provisioning/Autopilot
cant open Windows/Resources/Themes/aero/VSCache
cant open Windows/security/audit
cant open Windows/security/cap
cant open Windows/security/database/secedit.sdb
cant open Windows/ServiceProfiles/LocalService
cant open Windows/ServiceProfiles/MsDtsServer160
cant open Windows/ServiceProfiles/MSSQL$IIP
cant open Windows/ServiceProfiles/NetworkService
cant open Windows/ServiceProfiles/SQLAgent$IIP
cant open Windows/ServiceProfiles/SQLTELEMETRY$IIP
cant open Windows/ServiceProfiles/SSISTELEMETRY160
cant open Windows/ServiceState
cant open Windows/System32/Com/dmp
cant open Windows/System32/config
cant open Windows/System32/Configuration
cant open Windows/System32/ctc.json
cant open Windows/System32/drivers/DriverData
cant open Windows/System32/DriverState
cant open Windows/System32/httpproxy.json
cant open Windows/System32/ias
[...]
```

Here you see a lot of ordinary directories with access problems because of missing permissions.

But what you also see is that we have XML, JSON, and config files here. These entries are definitely not directories (and not symlinks either). But if dir_info() tries to analyze them in a recursive call, it means without any doubt that the preceding S_ISDIR macro (at least here in the MinGW implementation) made a wrong decision: under some conditions a file seems to be classified as a directory.

When I try to open such a file using cat or bat, I can't. These files are probably opened exclusively by another process; I don't assume missing permissions here.

Interestingly, the entire list doesn't contain a single EXE or DLL file, so executables that are open because they are currently running are not classified as directories.

The lstat() function never fails.

Update: here they check S_ISREG before S_ISDIR; perhaps this indeed helps to find out what something really is.

leo-arch commented 3 months ago

Hi @muellerto, good to have you back!

The latest implementation of the dir_info() function uses two different algorithms to determine the file type:

  1. If the system populates the d_type field of the dirent struct (that is, provided _DIRENT_HAVE_D_TYPE is defined), we rely on this value, so that we don't need to call lstat(2), which is expensive.
  2. Else, we call lstat(2) and then check the S_ISDIR() macro.

here they check S_ISREG before S_ISDIR; perhaps this indeed helps to find out what something really is.

If these macros are, for some reason, returning wrong info (it is indeed possible, especially on second-hand implementations like Cygwin), we could give this a try.

At least on my Cygwin, case 1 applies, but I'm not sure about MSYS. So I added the regular file check before the directory check to both algorithms; see the sketch below. Please give it a try and let me know if it makes any difference.
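
For clarity, the reordering amounts to this (schematic, not the literal clifm code):

```c
#include <stdio.h>
#include <sys/stat.h>

/* File-type check with S_ISREG() first: if a broken st_mode carries
 * more than one type bit, the regular-file test wins, so plain files
 * can no longer be misclassified as directories. */
static char classify(mode_t m)
{
    if (S_ISREG(m)) return 'f'; /* checked first, on purpose */
    if (S_ISDIR(m)) return 'd';
    if (S_ISLNK(m)) return 'l';
    return 'o';                 /* block, char, FIFO, socket, ... */
}

int main(void)
{
    struct stat s;
    if (lstat(".", &s) == -1)
        return 1;
    printf("%c\n", classify(s.st_mode)); /* prints 'd' */
    return 0;
}
```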

muellerto commented 3 months ago

With the new implementation I get now the following results:

/c/Program Files (du crashes, find reports several "Permission denied"):

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 8343 | 8342 | -1 |
| find DIR -type b,c,f,p,s \| wc -l | 73010 | 73010 | 0 |
| find DIR -type l \| wc -l | 15 | 15 | 0 |

I guess that the -1 in the first line is the start directory itself.

/c/Windows (du crashes, find reports several "Permission denied"):

| Cmd | result | clifm pp | difference |
|---|---|---|---|
| find DIR -type d \| wc -l | 51362 | 51352 | -10 |
| find DIR -type b,c,f,p,s \| wc -l | 189492 | 189501 | +9 |
| find DIR -type l \| wc -l | 1 | 1 | 0 |

Here we have a few more differences, and I guess there's still a problem with misclassified files or directories, but the numbers are much smaller now. I'll check this. It already looks much, much better.

muellerto commented 3 months ago

It's as follows: find DIR -type d indeed lists 9 more entries (i.e., directories) than pp:

```
Windows/System32/Microsoft/Protect/Recovery/Recovery.dat.LOG1
Windows/System32/Microsoft/Protect/Recovery/Recovery.dat.LOG2
Windows/System32/Microsoft/Protect/Recovery/Recovery.dat{12871053-0fc3-11ec-b80a-7c50791b6f4c}.TM.blf
Windows/System32/Microsoft/Protect/Recovery/Recovery.dat{12871053-0fc3-11ec-b80a-7c50791b6f4c}.TMContainer00000000000000000001.regtrans-ms
Windows/System32/Microsoft/Protect/Recovery/Recovery.dat{12871053-0fc3-11ec-b80a-7c50791b6f4c}.TMContainer00000000000000000002.regtrans-ms
Windows/System32/restore/MachineGuid.txt
Windows/System32/SMI/Store/Machine/SCHEMA.DAT.LOG1
Windows/System32/SMI/Store/Machine/SCHEMA.DAT.LOG2
Windows/System32/SMI/Store/Machine/SCHEMA.DAT{a2332f24-cdbf-11ec-8680-002248483d79}.TM.blf
```

But these entries are indeed files, not directories. The error is in the find algorithm; pp is right.

I would say we can ignore the difference related to these system files. All other directories will probably be right now on Windows, too. Good!

leo-arch commented 3 months ago

The error is in the find algorithm, pp is right.

Awesome! I'd just like to know exactly why the DT_DIR/S_ISDIR() macros return true in some scenarios when they clearly shouldn't, because, in my experience, these trial-and-error solutions will sooner or later fail again under different, but somehow related, circumstances.

However, we clearly have a quite trustworthy implementation, and given all the differences we found in other implementations, that's saying a lot.

Thanks for your tests @muellerto!

muellerto commented 3 months ago

Awesome! I'd just like to know exactly why the DT_DIR/S_ISDIR() macros return true in some scenarios when they clearly shouldn't, because, in my experience, these trial-and-error solutions will sooner or later fail again under different, but somehow related, circumstances.

I'm not sure either. I guess that in some special cases multiple bits are indeed set in the st_mode member of the stat struct. Maybe it's just sloppiness, or they really don't get better information about some special entries from the file system, because of very strict access rights and inheritance and ownership and policies and whatever other crap. But these macros each check only one bit, and that's why the order of the checks becomes relevant for the result.

The Microsoft documentation, which is the common basis for all Windows runtime library implementations, says only:

st_mode: Bit mask for file-mode information. The _S_IFDIR bit is set if path specifies a directory; the _S_IFREG bit is set if path specifies an ordinary file or a device. User read/write bits are set according to the file's permission mode; user execute bits are set according to the filename extension.

That's all. They don't mention how many bits can be set. Many Linux man pages don't exclude multiple bits either.
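
To illustrate why the order of the checks matters: a POSIX-style macro compares the whole masked type field, so at most one type can ever match, while a single-bit test can match several types at once. A small demonstration (hypothetical code, using Linux's octal values):

```c
#include <stdio.h>
#include <sys/stat.h>

/* POSIX style: mask the whole file-type field, then compare.
 * At most one file type can ever match. */
#define IS_REG_MASKED(m) (((m) & S_IFMT) == S_IFREG)

/* Single-bit style: test one bit only. Several such tests can
 * succeed at once, so only the order of the checks decides. */
#define IS_REG_BIT(m) (((m) & S_IFREG) != 0)

int main(void)
{
    /* On Linux, S_IFLNK (0120000) == S_IFREG (0100000) | S_IFCHR
     * (0020000): a symlink passes a single-bit regular-file test. */
    mode_t m = S_IFLNK;
    printf("masked: %d, single-bit: %d\n", IS_REG_MASKED(m), IS_REG_BIT(m));
    /* Prints: masked: 0, single-bit: 1 */
    return 0;
}
```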

muellerto commented 3 months ago

What I found now is a remaining mismatch between the summed-up file sizes and the files counted. This comes from the two different methods: du versus the recursive lstat() calls.

You call du as du -s --apparent-size --block-size=1 -- DIR. This gives a result in bytes.

I now did the following: I summed up the st_size values myself during the recursive lstat() walk.

This also gives a result, very similar to the du result - but not the same; my sum is bigger. How can that be?

I then saw that du returns exactly my result when it is called with the -L parameter (dereference links).

But is this right? Does dir_info() dereference symbolic links when the link points onto a directory? Probably yes because the S_ISDIR check comes before S_ISLINK. This means we have currently under some conditions more files counted than summed up file sizes.