File type statistics don't add up = literally

Lithopsian commented 10 months ago

I just noticed this file type statistics window for a library directory on my system. The bold totals don't match the suffix sub-totals.

[Screenshot: Screenshot_2024-01-11_21-47-44]

I think the two most obvious mismatches are down to the patterns: "lib*.a" files should be Object file, but they are ending up under .a in Other; and "*.so.*" files should be Shared object, but they are ending up under Other (they have all sorts of numeric suffixes after the .so. bit).
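To illustrate the mismatch, here is a minimal sketch (not QDirStat code). It assumes the patterns have the form `lib*.a` and `*.so.*`; the file names and the `naive_suffix` helper are made up for illustration. A wildcard pattern recognises a versioned shared object, while taking the text after the last dot yields only a numeric "suffix":

```python
from fnmatch import fnmatchcase


def naive_suffix(name):
    """Return the text from the last dot onward, or '' if there is no usable dot.

    Hypothetical helper: it only mimics a suffix-only view of a file name.
    """
    dot = name.rfind(".")
    return name[dot:] if dot > 0 else ""


# Made-up file names of the kind found in a library directory.
examples = ["libfoo.so.1.2.3", "libbar.a", "libbaz.so"]

for name in examples:
    print(
        name,
        "suffix:", naive_suffix(name),
        "matches *.so.*:", fnmatchcase(name, "*.so.*"),
        "matches lib*.a:", fnmatchcase(name, "lib*.a"),
    )
```

A table keyed only on the last suffix would file the first name under ".3", even though the wildcard pattern identifies it as a shared object.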

shundhammer commented 10 months ago

That's probably limited to libraries because of the much more complex rules there. I don't think that applies to other categories as well. There might also be some overlap in the rules between different categories that may have crept in.

A check whether this is any different from previous versions (stable 1.8.1, 1.8, 1.7) is probably also worthwhile. A first check here showed that it behaves no differently in those older versions, but I might be wrong.

We now also have the "Find Files" function, which also shows a count at the bottom of the results list (albeit limited to a maximum of 1000), and there is the plain find command line tool.
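For an independent count to compare against the statistics window, something along these lines could be used. This is only a sketch in Python, not part of QDirStat; the `count_matching` name is made up, and it matches on the file name alone, much like `find -name` would:

```python
import fnmatch
import os


def count_matching(root, pattern):
    """Count directory entries below root whose file name matches a glob pattern.

    Only non-directory entries reported by os.walk() are counted; the match
    is case-sensitive and applies to the base name, not the full path.
    """
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        count += sum(1 for name in filenames if fnmatch.fnmatchcase(name, pattern))
    return count
```

Calling `count_matching("/usr/lib", "*.so.*")` would then give a number to hold against the Shared object total, with no cap of 1000 results.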

shundhammer commented 10 months ago

AFAICS it only uses the suffix rules for that table. I don't exactly recall if that was intentional, but it might very well be; it's been a long while.

[Screenshot: suffix-only]

That whole file type statistics was something that more or less a single user wanted back then (and he kept nagging, and I finally gave in; not sure if that was a good idea). If you go back in the issues history you will see that I had always said that it's pretty pointless since on Linux the rules are much more complex than on Windows, where that whole idea originally comes from (WinDirStat), and I always said that it comes with a lot of caveats.

Some of that could be papered over with more complex regexps or checking permissions, but other aspects are and will always be a bit inconsistent. This isn't Windows where an .exe suffix clearly indicates an executable. There are tons of files with really creative suffixes, even more without any suffix whatsoever, and sometimes even contradictory ones.

This is also one reason why in some areas I kept the MIME categories quite broad; just "Libraries" (in the broadest sense), not subdividing them even further like you obviously did on your system. It's just too easy to get contradicting rules if you don't pay very close attention.

This is not an exact science, it's more rules of thumb.

Lithopsian commented 10 months ago

Older versions look the same. I can go all the way back to 1.6 at the weekend if you think that would be helpful.

The current categorizer doesn't report regexp matches like "lib*.a" as having a suffix, even when they have one. Matches like "*.so.*" are, even more obviously, not reported as having a suffix. The dialog does derive a suffix for all its matches even when the categorizer doesn't report one, but then it starts lumping them all together, and I think that's where it goes wrong. For example, every file that it finds with a suffix of .3 gets forced into the Other category even if the categorizer found an actual category for it. I think it should only lump suffixes together within each category, not into a single bucket; then everything would add up. Each category might end up with an Other grouping (or a separate grouping for no category?), either for junk suffixes or for suffixes not reported by the categorizer, and the percentage reported for each suffix group within a category would match the category total.
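A rough sketch of the per-category grouping I have in mind, in Python rather than the actual C++ code. The `tally` function, the "(other)" label and the sample data are all made up for illustration; the only point is that sub-totals are keyed by (category, suffix) instead of by suffix alone:

```python
from collections import defaultdict


def tally(files):
    """Sum sizes per category and per suffix group within each category.

    files: iterable of (category, suffix, size) tuples. suffix may be None
    when the categorizer matched a file without reporting a suffix.
    """
    totals = defaultdict(int)
    by_suffix = defaultdict(lambda: defaultdict(int))
    for category, suffix, size in files:
        totals[category] += size
        # Files without a reported suffix stay inside their own category.
        by_suffix[category][suffix or "(other)"] += size
    return totals, by_suffix


# Invented sample data: the same ".3" suffix occurs in two categories.
files = [
    ("Libraries", ".so", 400),
    ("Libraries", None, 250),  # e.g. matched by a wildcard pattern
    ("Documents", ".3", 30),
    ("Other", ".3", 20),
]

totals, by_suffix = tally(files)
for category, groups in by_suffix.items():
    # Every category's suffix groups add up to the category total.
    assert sum(groups.values()) == totals[category]
```

With this layout the two ".3" entries are never merged, so a file that has a real category cannot be pulled into Other by its suffix.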

The categorizer itself could also be more intelligent about matching complex wildcards that include a suffix. For example, "moc_*.cpp" should be matched before "*.cpp". Currently it never gets matched at all, and all the Qt-generated files end up in the source category instead of the generated-files category. Regexp patterns that include a suffix could be stored in a multi-hash keyed by that suffix, which would dramatically reduce the number of full regular expression tests that have to be done, because they are so much slower than the map lookups.
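Roughly what I mean by the multi-hash, sketched in Python instead of Qt containers. The class and method names are invented, and the category names are placeholders; this is an assumption about how it could work, not how the categorizer works today:

```python
import re
from collections import defaultdict
from fnmatch import translate


class SuffixIndexedMatcher:
    """Look up a category by suffix, testing wildcard patterns for that suffix first."""

    def __init__(self):
        self._wildcards = defaultdict(list)  # suffix -> [(compiled regex, category)]
        self._plain = {}                     # suffix -> category

    def add_suffix(self, suffix, category):
        self._plain[suffix] = category

    def add_wildcard(self, pattern, suffix, category):
        # translate() turns a glob such as "moc_*.cpp" into a regex string.
        self._wildcards[suffix].append((re.compile(translate(pattern)), category))

    def category(self, name):
        dot = name.rfind(".")
        suffix = name[dot:] if dot > 0 else ""
        # Only the few wildcard patterns sharing this suffix are tested.
        for regex, category in self._wildcards.get(suffix, ()):
            if regex.match(name):
                return category
        # Fall back to the plain suffix map; None means uncategorized.
        return self._plain.get(suffix)


matcher = SuffixIndexedMatcher()
matcher.add_suffix(".cpp", "source")
matcher.add_wildcard("moc_*.cpp", ".cpp", "generated")

print(matcher.category("moc_widget.cpp"))
print(matcher.category("widget.cpp"))
```

Patterns without a fixed trailing suffix, such as "*.so.*", would not fit this index and would still need a separate list that is tested in full.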

shundhammer commented 10 months ago

The more complex patterns get precedence (IIRC), so it's perfectly normal that anything that matches any of those is no longer put into any of the suffix categories. That is expected and intentional.

There is also the question if the "Locate Files" window could still reliably locate them all; if not, that would only make the problem move to a different place.

Also, see the extensive discussions about "cruft" somewhere in the issues discussing this file type statistics thing. IIRC there is also some debug logging in the code that is just commented or #ifdefed out that can show all the stuff that is also found, but that are not real suffixes; just weird filenames that some developer thought up, and hey, why not use a period or half a dozen in a filename if it strikes my fancy?

Let's not overdo this whole thing. Its usefulness is very limited to begin with, and it already created more problems than it's worth. The MIME categories are useful for colorizing the treemap, to get a visual impression about dominating file types; but the numeric file type statistics are useful only in certain cases, and as soon as more complex expressions are involved, it pretty much falls apart.

I also don't want to add an "other" section (that could not be used for "Locate Files") in each category. The suffixes we can see are the level of detail that is viable and useful. Yes, there may be more stuff that is not shown. That's life.

If anybody really wants to get more detailed matches, there is now the "Find Files" function.

shundhammer commented 10 months ago

And BTW no, I don't think moving backwards in time beyond, say, V1.7 would be useful; there weren't many changes in that whole area for a long time.

Which also shows that this is very likely not used very much to begin with.

shundhammer commented 10 months ago

shundhammer / qdirstat

File type statistics don't add up = literally #241