shundhammer / qdirstat

QDirStat - Qt-based directory statistics (KDirStat without any KDE - from the original KDirStat author)
GNU General Public License v2.0

New Feature: Showing Dominating Tree Items in Bold Font #210

Closed shundhammer closed 1 year ago

shundhammer commented 1 year ago

Executive Summary

QDirStat now shows items in the directory tree in bold font if they are clearly dominating that directory level.

If you don't like that, you can disable it in the "General" page of the config dialog.

QDirStat-main-startup

Notice the items in bold font: They are consuming considerably more disk space than the others.

Details

The 80:20 Rule

When you need to clean up a directory to make space, the 80:20 rule applies, as in so many aspects of computing: Roughly 80% of the disk space is consumed by just 20% of all directories or files.

How to Find Them

Individual Large Files

Use QDirStat and look at the treemap (the graphics at the bottom) to find individual large files. Are there any large blobs? Click on each one to find out what it is; QDirStat will locate it in the tree view at the upper part of its main window, and it will show you many details about it in the panel on the right.

QDirStat-big-blob

Gotcha: We have a big blob here. But it's not something to delete; it's the Git data for QDirStat.

This way, you can even find large files hidden deep in a directory hierarchy; that ISO that you downloaded some time ago, the virtual disks of the virtual machines that you installed, the video download that was aborted and never cleaned up. If you decide to get rid of them, that's only two or three mouse clicks away in QDirStat.

Subtrees

For directory trees that consume a lot of disk space scattered over lots of smaller files, that's not quite so easy: You have to use the directory tree in the upper part of the QDirStat main window and drill down. Look at the absolute sizes in the "Size" column, at the percent column, or at the percent bar.

QDirStat-main-startup-non-bold

Keep opening branches until you find what you are looking for.

What is Relevant?

QDirStat is a tool; it can show you what you have on your disk, but it cannot decide for you what is or is not important. You need to make that decision.

But it can support you by showing you which items dominate a directory tree and are worth looking at first. When you look at the previous screenshot, you can see that it already sorts by subtree size, so the largest directories are listed first.

The percentage bars are an additional visual cue: For most users, that is easier to process mentally than pure numbers (no matter if it's absolute sizes or percent). Longer bars mean more disk usage.

And now it also shows dominant items in bold font, drawing more attention than the others: Those are the ones that you might want to look at first.

QDirStat-main-startup

The .git, src and screenshots subdirectories consume almost all of the disk space here, dwarfing everything else.

When you open any of those directories, the process continues on the next deeper directory levels:

QDirStat-main-src-files

The qdirstat binary is the largest item here by far, followed by the generated qrcicons.cpp file built from a lot of .png files; and surprisingly, the Makefile (generated by qmake from the .pro file) is also quite large.

Notice that this does not at all mean that any of those are candidates to delete; it only means that you should have a close look at them to make a decision, even if that decision is "yes, I need them".

Inconclusive Results

In many scenarios, there is a small number of items that are so large that trying to save disk space with the other ones is pointless; it's the large ones that dominate that directory level.

And then there are other scenarios where the file or directory sizes are a lot more evenly distributed; some may be a bit larger than others, but not by that much. That happens, and it's normal. In that case, QDirStat will not display any of them in bold font. Like in this example:

QDirStat-main-src-obj

Lots of similar sized files. None of them is really dominant in this directory.

Sort Order Matters

QDirStat will only show dominating items in bold in the normal sort order: By percent or size descending, i.e. the largest items first.

If you click on any other column header to sort by that column, or if you invert the sort order to ascending by percent or size, the bold font will go away. That is intentional, both for technical reasons and for usability.

You might be interested in the latest modification time (finding out what is newest or oldest), or switch to a different column layout with the L2 / L3 buttons and sort by the number of files or subdirectories etc.; in that case, your focus is somewhere else, not on file size. The file size should then not get in the way by still displaying the dominant files in bold.

Is it Perfect and 100% Reliable?

No, of course not. Nothing ever is. It works reasonably well, though.

Even when you try to decide as a human which items are dominant in any given directory and which ones are not, there are always fringe cases. Should those next two directories with 2.5% of the overall size each also be added to the dominant files? Or should that 7% directory not be there since it is already dwarfed by the 75% first one? It's not a clear-cut thing.

The Algorithm

At the time of this writing, QDirStat looks at the 30 largest items of a directory; any further items are simply ignored. It picks the median percent value of those items, and everything at least 5 times as large as that median is considered dominant, with a minimum threshold of 3% and a maximum of 70%. I.e. anything below 3% is never dominant, and everything from 70% and up always is.
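The rule described above can be sketched in a few lines of Python. This is only an illustration of the description in this issue, not QDirStat's actual C++ code; the function names, the parameter names and the choice of `>=` for the comparison are assumptions.

```python
def dominance_threshold(percents, max_items=30, factor=5.0,
                        min_percent=3.0, max_percent=70.0):
    """Return the percent value at or above which an item counts as dominant.

    `percents` must already be sorted descending (largest item first),
    just like the tree view in its default sort order.
    """
    top = percents[:max_items]      # ignore everything beyond the largest 30
    if not top:
        return max_percent
    median = top[len(top) // 2]     # middle item; may be one position off
    threshold = median * factor
    # Clamp: nothing below 3% is dominant, everything from 70% up is
    return max(min_percent, min(threshold, max_percent))


def dominant_items(percents):
    """Return the percent values that would be shown in bold (hypothetical helper)."""
    threshold = dominance_threshold(percents)
    return [p for p in percents if p >= threshold]


# Three large items and a long tail of small ones:
# median of the 11 items is 2.0, so the threshold is 10.0
print(dominant_items([45.0, 30.0, 12.0, 3.0, 2.5, 2.0, 1.5, 1.0, 1.0, 1.0, 1.0]))
# [45.0, 30.0, 12.0]
```

With ten items of 10% each, the threshold becomes 50% and nothing is dominant, which corresponds to the inconclusive case described earlier.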

Those values may be tweaked a bit in the future, and when they stabilize, they might even become configurable (manually in the config file, not in a GUI config dialog). But not right now, because it's really hard to override config files in a later version when the defaults turn out to be not very good.

Getting Rid of It

If you don't find this feature helpful, it's easy to disable it:

In the "File" menu, select "Configure QDirStat", then select the "General" page. Uncheck "Use bold font for dominant tree items" and restart the program.

QDirStat-config-general

shundhammer commented 1 year ago

Development History: First Attempts

Using Average and Standard Deviation

The first attempt was to calculate the average size and the standard deviation, and then to use a factor for how many standard deviations above the average an item needs to be to be considered dominant.

That didn't work too well: Trying to fine-tune that factor manually (between 1.5 and 3) showed that this would be needed for each directory individually; the same factor was way too small for some directories and way too large for others.
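A minimal sketch of that first approach, to make the problem visible. The function name is made up for this example, and using the population standard deviation (`pstdev`) is an assumption; the original experiment may have computed it differently.

```python
from statistics import mean, pstdev


def dominant_by_stddev(sizes, factor=2.0):
    """Items more than `factor` standard deviations above the average."""
    if len(sizes) < 2:
        return []
    threshold = mean(sizes) + factor * pstdev(sizes)
    return [s for s in sizes if s > threshold]


# One huge item: a factor of 2 finds it
print(dominant_by_stddev([900, 50, 20, 10, 10, 5, 5], factor=2.0))
# [900]

# Two large items of similar size: the same factor finds nothing,
# because the large items themselves inflate the standard deviation
print(dominant_by_stddev([400, 350, 50, 40, 30, 20, 10], factor=2.0))
# []
```

The second directory clearly has two items that a human would call dominant, but the same factor that works for the first directory is too strict there.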

Multiple Iterations

But a real hacker doesn't give up that easily. So there was the next refinement: Do that a number of times with different factors; including a new multiplication factor that would change between iterations, so for the first iteration items would need to be 1.5 or 2 standard deviations away from the average.

The dominant items that were found were then added to a list of dominant items and taken out of consideration for the next iteration; a new average and standard deviation were calculated, and the remaining items were checked against that new threshold. That was then repeated for a third time.

That also failed with the same problems: For some directories it was too generous to consider items as dominant, for others it was too strict.
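The iterative variant might have looked roughly like the following sketch. The three factors are hypothetical (only the starting range of 1.5 to 2 is mentioned above), and so is the function name.

```python
from statistics import mean, pstdev


def dominant_iterative(sizes, factors=(2.0, 2.5, 3.0)):
    """Collect dominant items over several passes with a changing factor."""
    remaining = sorted(sizes, reverse=True)
    dominant = []
    for factor in factors:              # assumption: one factor per iteration
        if len(remaining) < 2:
            break
        threshold = mean(remaining) + factor * pstdev(remaining)
        found = [s for s in remaining if s > threshold]
        dominant.extend(found)
        # Take the items just found out of consideration for the next pass
        remaining = [s for s in remaining if s <= threshold]
    return dominant


print(dominant_iterative([900, 50, 20, 10, 10, 5, 5]))
# [900]
```

Each pass recalculates the average and standard deviation without the items that were already found, so smaller items get a chance in later passes. Whether they should get that chance is exactly what varied from directory to directory.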

Using the Median, Quartiles and the Interquartile Distance

Sounds scary, eh? ;-) If it does, you might want to read the QDirStat documentation on those things; it's a lot simpler than you might think.

https://github.com/shundhammer/qdirstat/blob/master/doc/stats/Median-Percentiles.md

Anyway, the general approach was similar, just with the interquartile distance measured from the median and multiplied by a factor: (Q3 - Q1 + median) * 1.5 (or larger factors).

The result was a bit better than with the average and the standard deviation, but not by much: There was too much variation between directories. Even with a progressively larger factor and multiple iterations.
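As an illustration, here is the formula above applied with Python's `statistics.quantiles()`. The function name is made up, and the quartile method ("exclusive", Python's default) is an assumption; other quartile definitions give different thresholds for small lists, which is part of the problem.

```python
from statistics import quantiles


def dominant_by_iqr(sizes, factor=1.5):
    """Items above (Q3 - Q1 + median) * factor, as in the formula above."""
    if len(sizes) < 2:
        return []
    q1, median, q3 = quantiles(sizes, n=4, method="exclusive")
    threshold = (q3 - q1 + median) * factor
    return [s for s in sizes if s > threshold]


# One huge item: Q1 = 5, median = 10, Q3 = 50, threshold = 82.5
print(dominant_by_iqr([900, 50, 20, 10, 10, 5, 5]))
# [900]

# Two large items: Q3 lands on one of them (350), threshold = 555
print(dominant_by_iqr([400, 350, 50, 40, 30, 20, 10]))
# []
```

In the second list, the large items pull Q3 up so far that neither of them exceeds the threshold anymore.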

The Current Algorithm

What really helped was to throw all the little stuff out of the equation and to approach it in a much more pragmatic way: Highlighting dozens of items in bold is useless. If it's any more than 3 or 5 (or 7 as a maximum), it doesn't help anymore: It's the "focus on everything" approach (Dilbert's boss, anyone?).

Thus the decision to limit the number. And since a sorted list is needed for all the calculations anyway (it uses the same one that is already used for sorting the items by percent or size), it's easy: Just consider a reasonable number of those items, starting with the largest ones.

In this case, 30 was the initial pick, and it proved to be useful: All the little stuff beyond 30 items doesn't add to the information. If a decision cannot be made within 30 items, there is no decision; it's an inconclusive case, there are no dominant items. And that's okay, too; that happens in real life. It's also a result: There is nothing that warrants special attention, they are all roughly the same size.

Within those (no more than) 30 largest items, pick the median: It's the item in the middle of the list. Since we are doing integer calculations here, it may be one position off; who cares. That doesn't change the result in any significant way.

Is the interquartile distance relevant? No, not really; we already picked the high end of the list. We are simply using the median; which for large directories may be near the 3rd quartile (Q3) of the full (much longer) list, or even near the 90th percentile. Who cares; we want to know if the percentage bars make a sharp turn near the start of the list (the largest items). So the threshold is that median * 5. We want at least 3% of the total disk usage of that directory (it's what you can roughly make out visually in the window), and anything 70% and up is always dominant.
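To see where the median of the 30 largest items sits in the full list, here is a small back-of-the-envelope calculation. It is a rough rank-based estimate under the assumption that the median is the item at position 15 of the top 30; the function name is invented for this example.

```python
def top_median_percentile(total_items, max_items=30):
    """Rough percentile (in the full list) of the median of the largest items."""
    considered = min(total_items, max_items)
    median_rank = considered // 2   # 0-based position, counted from the largest
    return 100.0 * (total_items - median_rank) / total_items


for n in (30, 60, 150, 300):
    print(n, top_median_percentile(n))
# 30 50.0
# 60 75.0
# 150 90.0
# 300 95.0
```

So for a directory with 60 items, that median is roughly at Q3 of the full list; for 150 items, it is at about the 90th percentile.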

It's rough and dirty, but it works, and it's even efficient in terms of performance: No additional trees need to be chopped off in Finland to generate more energy for this. It happens in roughly constant time, regardless of directory size.

shundhammer commented 1 year ago

Discussion

This issue is intended for documenting the feature, and for possible changes in the future. For any discussion or questions (which are welcome!), please use GitHub issue #211. For related bugs, please open a new issue.