Closed Lithopsian closed 11 months ago
First of all, thank you for taking the time to dig deep into this since it's a nontrivial area of QDirStat. The code and the algorithm go back until KDirStat 2.3.x from 2003; that's 20 years ago. While porting this from KDirStat / Qt 3 to QDirStat / Qt 5 I simply ported it from the old Qt 3 QCanvas to the new Qt 5 QGraphicsView without changing the algorithm.
It is very well possible that it wasn't an ideal approach. I vaguely recall that there were several different papers, all from the TU Eindhoven, and that I didn't follow their algorithm exactly; for what reason, I don't recall anymore, however.
When I introduced different size methods in early 2020 with QDirStat 1.7 (see issue #134), there were some compromises to make: When and how to use allocated vs. byte size (the nominal size displayed by ls -l
). Each one has its justification and its use case; but confronting users with two different sizes everywhere would quickly overwhelm the average user, and it would make the user interface really unwieldy.
Hence the hybrid (some might say "muddled up") approach: For files, in most places the byte size is used; with some exceptions like sparse files (which are rare in today's Linux) and tiny files that consume less than one cluster (typically 4k on most filesystems).
For directories and entire subtrees, the allocated sizes are used in most places. That sounds a bit inconsistent, but in real life, it feels natural and intuitive as outlined in issue #134 (I tried both approaches).
That of course means that there is sometimes a slight difference which becomes obvious in the treemap handling, as you noticed.
The treemap in QDirStat is very useful to give a rough overview. In order of importance:
To show immediately where there are big blobs in the hierarchy. That is by far the most useful aspect of the tremap in an application like QDirStat.
To roughly show file type categories with the colors of treemap tiles. Is the content dominated by any of them? Or is there mostly grey, i.e. "miscelaneous stuff" that cannot easily be categorized?
Are few larger items dominating the disk space usage? Or are they files of roughly similar size?
What it does not do (and it was never intended to) is a 100% correct representation of the file tree. It serves the above specific purposes, though.
As items and thus treemap tiles become smaller, they become much less important; they (literally!) fade into a grey mass. Hence the MinFileSize below which they are not even rendered; the background (their parent directory) "swallows" them. That is QDirStat's way of showing "miscellaneous tiny stuff, who cares". And that is the key aspect of that: Who cares. It's not important.
If you want to see more details on that particular level, use the mouse wheel to zoom in, and they become visible.
But that cut-off below the MinTileSize is a huge performance boost, because some orders of magnitude less treemap tiles need to be rendered. If it's below MinTileSize (3*3 pixels IIRC), you couldn't make out any tile borders anyway, let alone cushion rendering; maybe (but just maybe) a file type category by color; at the cost of having to determine the category even for miniscule tiles.
You can try setting MinTileSize manually to 1 in the config file. I found that it doesn't give any more insights, but it's a considerable performance penalty.
First of all, thank you for taking the time to dig deep into this since it's a nontrivial area of QDirStat. The code and the algorithm go back until KDirStat 2.3.x from 2003; that's 20 years ago. While porting this from KDirStat / Qt 3 to QDirStat / Qt 5 I simply ported it from the old Qt 3 QCanvas to the new Qt 5 QGraphicsView without changing the algorithm.
It is very well possible that it wasn't an ideal approach. I vaguely recall that there were several different papers, all from the TU Eindhoven, and that I didn't follow their algorithm exactly; for what reason, I don't recall anymore, however.
Yes, kdirstat and even windirstat are exactly the same. I was aware that a "more" squarified layout was possible and thought I'd have a look at how complicated it would be to use it in qdirstat. The eureka moment when I realised it was essentially flipping one less-than to a greater-than made me suspect it was just an accident in the original coding. Perhaps a minimal patch just so you can see the basic difference, keep your eyes peeled.
I'll experiment some with that. If it improves the treemap, I am all for it.
Here's a basic patch for the squarification (named*.txt because github won't upload .patch files). It includes some changes to scaling, but everything is still based on totalAllocatedSize(). I'll upload patches for the cushion highlighting on the other thread, but you'll get the best results by applying all the patches. TreemapTile.cpp.1.txt TreemapTile.h.txt
How about a pull request from your own fork to this original repo?
https://github.com/shundhammer/qdirstat/blob/master/doc/GitHub-Workflow.md
I just merged PR #225 into a new feature branch better-tremap for parameter tweaking before merging that branch into master.
See also the summary in #236.
The qdirstat algorithm for laying out squarified treemaps is subtly different from that described by Bruls et al (https://link.springer.com/chapter/10.1007/978-3-7091-6783-0_4). I don't know if this was intentional or a happy accident back in pre-history that produced something that looking like it was working.
The current algorithm lays out tiles in rows along the longer side of the parent rectangle. This produces a series of ever-narrower stripes along the rectangle, each with a series of hopefully near-equal near-squares. It is prone to non-square tiles at the extremes of the long rows, and the last row in particular produces stretched tiles. See the attached screenshot for an example.
The current algorithm is prone to leaving empty space at the end of rectangles. There are a number of reasons for this, and it isn't obvious when a minTileSize is set. However, even with minTileSize=0, there can be empty space. This is partly due to some rounding errors (much worse for the non-squarified code) and partly because the parent totalAllocatedSize doesn't match the sum of the children totalAllocatedSizes. In this particular case (theme icons) there are a lot of symlinks which have zero allocated space but do take up space. (totalSize is non-zero).
The exact algorithm described in the paper lays out each row along the shorter side of the parent rectangle. This subtle difference means that the rows are shorter, the remaining space stays relatively square, and rows are laid out in roughly-alternating horizontal and vertical directions as the remaining space shrinks. See an attached example, showing the same files as before (note that the shading has also been fixed, see my other issue).