shundhammer / qdirstat

QDirStat - Qt-based directory statistics (KDirStat without any KDE - from the original KDirStat author)
GNU General Public License v2.0

[Suggestion] Add an extension table with stats #45

Closed C0rn3j closed 7 years ago

C0rn3j commented 7 years ago

WinDirStat has this extension table that lets you see files by their extension.

I'd love to see this feature in QDirStat, as I think it'd be helpful to many people. Personally right now I'd like to see how much data I have in image files, but I cannot use QDirStat to find out.

shundhammer commented 7 years ago

Did you see the treemap color configuration? This is the graphical equivalent of what you suggest.

I know that WinDirStat has that sidebar that lists the most used file extensions. The problem with that on Linux (and other Unix-like systems) is that there is no real equivalent: You cannot simply tell a file's type by its filename extension in most cases. That works only for a very limited variety of files, such as images (.jpg, .png, ...), videos (.mp4, .avi, .mkv, ...) and to some extent text or office files. That's basically the ones the QDirStat treemap highlights: Everything in color belongs to any of those categories.

And now look how many grey tiles you see in the treemap. That's "the rest", "miscellaneous", a.k.a. "I have no clue what that stuff might be". In particular, executables don't sport anything like "*.exe" as on Windows, and there are several dozens if not hundreds of files that belong to some software package or the other, but that cannot be identified as such without reading at least part of them: That's what the file command does. While this is realistic to do with one directory at a time (a few screenfuls of ls output), it cannot be done with the 200,000 or so files that typically comprise a Linux root filesystem.

shundhammer commented 7 years ago

qdirstat-kubuntu-root

This is a screenshot of QDirStat displaying the root filesystem on my Kubuntu machine. Notice how much grey stuff you see. The orange tiles are libraries of some kind, but that's already kind of cheating since it assumes all "lib" and all ".so.*" files are libraries (it might also be something completely different).

shundhammer commented 7 years ago

qdirstat-work

This screenshot shows my /work partition where I keep my photos, videos and music, but also some VMware virtual machine images. They consume most space on that partition, yet they cannot be identified (ok, I could edit the MIME type rules for them to give them a color).

shundhammer commented 7 years ago

Verdict: Things just don't work that way on Linux. While some of the file types can be identified, all in all they represent just a small part of the overall picture (literally even). While that file type side box might be of limited interest on Windows (even there I have doubts), it loses all usefulness on a Linux-like system.

When I started the QDirStat rewrite from the KDE 3 KDirStat code, I carefully considered the pros and cons of that WinDirStat feature and decided against it.

C0rn3j commented 7 years ago

Thanks a lot for the explanation and the write-up, I really appreciate it.

I understand if you think that this feature wouldn't have a huge user base, I don't know how other people usually deal with their files.

I get that Linux works mostly without any extensions, so it would be more or less useless for system files, but for user files it could be very helpful.

For example - my use case at the time - enumerating all JPG/PNG files and looking at the file size.

Personally I wouldn't mind only being able to see user files, as managing OS files is a thing for a package management system.

C0rn3j commented 7 years ago

Just now I had to deal with a samba share consisting of 31k files - it was very useful to see which files with which extensions were where (as I was converting lots of images/videos/audio).

QDirStat was helpful, but for an overview by extension and for highlighting all files with a certain extension I had to resort to WinDirStat.

Just wanted to say that the use case is there, even if not as much as on Windows ^^

shundhammer commented 7 years ago

I am curious: What's the information good for that you have 12 GB worth of JPG files vs. 7 GB worth of PNG files in a directory tree? Because that's about all that this view would give you.

In the current QDirStat, by default those very similar file types are grouped together in the same MIME categories (Settings -> Configure -> MIME Categories) so they show up in the same color in the treemap which gives you a much better impression of their relative proportion in the tree, if they are grouped together or scattered around, and if it's a lot of little things or a few big blobs.

And grouping them into categories rather than have another color for each individual filename extension is the only thing that makes sense - there are just not enough colors anybody can reasonably tell apart from each other. Just look at the number of filename extensions for common file types like images, videos, all kinds of office documents.

But if you really need it to be more specific, you are free to configure it to your liking; just add more MIME categories, move the filename extensions around as you like and assign a color.
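A minimal sketch of that grouping idea, in Python rather than QDirStat's C++. The category names, suffix lists, colors and the `category_for` helper are all made up for illustration; they are not QDirStat's actual configuration or code.

```python
# Hypothetical category table: one stable color per category,
# many filename suffixes per category (all values are illustrative).
CATEGORIES = {
    "Images": {"color": "#00ffff", "suffixes": {".jpg", ".jpeg", ".png", ".gif"}},
    "Videos": {"color": "#00ff00", "suffixes": {".mp4", ".avi", ".mkv"}},
}

def category_for(filename):
    """Return (category name, color) for a filename, or (None, grey)."""
    lower = filename.lower()
    for name, cat in CATEGORIES.items():
        if any(lower.endswith(suffix) for suffix in cat["suffixes"]):
            return name, cat["color"]
    return None, "#808080"  # uncategorized files stay grey
```

Moving a suffix to another category, or adding a new category with its own color, only means editing the table; the lookup stays the same.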

And unlike in WinDirStat's file type list, the colors are stable (and configurable per MIME category); WinDirStat has a sequence of colors that always assigns the type with most disk usage (whichever that may be) to color #1, the next most to color #2 etc., so you'll never know by looking at the treemap what each of those colors means; so in WinDirStat, the primary use of that extra panel is to be a legend for the treemap. And it's necessary there because the colors mean something different each time.
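To illustrate why a rank-based legend is unstable, here is a small Python sketch of the assignment scheme described above. It is not WinDirStat code; the palette and the sizes are invented.

```python
PALETTE = ["blue", "red", "green"]  # hypothetical palette, color #1 first

def rank_based_colors(size_by_type):
    """Assign palette colors by descending total size (rank-based legend)."""
    ranked = sorted(size_by_type, key=size_by_type.get, reverse=True)
    return {ftype: PALETTE[i] for i, ftype in enumerate(ranked[:len(PALETTE)])}

scan_a = {"jpg": 12, "png": 7, "mp3": 3}  # sizes in GB, invented
scan_b = {"jpg": 5, "png": 7, "mp3": 3}   # same tree after deleting some JPGs
```

With these numbers, `jpg` gets the first color in `scan_a` but the second one in `scan_b`, so the same color means different things in the two treemaps. A fixed color per category does not have that problem.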

So, the funny thing is that while searching the web for example screenshots, I did not find a single WinDirStat screenshot that did not have the size column of that file type panel cut off - most often it's cut off completely, in a few cases it's only cut off partly. That's how important that information is to people.

So, seriously, what's the use case for that information? I am not going to clutter the display of something as complex as QDirStat with even more stuff that does not add a value to most users.

Also, be aware that QDirStat is not meant as a substitute for the find command. If you need to convert a lot of files that are scattered in a directory tree, find and mogrify (or convert) are your friends.

C0rn3j commented 7 years ago

> Also, be aware that QDirStat is not meant as a substitute for the find command. If you need to convert a lot of files that are scattered in a directory tree, find and mogrify (or convert) are your friends.

I am aware, I was using find/convert/ffmpeg to do the job and QDirStat to locate folders with the files (Since I had to do this guided - as I can only convert some files and not others because of reasons).

Thanks for the tips though. I had never heard of mogrify, and a while back it could have saved me a lot of time >.<

> I am curious: What's the information good for that you have 12 GB worth of JPG files vs. 7 GB worth of PNG files in a directory tree? Because that's about all that this view would give you.

Example - Internet connection here is not the greatest, so uploading/downloading 25GB worth of JPG files when syncing to a remote backup server takes a while. After converting some 5GB of JPG files to WebP, I could easily see that I just saved about 2GB, now that I have 20GB worth of JPG files and 3GB of WebP files.

I could see that I get some near 50% compression rate and weigh the benefits of converting everything to WebP vs keeping some or most files in JPG for compatibility with a site that does not support WebP, to avoid the hassle of converting back to JPG when those files need to be used again.

Really helps with arranging files when you can instantly see "Hey, 20GB of your files are JPG, 2GB are BMP files which could easily be 10MB total after converting and you have 2GB of PDF files, you monster".

EDIT: I wouldn't really need a complete clone of how WinDirStat does it (with the selecting magic and such)
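For reference, the kind of per-extension summary asked for here can be sketched in a few lines of Python. This is only an illustration of the idea, not how QDirStat or WinDirStat collect their data; the `size_by_suffix` name and the `<no extension>` label are choices made for this sketch.

```python
import os
from collections import defaultdict

def size_by_suffix(root):
    """Return {suffix: [file_count, total_bytes]} for regular files under root."""
    stats = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            suffix = os.path.splitext(name)[1].lower() or "<no extension>"
            stats[suffix][0] += 1
            stats[suffix][1] += os.path.getsize(path)
    return dict(stats)
```

Running it before and after a conversion shows how the totals shift between, say, `.jpg` and `.webp`.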

shundhammer commented 7 years ago

I did some experimenting. It's not very pretty yet, but it shows the principle:

qdirstat-file-type-stats

Notice that right now it's just text mode because I didn't have the slightest clue if it would warrant the additional work to make it pretty with a table and everything.

So far, this code lives in a Git branch. If you know how to do that, you can check it out, build it and see it live on your machine:

https://github.com/shundhammer/qdirstat/tree/huha-extension-stats

After the usual git clone, change to that branch:

git checkout huha-extension-stats

then build QDirStat as always:

qmake
make
sudo make install

You get to that new window with menu View -> File Type Statistics....

shundhammer commented 7 years ago

Complete output from my /work partition:

Videos:                        108     10.4 GB
Music:                        2879     14.5 GB
Compressed Archives:           104      3.6 GB
Documents:                    5889    471.3 MB
Uncompressed Archives:           4     44.3 MB
Junk:                            6     31.1 kB
Source Files:                23771    239.6 MB
Object or Generated Files:     478     64.0 MB
Compressed Files:              784      1.2 GB
Images:                      45572     37.6 GB
Uncompressed Images:            68      5.5 MB
Libraries:                      69    404.5 MB

<No extension> :    4275    515.0 MB
*.0-b1 :              6     75.0 kB
*.0-b2 :              4     20.3 kB
*.0-rc1 :             2     32.7 kB
*.0-tp1 :             2      1.7 kB
*.ac :                3     48.5 kB
*.aidl :             10     14.5 kB
*.am :               38     69.3 kB
*.app :              15     10.0 kB
*.asm :               4     52.1 kB
*.atn :               3    144.1 kB
*.awk :               4      8.0 kB
*.bak :               6     31.1 kB
*.bar :               4     0 Bytes
*.bat :              17     22.0 kB
*.bc :                3    705.4 kB
*.bcc :               2      2.8 kB
*.bdf :              28      6.7 MB
*.before :           15     13.8 kB
*.bin :             176     11.5 MB
*.bmp :              36      1.4 MB
*.bod :               2      4.1 kB
*.bz2 :              28    214.9 MB
*.c :               820     25.2 MB
*.cache :            16    393.3 kB
*.cbproj :            3     27.4 kB
*.cc :              196      2.1 MB
*.cert :              4     16.6 kB
*.cfg :             277    534.8 kB
*.cgi :              23      4.0 kB
*.cmake :           118    146.6 kB
*.cmd :               2      6.6 kB
*.common :            2     78.6 kB
*.conf :            308    474.1 kB
*.cpp :            9715    150.5 MB
*.craft :            32      2.1 MB
*.crt :              11     26.5 kB
*.css :            2569      5.7 MB
*.csv :               4      1.2 kB
*.cxx :               2      8.7 kB
*.dat :              20     47.6 kB
*.data :           3019    251.7 kB
*.db :               16     13.9 MB
*.deb :               2     39.0 MB
*.debug :             2      1.6 MB
*.def :              39      4.6 MB
*.dejavu :            2      7.0 kB
*.der :              48     16.4 kB
*.desktop :         257    100.6 kB
*.diff :             54    287.9 kB
*.directory :        39      4.1 kB
*.dj :                3      8.1 kB
*.dll :              31     17.4 MB
*.doc :               2    102.4 kB
*.docx :              2     59.0 kB
*.dot :              58     30.6 kB
*.dox :              15     29.1 kB
*.dsp :               3     34.0 kB
*.dsw :               3      1.5 kB
*.el :                5    145.2 kB
*.elc :               4     68.1 kB
*.ent :             105      7.0 kB
*.eps :               3    551.3 kB
*.err :             129     23.7 kB
*.exe :               8     40.8 MB
*.expect :           11      1.9 kB
*.ext4 :              4      7.5 MB
*.flm :               5      5.6 kB
*.footer :            7    105.6 kB
*.frm :              82    766.4 kB
*.ftl :             104    159.5 kB
*.g :                10    535.6 kB
*.gccxml :            2      5.3 kB
*.gif :             746      9.3 MB
*.gitignore :       432      8.4 kB
*.glsl :              4      1.5 kB
*.golden :            2      5.7 kB
*.guess :             6    263.6 kB
*.gyp :              74     71.3 kB
*.gz :              756    988.6 MB
*.h :              9362     54.3 MB
*.header :           10     31.0 kB
*.heu :               4   603 Bytes
*.hh :               50    602.1 kB
*.hlsl :              2      2.0 kB
*.hpp :              91    652.3 kB
*.htm :               2     37.0 kB
*.html :            277      3.0 MB
*.ibd :              16     32.0 MB
*.ico :              39      1.2 MB
*.idl :             187    571.7 kB
*.idx :              10    731.5 kB
*.img :               3     41.2 MB
*.in :              116    664.6 kB
*.includecache :      29    251.0 kB
*.inf :              37     11.0 kB
*.info :              4      1.5 kB
*.ini :              48     13.9 MB
*.internal :         29     76.4 kB
*.jar :              33     36.1 MB
*.java :            520      3.7 MB
*.jpeg :              8      2.2 MB
*.jpg :           28311     37.5 GB
*.js :             1543     11.0 MB
*.json :            202      4.4 MB
*.jsonlz4 :           9     35.0 kB
*.key :               9      8.9 kB
*.l :                 3     16.5 kB
*.ldb :              24     11.2 MB
*.lib :              16     72.2 kB
*.little :            2      2.3 MB
*.localstorage :     431     45.7 MB
*.log :             279     12.0 MB
*.lst :               4      1.4 kB
*.m :                 2      6.4 kB
*.m4 :               75      1.4 MB
*.mac :               2      2.3 kB
*.macros :            4     39.4 kB
*.make :            131    471.3 kB
*.manx :              2      2.3 kB
*.mbm :             444    788.1 MB
*.mc6 :               2      3.2 kB
*.md :                4     15.6 kB
*.md5 :               3      8.2 kB
*.mdb :               3    327.3 kB
*.mf :               38    109.0 kB
*.mingw :             2      9.2 kB
*.mk :              144    203.0 kB
*.mm :              283      3.2 MB
*.mmp :               3      7.7 kB
*.mng :               5    100.6 kB
*.mp3 :            2587     14.4 GB
*.mp4 :              95      6.9 GB
*.msf :               5     49.2 kB
*.mu :              271     18.0 MB
*.myd :              47      1.3 MB
*.myi :              47      1.6 MB
*.nib :               6     14.6 kB
*.nop :               5    153.7 kB
*.ntc :               2      1.2 kB
*.o :               201     52.0 MB
*.odp :               1      1.5 MB
*.ods :              17    399.5 kB
*.odt :               5    211.7 kB
*.ogg :               6      1.9 MB
*.old :              25     15.1 kB
*.otf :               2     82.1 kB
*.ott :               4    130.2 kB
*.out :               5     17.9 kB
*.pack :             10    109.3 MB
*.patch :            60   1015.1 kB
*.pbm :               7      1.2 kB
*.pbxproj :          11    671.9 kB
*.pdf :             123    362.5 MB
*.pem :             111    111.4 kB
*.pfa :              26      2.3 MB
*.pfb :              16    584.3 kB
*.pgm :               5      2.5 kB
*.php :               2      3.0 kB
*.pl :               72    247.3 kB
*.plist :             4      2.7 kB
*.pm :               11     96.6 kB
*.png :           16403    111.8 MB
*.po :               43     10.2 MB
*.pot :               2    550.0 kB
*.pov :               5      4.0 kB
*.ppm :              16      3.7 MB
*.prefs :            10     59.6 kB
*.prf :             138    152.6 kB
*.pri :             302    349.6 kB
*.prj :               2      1.1 kB
*.pro :            2788    885.1 kB
*.properties :      359      4.0 MB
*.ps :                2     11.2 kB
*.pset :             14      3.8 MB
*.pubkey :            8      1.7 kB
*.pump :              3     27.0 kB
*.py :              106    848.7 kB
*.q42 :               2      1.3 kB
*.qch :              12    800.0 kB
*.qdoc :            647      6.9 MB
*.qdocinc :          22     97.5 kB
*.qhc :               7     68.0 kB
*.qhp :               2      3.4 kB
*.qm :               42    582.4 kB
*.qml :             969      4.3 MB
*.qpf :              58      3.0 MB
*.qpf2 :              2    792.9 kB
*.qph :              15    292.9 kB
*.qps :             189    411.2 kB
*.qrc :             152     19.6 kB
*.qs :               25     29.2 kB
*.qsnap :           166    136.2 kB
*.qss :              17     32.3 kB
*.qtt :              22     71.2 kB
*.raw :               2     34.4 kB
*.rb :                3     18.9 kB
*.rc :               19     74.7 kB
*.readme :           11      1.1 kB
*.ref :             868    574.0 kB
*.result :           46     74.4 kB
*.rpm :              18      1.1 GB
*.rsa :              28    213.7 kB
*.rsh :              16    406.9 kB
*.rss :               3      9.9 kB
*.run :               5    147.0 MB
*.s :                23    277.5 kB
*.sample :          116    171.1 kB
*.san :               4   432 Bytes
*.sas :               2      2.3 kB
*.sbstore :          14      5.4 MB
*.sci :              18      1.8 kB
*.sdp :               1   694 Bytes
*.sed :               3      1.1 kB
*.sf :               28    104.4 kB
*.sh :               50      1.3 MB
*.sic :               2      1.4 kB
*.sin :               2      1.1 kB
*.sk :               22     66.1 kB
*.skin :              8      7.5 kB
*.sln :               8     94.1 kB
*.so :               24      2.5 MB
*.sol :             150     15.7 kB
*.spec :              3     17.7 kB
*.st :                2      2.6 kB
*.stdout :            3     54.0 kB
*.sub :               6    201.5 kB
*.svg :             541     11.1 MB
*.svgz :              5     13.1 kB
*.svn-base :       1750      4.6 MB
*.swz :               4      1.1 MB
*.sxi :               5      8.3 MB
*.table :             3      5.8 kB
*.tar :               4     44.3 MB
*.tdb :               4     68.0 kB
*.tex :              26     73.8 kB
*.tga :              14     22.6 MB
*.tif :               9      4.7 MB
*.tiff :             34    707.6 kB
*.tlb :               2      2.4 kB
*.toc :              34    273.9 kB
*.trace :            12      3.8 MB
*.truecolor :        27    318.3 kB
*.ts :              220     13.2 MB
*.tst :               8     0 Bytes
*.ttf :              63      8.8 MB
*.txt :             792     81.2 MB
*.ui :              503      4.7 MB
*.vbr :              11   848 Bytes
*.vc :                2      2.6 kB
*.vcproj :           96      2.5 MB
*.vmdk :             44     25.0 GB
*.vms :               2      1.9 kB
*.vmx :               2      4.9 kB
*.vmxf :              2      3.2 kB
*.vsprops :          11     23.7 kB
*.wat :               2      2.2 kB
*.wav :             286     15.3 MB
*.webm :              8      3.5 GB
*.wmf :               2      7.4 kB
*.wml :               2      2.7 kB
*.woff2 :            21    128.0 kB
*.xbel :             11     72.9 kB
*.xbm :              83    252.8 kB
*.xcconfig :         21     39.9 kB
*.xcf :               4    332.9 kB
*.xls :               6    182.5 kB
*.xml :            2051      8.2 MB
*.xpi :               4      2.4 MB
*.xpm :              61      1.0 MB
*.xq :              137     22.3 kB
*.xsd :              23     58.5 kB
*.xsl :              21    137.0 kB
*.y :                 4    178.3 kB
*.yy :                2     13.0 kB
*.zip :              51      2.5 GB

shundhammer commented 7 years ago

The list is still sorted by filename extensions. Once there is a proper list widget, of course the user will be able to switch the sort order between extensions, number of items, and total size. There will also be a percent column.

I will probably add tabs to switch between the categories (seen here on top) and the extensions -- just to avoid confusion because otherwise items will be counted twice, and the sum of all percentages will be way above 100%.
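A rough Python sketch of what such a list needs: a percent column and three sort orders. The file counts are taken from the listing above; the sizes are rounded to MB. The `stats_rows` helper is hypothetical and not part of QDirStat.

```python
def stats_rows(stats):
    """Turn {suffix: (count, size)} into rows with a percent-of-total-size column."""
    total = sum(size for _count, size in stats.values())
    return [
        {"suffix": s, "count": c, "size": z,
         "percent": 100.0 * z / total if total else 0.0}
        for s, (c, z) in stats.items()
    ]

# (file count, rough size in MB)
rows = stats_rows({"*.jpg": (28311, 37500), "*.mp3": (2587, 14400), "*.png": (16403, 112)})
by_size = sorted(rows, key=lambda r: r["size"], reverse=True)
by_count = sorted(rows, key=lambda r: r["count"], reverse=True)
by_suffix = sorted(rows, key=lambda r: r["suffix"])
```

The percentages only add up to 100% if each file is counted in exactly one row. That is why categories and extensions should be totalled separately rather than mixed in one list.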

shundhammer commented 7 years ago

I still don't know how useful this really is, but usefulness is decided by the users. Maybe people will really get creative about what to do with this.

One problem was that on a Linux filesystem there is a lot of "cruft" that accumulates in statistics of this kind: Unlike on Windows, a dot does not only serve to separate a file's base name from its extension; it is also used often enough as a general purpose character in filenames. It's really hard to tell automatically what is a real filename extension and what comes from the crazed imagination of some Linux developer who thinks dots in filenames are just great. For example, look at your /var/cache/apt/archives/ directory. There, you will find such gems as

linux-image-3.13.0-107-generic_3.13.0-107.154_amd64.deb
nvidia-340_340.101-0ubuntu0.14.04.1_amd64.deb

There are dots all over the place. Of course, a human can easily tell that in this case, .deb is the relevant extension. In other cases, it's not so simple; for a home-backup.tar.bz2 I really want to have the .tar.bz2, not just the .bz2.
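One way to handle both cases, sketched in Python: check a short list of known multi-part suffixes first, then fall back to the text after the last dot. The list and the `suffix_of` helper are assumptions for this example; the real code in the branch may work differently.

```python
# Hypothetical list of multi-part suffixes that should be kept together.
COMPOUND_SUFFIXES = (".tar.bz2", ".tar.gz", ".tar.xz")

def suffix_of(filename):
    """Return the suffix to use for statistics, or "" if there is none."""
    lower = filename.lower()
    for compound in COMPOUND_SUFFIXES:
        if lower.endswith(compound) and len(lower) > len(compound):
            return compound
    dot = lower.rfind(".")
    # In this sketch, a leading dot or a trailing dot is not an extension.
    if dot <= 0 or dot == len(lower) - 1:
        return ""
    return lower[dot:]
```

For the two package files above this yields `.deb`, and for `home-backup.tar.bz2` it yields `.tar.bz2`.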

In my new code, I used a lot of heuristics, and they might result in false positives or negatives.

For example, QDirStat already has a class called MimeCategorizer to figure out the treemap colors. It already knows a lot of filename extensions (and regexp rules). So I found another use for that class here when trying to figure out bona fide extensions (as opposed to random crap where people misused the concept of filename extensions to do any random stuff).

If the MimeCategorizer knows a suffix (a filename extension), I believe it. Not (only) because I wrote it, but because those lists are carefully hand-crafted.

And then there are the real heuristics if the MimeCategorizer doesn't know a suffix; beware, there be dragons and all. ;-)

If a suffix consists solely of numbers, it's very likely that it's not anything useful, so those files are discarded (disregarded in the statistics).

If a suffix has 3 letters, it's very likely a valid filename extension. Those are kept.

If a suffix is very long, and there are very few files of that type, it's probably also cruft, and those files are disregarded.

Etc. etc. etc.; see FileTypeStats::isCruft() at

https://github.com/shundhammer/qdirstat/blob/huha-extension-stats/src/FileTypeStats.cpp#L195
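The rules listed above can be paraphrased in Python roughly as follows. This is a sketch of the description in this comment only, not a port of `FileTypeStats::isCruft()`; the known-suffix set and both thresholds are assumptions, and the real function has more rules.

```python
KNOWN_SUFFIXES = {"jpg", "png", "tar.bz2", "deb"}  # stand-in for the categorizer's lists
MAX_LEN = 8     # hypothetical: what counts as "very long"
MIN_FILES = 5   # hypothetical: what counts as "very few files"

def is_cruft(suffix, file_count):
    """Guess whether a suffix (without the leading dot) is not a real extension."""
    if suffix in KNOWN_SUFFIXES:
        return False   # trusted, hand-maintained list
    if suffix.isdigit():
        return True    # purely numeric, e.g. version fragments
    if len(suffix) == 3 and suffix.isalpha():
        return False   # three letters: very likely a real extension
    if len(suffix) > MAX_LEN and file_count < MIN_FILES:
        return True    # long and rare: probably noise
    return False       # undecided cases are kept in this sketch
```

The order matters: a suffix the categorizer already knows is never second-guessed by the later heuristics.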

To give an impression just how much such cruft there is, here is some log output while removing it:

2017-02-10 19:08:33.065 [17136] <Debug>   FileTypeStatsWindow.cpp:151 removeCruft():  
Removing cruft *. bar
Removing cruft *.0
Removing cruft *.0-beta1
Removing cruft *.0-beta2
Removing cruft *.0-beta3
Removing cruft *.0-beta4
Removing cruft *.0-beta5
Removing cruft *.0-beta6
Removing cruft *.0-garden
Removing cruft *.00
Removing cruft *.00beta1
Removing cruft *.00beta2
Removing cruft *.00beta3
Removing cruft *.1
Removing cruft *.11
Removing cruft *.1c
Removing cruft *.2
Removing cruft *.2-tower
Removing cruft *.200610
Removing cruft *.2337206
Removing cruft *.2ceping
Removing cruft *.3
Removing cruft *.30
Removing cruft *.31
Removing cruft *.39-19980327
Removing cruft *.39-19980406
Removing cruft *.39-19980414
Removing cruft *.39-19980506
Removing cruft *.39-19980529
Removing cruft *.39-19980611
Removing cruft *.39-19980616
Removing cruft *.39-19980623
Removing cruft *.39-19980625
Removing cruft *.39-19980706
Removing cruft *.3ce-tp1
Removing cruft *.3ceconan
Removing cruft *.3cekicker
Removing cruft *.3cesweetandsour
Removing cruft *.3rc
Removing cruft *.4
Removing cruft *.4-temple
Removing cruft *.40
Removing cruft *.41
Removing cruft *.42
Removing cruft *.5
Removing cruft *.6
Removing cruft *.7
Removing cruft *.8
Removing cruft *.92
Removing cruft *.93
Removing cruft *.94
Removing cruft *.95
Removing cruft *.96
Removing cruft *.98
Removing cruft *.99
Removing cruft *.agl
Removing cruft *.angle
Removing cruft *.apk
Removing cruft *.aspx
Removing cruft *.balrog
Removing cruft *.basea
Removing cruft *.baseb
Removing cruft *.bau
Removing cruft *.bcb3
Removing cruft *.bh
Removing cruft *.bjson
Removing cruft *.bk
Removing cruft *.blob
Removing cruft *.boskow
Removing cruft *.browser
Removing cruft *.c++
Removing cruft *.cache-4
Removing cruft *.cbddmfm8acchbddu8q-
Removing cruft *.cer
Removing cruft *.cfs
Removing cruft *.charter
Removing cruft *.cht
Removing cruft *.cnf
Removing cruft *.colors
Removing cruft *.config
Removing cruft *.confml
Removing cruft *.courier
Removing cruft *.cs
Removing cruft *.csm
Removing cruft *.csproj
Removing cruft *.cursor
Removing cruft *.cvsignore
Removing cruft *.db-journal
Removing cruft *.dbf
Removing cruft *.dbt
Removing cruft *.dic
Removing cruft *.dif
Removing cruft *.digest-md5
Removing cruft *.digest-sha1
Removing cruft *.disabled
Removing cruft *.docbook
Removing cruft *.doxyfile
Removing cruft *.dsc
Removing cruft *.dtd
Removing cruft *.end
Removing cruft *.eot
Removing cruft *.error
Removing cruft *.errors
Removing cruft *.ext1
Removing cruft *.ext2
Removing cruft *.filter
Removing cruft *.flex
Removing cruft *.gitattributes
Removing cruft *.global
Removing cruft *.glslf
Removing cruft *.glslv
Removing cruft *.gperf
Removing cruft *.graphml
Removing cruft *.groovy
Removing cruft *.groupproj
Removing cruft *.gypi
Removing cruft *.h-vms
Removing cruft *.helvetica
Removing cruft *.hrh
Removing cruft *.hxx
Removing cruft *.ibm
Removing cruft *.icc
Removing cruft *.icns
Removing cruft *.icon
Removing cruft *.ics
Removing cruft *.ics~
Removing cruft *.implml
Removing cruft *.in0
Removing cruft *.in1
Removing cruft *.incl_cpp
Removing cruft *.init
Removing cruft *.install
Removing cruft *.iso
Removing cruft *.kcfg
Removing cruft *.kcfgc
Removing cruft *.keyring
Removing cruft *.keystore
Removing cruft *.knsregistry
Removing cruft *.krazy
Removing cruft *.kwl
Removing cruft *.lck
Removing cruft *.lexgen
Removing cruft *.linux
Removing cruft *.list
Removing cruft *.localstorage-journal
Removing cruft *.lock
Removing cruft *.manifest
Removing cruft *.md5sum
Removing cruft *.meta
Removing cruft *.min
Removing cruft *.mingwdll
Removing cruft *.mk4
Removing cruft *.modulemap
Removing cruft *.ncc
Removing cruft *.netrwhist
Removing cruft *.nothing
Removing cruft *.nvram
Removing cruft *.opml
Removing cruft *.opml~
Removing cruft *.opt
Removing cruft *.p12
Removing cruft *.pc
Removing cruft *.pkg
Removing cruft *.ppd
Removing cruft *.pub
Removing cruft *.pxa
Removing cruft *.qdocconf
Removing cruft *.qhcp
Removing cruft *.qmlproject
Removing cruft *.qnx
Removing cruft *.reply
Removing cruft *.resource
Removing cruft *.resx
Removing cruft *.rom
Removing cruft *.rpath
Removing cruft *.rules
Removing cruft *.salt
Removing cruft *.sauron
Removing cruft *.scratchbox
Removing cruft *.sdv
Removing cruft *.settings
Removing cruft *.solaris
Removing cruft *.strings
Removing cruft *.suse
Removing cruft *.swf
Removing cruft *.sxw
Removing cruft *.tag
Removing cruft *.tarlist
Removing cruft *.tbcache
Removing cruft *.theme
Removing cruft *.thm
Removing cruft *.tox
Removing cruft *.trx
Removing cruft *.unifont
Removing cruft *.unix
Removing cruft *.utopia
Removing cruft *.uu
Removing cruft *.v20130118-173121-9mf7ghydg0b5kx4e_skfzv-1mnjvatf67zab7
Removing cruft *.v20130129-152330-7iaraabrmqkgsvmgqnulz-dqz00h
Removing cruft *.vb
Removing cruft *.vbproj
Removing cruft *.vbs
Removing cruft *.vcf
Removing cruft *.vcwin32
Removing cruft *.vera
Removing cruft *.vmls
Removing cruft *.vmsd
Removing cruft *.vmsn
Removing cruft *.vs
Removing cruft *.wait
Removing cruft *.win32
Removing cruft *.woff
Removing cruft *.xba
Removing cruft *.xlb
Removing cruft *.xlsm
Removing cruft *.xsd-license
Removing cruft *.xspf
Removing cruft *.ypp

shundhammer commented 7 years ago

Please notice that so far this is only experimental; that's why the code lives in that Git branch. There are no promises yet that this will make it into Git master (the code main line) anytime soon.

Apart from the missing pretty user interface (and there will be no permanent text mode stuff like this experimental code in QDirStat), there are still issues like how to handle tree refresh and updating this information. So far, this is only very static and thus only a snapshot of the current situation; it gets outdated whenever the user starts a cleanup action, refreshes the tree or a branch of it.

shundhammer commented 7 years ago

In the final version, the list will probably be restricted to show only the top 50 or so; the 50 filename extensions with most size/percentage and the 50 with most files (so it will be more than 50, 100 in the worst case). Or only the top 20. I don't know yet. Showing everything where most filename extensions have only a very small number of files or a very small size percentage doesn't seem to make that much sense.

The number will be configurable, of course.
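The "top N by size plus top N by file count" selection could look like this in Python. It is a sketch only; `top_suffixes` is a hypothetical helper and the input format `{suffix: (count, size)}` is an assumption.

```python
def top_suffixes(stats, n=50):
    """Union of the n largest suffixes by total size and the n largest by file count."""
    by_size = sorted(stats, key=lambda s: stats[s][1], reverse=True)[:n]
    by_count = sorted(stats, key=lambda s: stats[s][0], reverse=True)[:n]
    return set(by_size) | set(by_count)
```

The result holds between n and 2n suffixes (fewer if the input is smaller), which matches the "100 in the worst case" estimate for n = 50.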

C0rn3j commented 7 years ago

Thank you, this is awesome! I'll try to check the branch out soon (on vacation atm), but it looks like the exact thing I was missing!

Seriously, thank you for even being willing to add this feature in a separate branch!

shundhammer commented 7 years ago

Not so fast, my friend, we are not done yet... :smile:

I turned this very crude text display into something a lot more Qt-ish: A multi-column tree with the MIME categories and corresponding suffixes (the filename extensions) that belong to that category below it. And since there are a lot of suffixes that don't belong to any of the predefined categories, there is now a category "Other" for them. But I restricted that to the "Top 20" (i.e. the 20 suffixes with the most total sizes). That number will be (but is not yet) configurable.
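The structure behind such a tree can be sketched in Python as a dictionary of categories, each holding its suffixes, with everything unknown collected under "Other" and cut down to the largest entries. The `build_tree` helper and its input formats are assumptions for this example, not QDirStat code.

```python
def build_tree(stats, suffix_to_category, top_other=20):
    """Group {suffix: (count, size)} under categories; unknown suffixes go to "Other"."""
    tree = {}
    for suffix, entry in stats.items():
        category = suffix_to_category.get(suffix, "Other")
        tree.setdefault(category, {})[suffix] = entry
    if "Other" in tree:
        # Keep only the top_other suffixes with the largest total size.
        largest = sorted(tree["Other"].items(), key=lambda kv: kv[1][1], reverse=True)
        tree["Other"] = dict(largest[:top_other])
    return tree
```

Note that whatever is cut from "Other" no longer shows up anywhere in the tree, so the visible sizes add up to less than the total.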

Screenshots:

My /work partition:

qdirstat-file-type-stats-work

My root partition:

qdirstat-file-type-stats-root

My Windows C: drive:

qdirstat-file-type-stats-win-sys

My Windows D: drive that hosts mostly games:

qdirstat-file-type-stats-win-app

By now you can probably tell what my favourite game is... :smile:

shundhammer commented 7 years ago

Now that I invested so much work into that, it won't go away anymore; it will make it to Git master for sure. :smile:

I am still not 100% sure how useful it really is. I did get some unexpected insights into my disk usage, though, and that might be a sign that it's not quite as useless as I initially thought. :smile:

But you can see from these screenshots that your mileage will definitely vary:

Also, there is not yet any kind of communication between this new window and the internal database, the DirTree: The stats window should update when the tree changes, but without overdoing it (wait a few seconds until things have settled down and only then recalculate the statistics etc.).
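The "wait until things have settled down" part is a classic debounce. Here is a minimal Python sketch with an explicit clock so the logic is easy to follow; the class name, the 3 second default and the polling `tick()` are assumptions. In a Qt application one would more likely restart a single-shot timer on every change.

```python
class DebouncedRefresh:
    """Recalculate only after the tree has been quiet for `delay` seconds."""

    def __init__(self, recalculate, delay=3.0):
        self.recalculate = recalculate
        self.delay = delay
        self.last_change = None  # time of the most recent tree change, if pending

    def tree_changed(self, now):
        self.last_change = now   # every change restarts the waiting period

    def tick(self, now):
        """Call periodically; runs recalculate() once things have settled."""
        if self.last_change is not None and now - self.last_change >= self.delay:
            self.last_change = None
            self.recalculate()
```

A burst of changes during a cleanup action then leads to a single recalculation instead of one per deleted file.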

And of course since I now have that nifty window telling me that there are still junk files around, I want it to find them for me. Right now this is an absolutely passive window (albeit it's non-modal, i.e. you can work in the main window while it is open).

I have some ideas in my head how to do that; maybe applying a filter to the main window based on suffixes I click in the stats window or whatever. But first it needs to get a bit more complete, and it needs to stabilize.

Frankly, I had not expected in the least that QDirStat development would take this turn. But since there are positive results, it's a pleasant surprise. :smiley:

And you know what? Now this thing is a lot better than its WinDirStat counterpart because in WinDirStat that panel mostly serves as a legend to the treemap colors that are ever-changing in WinDirStat. :muscle:

flurbius commented 7 years ago

The best points about this are being able to categorise files that haven't been properly identified before, and thus being able to present some meaningful stats about the whole file system. A few suggestions

shundhammer commented 7 years ago

> You could go beyond just using the MIME type and extension to determine file type - many file types can be identified by looking at the file contents.

This is exactly the one thing that cannot be done. I wrote that in my very first comment here. That would mean reading the first few blocks of every file that cannot be properly identified. For a Linux root filesystem that would mean reading the first few blocks of roughly 80,000 (!) files - everything below the "other" category. Just look at the screenshot of my root filesystem above. How long do you think this would take?

This is only feasible for a handful of files - when you invoke file * in the shell, this is only one directory's worth of files, not 80,000 of them.

shundhammer commented 7 years ago

> Enforcing consistency in the tree structure - I mean making sure that a file only adds to one category (and its parent categories) and not to other branches of the tree. So that percentages are still meaningful and don't end up greater than 100%.

Of course it makes sure that one file only adds to one category. What makes you think it doesn't? Where did you see any percentages that add up to more than 100%?

As a matter of fact, right now it has the exact opposite problem: A large portion of the disk space remains unaccounted for in those statistics. I'll have to check where all that goes missing. Some of it can be explained by directories that also use some disk space, but right now this is way out of proportion. Some more of it may be because of the other file types below "other" that don't belong to the "Top 20" (typically around 100 suffixes), some may be because of stuff that has been classified as "cruft". Probably it can all be explained, but I'd rather make sure and not just guess.

shundhammer commented 7 years ago

> Having the configuration of file types, categories, extensions and how they relate to the MIME types stored in a text file that can then be shared with other users. These can be merged and save time for users and possibly even become useful outside of QDirStat.

% cd ~/.config/QDirStat
% cat QDirStat-mime.conf

[MimeCategory_01]
Color=#ff0000
Name=Junk
PatternsCaseInsensitive=*.bak, *.~
PatternsCaseSensitive=core

[MimeCategory_02]
Color=#00ff00
Name=Compressed Archives
PatternsCaseInsensitive=*.7z, *.arj, *.cab, *.cpio.gz, *.deb, *.jar, *.rar, *.rpm, *.tar.bz2, *.tar.gz, *.tgz, *.zip
PatternsCaseSensitive=
...
...

Ta-da... :smile:

This has been available since 2016-06-29: The config file is now split into four independent ones.
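To illustrate how such a file could be consumed outside of QDirStat, here is a minimal sketch in Python. It is not QDirStat's own loader; the helper names (`split_patterns`, `load_categories`, `category_for`) are made up for this example, and the sample text is just the excerpt shown above:

```python
import configparser
from fnmatch import fnmatchcase

# Excerpt of the listing above, embedded so the example is self-contained.
CONF_TEXT = """
[MimeCategory_01]
Color=#ff0000
Name=Junk
PatternsCaseInsensitive=*.bak, *.~
PatternsCaseSensitive=core

[MimeCategory_02]
Color=#00ff00
Name=Compressed Archives
PatternsCaseInsensitive=*.7z, *.arj, *.cab, *.cpio.gz, *.deb, *.jar, *.rar, *.rpm, *.tar.bz2, *.tar.gz, *.tgz, *.zip
PatternsCaseSensitive=
"""


def split_patterns(value):
    # "a, b, c" -> ["a", "b", "c"]; an empty value yields an empty list.
    return [p.strip() for p in value.split(",") if p.strip()]


def load_categories(text):
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the mixed-case key names as written
    parser.read_string(text)
    categories = []
    for section in parser.sections():
        entry = parser[section]
        categories.append({
            "name": entry.get("Name", section),
            "insensitive": split_patterns(entry.get("PatternsCaseInsensitive", "")),
            "sensitive": split_patterns(entry.get("PatternsCaseSensitive", "")),
        })
    return categories


def category_for(filename, categories):
    # The first category with a matching pattern wins, so a file counts only once.
    for cat in categories:
        if any(fnmatchcase(filename, p) for p in cat["sensitive"]):
            return cat["name"]
        if any(fnmatchcase(filename.lower(), p.lower()) for p in cat["insensitive"]):
            return cat["name"]
    return None


categories = load_categories(CONF_TEXT)
for name in ["backup.BAK", "core", "Core", "photos.TAR.GZ"]:
    print(name, "->", category_for(name, categories))
```

Returning on the first match is one simple way to guarantee that each file contributes to exactly one category; whether QDirStat resolves overlapping patterns in the same order is not something this sketch claims.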

flurbius commented 7 years ago

Sorry, my mistake about the >100%; I had read the following quote and misremembered it:

> I will probably add tabs to switch between the categories (seen here on top) and the extensions -- just to avoid confusion because otherwise items will be counted twice, and the sum of all percentages will be way above 100%.

flurbius commented 7 years ago

I saw the QDirStat-mime.conf file, but that's not quite what I had in mind, though it does go some way towards it. I accept your claim that it is too arduous a task to read all the files (for now), but it may be OK to have a background process that does it for the files that aren't identified by other means, or perhaps only if the user requests it: say they are looking at a subdirectory with a limited number of files, a mere 1,000 or so, or they are prepared to let it chug away while they have dinner.

Obviously a better solution is needed: extensions are a hack and can't be trusted, and reading the contents is primitive, prone to error and also a hack. MIME types are not a hack and the system is extensible, but it already seems to be out of control (too much and at the same time too little); it might just need some standardisation.

I have seen somewhere (not sure which distro or DE) a file explorer that allowed me to sort files by "detailed filetype" (or MIME type), which is the closest I have seen to sorting by extension in Windows Explorer (probably the only Windows feature I actually miss).

C0rn3j commented 7 years ago

@flurbius I think you mean Nautilus (though other file explorers may have the capability too): http://i.imgur.com/FaCWl52.png

shundhammer commented 7 years ago

No, Nautilus does not do that; it also just looks at filename extensions.

I just renamed a .png file to .txt. The file command still shows it correctly as "PNG image data". Nautilus now shows it as "Text". Not even "Properties" from the context menu shows anything else.
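The difference between the two approaches can be reproduced in a few lines. This is a minimal sketch in Python with hypothetical helper names (`type_by_extension`, `type_by_content`); it only knows the PNG signature and is in no way a replacement for the file command:

```python
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the fixed first 8 bytes of every PNG file


def type_by_extension(path):
    # Cheap: only the name is inspected, no I/O on the file itself.
    ext = os.path.splitext(path)[1].lower()
    return {".png": "PNG image", ".txt": "Text"}.get(ext, "unknown")


def type_by_content(path):
    # Needs an open() and a read() per file: the cost discussed in this thread.
    with open(path, "rb") as f:
        header = f.read(len(PNG_SIGNATURE))
    return "PNG image" if header == PNG_SIGNATURE else "unknown"


with tempfile.TemporaryDirectory() as tmp:
    # Simulate the renamed file: PNG content under a .txt name.
    path = os.path.join(tmp, "picture.txt")
    with open(path, "wb") as f:
        f.write(PNG_SIGNATURE + b"rest of the image data")
    by_ext = type_by_extension(path)
    by_content = type_by_content(path)

print(by_ext)       # Text
print(by_content)   # PNG image
```

The extension check gets the renamed file wrong, while the content check gets it right at the price of opening the file.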

shundhammer commented 7 years ago

I found all the missing items: I used the wrong way of iterating over the tree which disregarded the DotEntries. Now that I use the iterator class I had created for just this purpose many years ago, the sums add up to just below 100% (a little space is still used for directory nodes etc.).

C0rn3j commented 7 years ago

> No, Nautilus does not do that; it also just looks at filename extensions.

Oh, my bad!

> the sums add up to just below 100%

Great to hear!

I've tried the branch out and it looks perfect. The only thing I'd like (besides highlighting files by extension, but that's just a nice-to-have) is an All tab (not necessarily with all records in it; it's just that they wouldn't be grouped), kind of like in the list you posted here, just sorted.

Again, thanks a lot for the feature!

shundhammer commented 7 years ago

Merged to Git master.

shundhammer commented 7 years ago

Next things to come: Use the file type stats window to locate the corresponding files in the main window.

I wrote that this is not supposed to be a replacement for the find command, yet when I see stuff like *.bak files in the Junk category, I want to find them to get rid of them.

Right now the idea is to search the tree again for all files with that suffix and list the results in another (again non-modal) window; probably just one entry per directory that contains any of them. When you click on such a result, it would open that branch in the tree of the main window and select (highlight) all files with that suffix.

That will work well with a handful of files; I don't know yet how to do this efficiently with things like my photo collection: It's 29,000+ JPG files scattered over 850+ directories. That's a bit much to navigate in. Would I want only my /work/photos directory to appear there? That's a bit limiting. I'll have to think about that.
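The "one entry per directory" idea boils down to grouping the matching files by their parent directory. A minimal sketch in Python, assuming a flat list of paths instead of the real DirTree; the function name `matches_by_directory` and the sample paths are made up for illustration:

```python
import posixpath
from collections import defaultdict


def matches_by_directory(paths, suffix):
    """Map each directory to the names of its files ending in `suffix` (case-insensitive)."""
    result = defaultdict(list)
    suffix = suffix.lower()
    for path in paths:
        if path.lower().endswith(suffix):
            directory, name = posixpath.split(path)
            result[directory].append(name)
    return dict(result)


paths = [
    "/work/photos/2016/a.jpg",
    "/work/photos/2016/b.JPG",
    "/work/photos/2016/notes.txt",
    "/home/user/old.bak",
    "/work/photos/2015/c.jpg",
]

hits = matches_by_directory(paths, ".jpg")
for directory in sorted(hits):
    # One result line per directory, with the number of matching files in it.
    print(directory, len(hits[directory]))
```

With 29,000+ files in 850+ directories the result list would still have 850+ lines, so grouping alone does not solve the navigation problem described above; it only reduces it.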

shundhammer commented 7 years ago

See https://github.com/shundhammer/qdirstat/issues/48 for that new idea and any follow-up discussions.

shundhammer commented 7 years ago

> the only thing I'd like (besides highlighting files by extension, but that's just a nice to have) is adding an All tab (not necessarily having all records in it, it's just that they wouldn't group), kind of like in the list you posted here, just sorted.

I toyed with that idea, but so far I couldn't come up with a good concept to have this non-disruptive: Right now all the percentages add up to (about) 100%; this is what users expect.

With an All category in the same tree, however, everything would show up twice, so the overall sum would be about 200%. So it would have to be either tabs or a check box (or combo box or radio box) to switch between the existing By Category view and that new All view.
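The arithmetic behind the 200% is easy to check with made-up numbers. A minimal sketch in Python; the sizes and the suffix-to-category mapping are invented for illustration:

```python
# Invented sizes (in MB) per suffix and an invented suffix-to-category mapping.
sizes_by_suffix = {".jpg": 600, ".png": 100, ".mp4": 300}
category_of = {".jpg": "Images", ".png": "Images", ".mp4": "Videos"}

total = sum(sizes_by_suffix.values())

# Each suffix adds to exactly one category.
category_sizes = {}
for suffix, size in sizes_by_suffix.items():
    name = category_of[suffix]
    category_sizes[name] = category_sizes.get(name, 0) + size


def percent(size):
    return 100.0 * size / total


# The existing view: the top-level category rows alone.
by_category = sum(percent(s) for s in category_sizes.values())

# Same tree plus an "All" branch that lists every suffix a second time.
with_all = by_category + sum(percent(s) for s in sizes_by_suffix.values())

print(by_category)   # 100.0
print(with_all)      # 200.0
```

Every byte is counted once under its category and once more under All, which is why the two views would have to be mutually exclusive.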

This would considerably add to the complexity of that window; that's the major gripe the Apple fans always have with Linux things. They love the beauty of the simplicity that is so prevalent in Apple products, and they do have a point there.

So, again, what is the use case for that? A user would surely have a general idea what general category a file type belongs to, right? So given the small number of those categories, any suffix is much easier and quicker to locate right now than having to sift through some 500 suffixes in an All view, right?

And comparing the cumulative sizes of, say, all .jpg against all .mp4 on a disk is literally comparing apples (note the lower-case 'a' here :smiley: ) to oranges. So I don't see the point (but then, we've been there before).

C0rn3j commented 7 years ago

> So, again, what is the use case for that?

Personally I prefer to see everything at once without opening many tree menus.

I will adapt fine to the way it is currently done as the groups themselves are sorted by size, it's just not what I'm used to.

EDIT: Upon further inspection I see you made the categories editable; I'll just create my own monster category with everything if I need it.