shundhammer / qdirstat

QDirStat - Qt-based directory statistics (KDirStat without any KDE - from the original KDirStat author)
GNU General Public License v2.0

Locate files with no extension from file type statistics #130

Closed elicoten closed 3 years ago

elicoten commented 4 years ago

Firstly, thank you so much for #45 and #48. I've been looking for these features for years (evidently the last time I tried QDirStat it was an older version, from before those features were added).

I'd been writing a collection of rather unwieldy shell scripts to try to concoct a series of find commands that would provide the same information, so having it built into QDirStat is a massive help.

I have a few questions / suggestions about this feature that I'd appreciate your thoughts on please:-

  1. Would it be possible to list all the files under "No Extension" - it seems that the "Locate" and "size statistics" options for these files are disabled?

The reason I ask is because I've been trying to find these files to get an idea of what they are. I tried the following find command, but there seems to be some discrepancy in the number of files reported:

$ find \( -not -iname '*.*' -o -iname '.*' \) -not -type d -fprintf noextension.csv '%s,%h,%f\n'

This yields 82,658 files. However according to QDirStat there are actually 83,487 files in my home directory.

  2. Related to this, is it possible to make a way to quickly export/copy & paste all the information shown in the "Locate files by type" window for a given type (I guess some kind of text or csv format would be most convenient for including all the columns) - that way it could be saved to a text file or spreadsheet for further manipulation?

  3. Finally, I note your comments about 'cruft' files. Is there a convenient way to do a similar analysis of these - perhaps give them all a separate category and again a way to see number of cruft files and total size wasted in cruft files per directory, which might be useful in trying to clean up some of them?

Even if the answers to the above are no, no and no - thank you anyway as I really appreciate this feature that you've added.

shundhammer commented 4 years ago

To put this more into perspective so we get an idea of a real world scenario: This is the root filesystem of my Xubuntu 18.04 LTS that I use at home (OS stuff only, home directory and data files are on separate filesystems):

[Image: xubuntu-18-04-lts-root]

[Image: file-types]

As usual, files in the No Extension category make up a whopping 33.96% of all files: 28338 of 74356 total.

This is by no means an exception; this is pretty much the normal case for Linux systems.

The log tells us a bit more about the cruft files:

2020-05-06 10:04:32.574 [4236] <Debug>   FileTypeStats.cpp:223 removeCruft():  Merged 457 suffixes to <NO SUFFIX>: *.0, *.0-0, *.0-0v5, *.0-96-generic, *.0-99-generic, *.0-dev, *.0-libs, *.0-runtime, *.0-vimrc, *.001, *.002, *.003, *.0rv, *.1, *.1-17, *.1-releasenotes, ...
2020-05-06 10:04:32.575 [4236] <Debug>   FileTypeStats.cpp:226 removeCruft():  Merged: 2465 files (2.0 GB)

Making this a bit less unwieldy with some creative sed reformatting: (click on "Details" to open)

``` 2020-05-06 10:04:32.574 [4236] FileTypeStats.cpp:223 removeCruft(): Merged 457 suffixes to : *.0 *.0-0 *.0-0v5 *.0-96-generic *.0-99-generic *.0-dev *.0-libs *.0-runtime *.0-vimrc *.001 *.002 *.003 *.0rv *.1 *.1-17 *.1-releasenotes *.10 *.100 *.11 *.12 *.122 *.126 *.13 *.14 *.146 *.15 *.152 *.16 *.17 *.175 *.18 *.19 *.1976-0 *.1a98-intel-edk2-2-tplg *.2 *.2-7 *.20 *.2018 *.2019 *.2020 *.2021 *.2022 *.21 *.21-key-documentation *.21-rc *.22 *.23 *.24 *.25 *.25nv *.25rv *.26 *.26-x86_64-linux-gnu *.27 *.28 *.29 *.2debian *.3 *.3-x86_64-linux-gnu *.30 *.31 *.32 *.34 *.3v5 *.4 *.4-0 *.4-2 *.43 *.48 *.49 *.5 *.56 *.59 *.6 *.6-minimal *.6-stdlib *.64 *.6_installed *.6m *.7 *.7-7 *.7-minimal *.7-stdlib *.78 *.8 *.9 *.9050 *.9052 *.913d *.97 *.9wmrc-menu *.abs-guide *.accelkeys *.acoustic *.adc65 *.add-on-package-maintainers *.agfa-cl20 *.aic79xx *.aic7xxx *.amd64 *.anacron *.announce *.applications *.argentina *.asm-generic *.atheros_firmware *.atril-backend *.australia *.auth-pam *.automount *.b03 *.b04 *.bash_logout *.bash_profile *.beginners-info *.belgium *.bin-48khz_i2s_master *.bindings *.birthday *.blacklist *.buildinfo *.builtin *.c++ *.ca0132 *.caja-extension *.canonical *.centralized *.cfg-debian *.cfg-texlivedist *.changes *.charset *.chatscript *.checkpatch *.chelsio_firmware *.christian *.cleanup *.clicksmart310 *.cnf-debian *.cnf-texlivedist *.com_opera-stable_dists_stable_inrelease *.com_opera-stable_dists_stable_non-free_binary-amd64_packages *.com_opera-stable_dists_stable_non-free_binary-i386_packages *.com_ubuntu_dists_bionic-backports_inrelease *.com_ubuntu_dists_bionic-backports_main_binary-amd64_packages *.com_ubuntu_dists_bionic-backports_main_binary-i386_packages *.com_ubuntu_dists_bionic-backports_main_i18n_translation-en *.com_ubuntu_dists_bionic-backports_universe_binary-amd64_packages *.com_ubuntu_dists_bionic-backports_universe_binary-i386_packages *.com_ubuntu_dists_bionic-backports_universe_i18n_translation-en 
*.com_ubuntu_dists_bionic-security_inrelease *.com_ubuntu_dists_bionic-security_main_binary-amd64_packages *.com_ubuntu_dists_bionic-security_main_binary-i386_packages *.com_ubuntu_dists_bionic-security_main_i18n_translation-en *.com_ubuntu_dists_bionic-security_multiverse_binary-amd64_packages *.com_ubuntu_dists_bionic-security_multiverse_binary-i386_packages *.com_ubuntu_dists_bionic-security_multiverse_i18n_translation-en *.com_ubuntu_dists_bionic-security_restricted_binary-amd64_packages *.com_ubuntu_dists_bionic-security_restricted_binary-i386_packages *.com_ubuntu_dists_bionic-security_restricted_i18n_translation-en *.com_ubuntu_dists_bionic-security_universe_binary-amd64_packages *.com_ubuntu_dists_bionic-security_universe_binary-i386_packages *.com_ubuntu_dists_bionic-security_universe_i18n_translation-en *.com_ubuntu_dists_bionic-updates_inrelease *.com_ubuntu_dists_bionic-updates_main_binary-amd64_packages *.com_ubuntu_dists_bionic-updates_main_binary-i386_packages *.com_ubuntu_dists_bionic-updates_main_i18n_translation-en *.com_ubuntu_dists_bionic-updates_multiverse_binary-amd64_packages *.com_ubuntu_dists_bionic-updates_multiverse_binary-i386_packages *.com_ubuntu_dists_bionic-updates_multiverse_i18n_translation-en *.com_ubuntu_dists_bionic-updates_restricted_binary-amd64_packages *.com_ubuntu_dists_bionic-updates_restricted_binary-i386_packages *.com_ubuntu_dists_bionic-updates_restricted_i18n_translation-en *.com_ubuntu_dists_bionic-updates_universe_binary-amd64_packages *.com_ubuntu_dists_bionic-updates_universe_binary-i386_packages *.com_ubuntu_dists_bionic-updates_universe_i18n_translation-en *.com_ubuntu_dists_bionic_inrelease *.com_ubuntu_dists_bionic_main_binary-amd64_packages *.com_ubuntu_dists_bionic_main_binary-i386_packages *.com_ubuntu_dists_bionic_main_i18n_translation-de *.com_ubuntu_dists_bionic_main_i18n_translation-en *.com_ubuntu_dists_bionic_multiverse_binary-amd64_packages *.com_ubuntu_dists_bionic_multiverse_binary-i386_packages 
*.com_ubuntu_dists_bionic_multiverse_i18n_translation-de *.com_ubuntu_dists_bionic_multiverse_i18n_translation-en *.com_ubuntu_dists_bionic_restricted_binary-amd64_packages *.com_ubuntu_dists_bionic_restricted_binary-i386_packages *.com_ubuntu_dists_bionic_restricted_i18n_translation-de *.com_ubuntu_dists_bionic_restricted_i18n_translation-en *.com_ubuntu_dists_bionic_universe_binary-amd64_packages *.com_ubuntu_dists_bionic_universe_binary-i386_packages *.com_ubuntu_dists_bionic_universe_i18n_translation-de *.com_ubuntu_dists_bionic_universe_i18n_translation-en *.commemorative *.compiled *.computer *.conf_jack *.conf_oss *.conf_pulse *.configure-options *.contents *.cp1251 *.cp737 *.cputype *.creatortheme *.croatian *.cs4236 *.cw1200 *.cyrtexinfo *.d:20auto-upgrades *.d:50unattended-upgrades *.dat-old *.deb822 *.debian-source *.debianmaints *.deconfig *.default *.defaults *.devhelp2 *.devices *.devname *.dib0700 *.discordian *.disk-spindown *.dosfsck *.down-root *.dpkg-cache *.dpkg-dist *.dtbinst *.dvconnect *.e100 *.echoaudio *.emacs25 *.emerald *.encoder *.ene_firmware *.enigma13 *.example2 *.examples *.exfat-fuse *.extrawarn *.fallback *.feiertag *.ffpreset *.freebsd *.freezer *.functions *.fw2 *.fw_sst_0f28 *.gcc-plugins *.geschichte *.geteltorito *.gpl3-except *.gresource *.gsmart300 *.h_shipped *.headersinst *.hfi1_firmware *.history *.holiday *.holidays *.hungarian *.hwufinndpa *.i2400m *.i915 *.ibt_firmware *.intcsst2 *.in~ *.iosched *.ipu3_firmware *.iputils *.is-installed *.isdn4linux *.it913x *.iwlwifi_firmware *.jed-support *.jferies *.journal~ *.kazakhstan *.konsole *.konsole-256color *.largan-lmini *.legal-displayed *.lib-multi-threading *.licenses *.linux-m1 *.linux-m1b *.linux-m2 *.literatur *.m2d *.m2m *.machine *.mailutils *.manifest *.markdown *.marvell *.mbedtls *.mcommemorative *.mediatek *.megaraid *.messages *.military *.minitel1 *.minitel1-nb *.minitel1b *.minitel1b-80 *.minitel1b-nb *.minolta-dimagev *.mirrors *.mkdosfs *.mlterm-256color 
*.mod16 *.modbuiltin *.modinst *.modpost *.modsign *.modulemap *.monitor *.monthly *.mschap81 *.multiarch *.myri10ge_firmware *.net_graphics-drivers_ppa_ubuntu_dists_bionic_inrelease *.net_graphics-drivers_ppa_ubuntu_dists_bionic_main_binary-amd64_packages *.net_graphics-drivers_ppa_ubuntu_dists_bionic_main_binary-i386_packages *.net_graphics-drivers_ppa_ubuntu_dists_bionic_main_i18n_translation-en *.netronome *.network *.nevnapok *.newzealand *.nimproject *.notifyrc *.nullmodem *.o_binary *.openbsd *.optimization *.options *.ordering *.original *.orthodox *.override *.p4b *.p7s *.panasonic *.panasonic-coolshot *.panasonic-l859 *.panel-applet *.pc8 *.pccam300 *.pccam600 *.platform *.platforms *.postlink *.powerpc *.pppol2tp *.praznici *.preempt *.previous *.problems *.properties *.proverbes *.putty-256color *.putty-m1 *.putty-m1b *.putty-m2 *.python_switch *.qat_firmware *.qdocinc *.qla1280 *.qla2xxx *.qmlproject *.qualcommatheros_ar3k *.qualcommatheros_ath10k *.r8a779x_usb3 *.ralink_a_mediatek_company_firmware *.recommended *.recursion-issue-01 *.recursion-issue-02 *.release *.replchars *.request *.rockchip *.rtinstall *.rtremove *.runtime_deps *.russian *.schemas *.scowl-word-lists-used *.screenshots *.scripts *.sdma_firmware *.security *.select-break *.select-ispell *.select-wordlist *.sf7 *.shutdown *.signature *.softdep *.sonydscf1 *.soundvision *.southafrica *.specchars *.st2205 *.static-ip *.substvars *.supported *.symvers *.sys-old *.targets *.teraterm *.termcap *.terminfo *.thumbnailer *.ti-connectivity *.ti-keystone *.toshiba-pdrm11 *.tp6801 *.translations *.translators *.twmrc-menu *.ubraille *.ucf-dist *.udisks2 *.ueagle-atm4-firmware *.ukrainian *.unitedkingdom *.unnepek *.usb-quirks *.usholiday *.v2 *.vbox-extpack *.vcxproj *.vendor-strings *.version *.via_vt6656 *.vim-tiny *.vsmacros *.vte-256color *.warning *.whiptail *.wissenschaft *.wl1251 *.wrapper *.x86 *.xc4000 *.xc5000 *.xc5000c *.xcscheme *.xcsettings *.xmlcatalogs *.xterm-256color 
*.xterm-new *.xterm-r6 *.xterm-xfree86 *.z77 *.zd1201 *.zd1211 ```

That's 457 suffixes.

shundhammer commented 4 years ago

Freak accident? Are root filesystems so special in having a lot of files that cannot easily be categorized? Okay, let's give it another try: My /work filesystem where I keep my photos, my music collection, some videos, and of course the source directories (git checkouts) where I work:

[Image: work]

[Image: file-types-work]

Okay, that's considerably fewer files in the No Extension category: Only 0.19%, but still 11728 out of 83368 total.

So what about cruft here? Let's check the log:

``` 2020-05-06 10:26:40.948 [4716] FileTypeStats.cpp:223 removeCruft(): Merged 416 suffixes to : *. : extinction agenda, *. bar, *. the 3rd reich, *.0, *.0-beta1, *.0-beta2, *.0-beta3, *.0-beta4, *.0-beta5, *.0-beta6, *.0-garden, *.0-rc1, *.0-tp1, *.00, *.0005, *.00beta1, *.00beta2, *.00beta3, *.01, *.02, *.03, *.04, *.05, *.0a, *.1, *.10, *.11, *.12, *.13, *.15, *.15 patch, *.16, *.18, *.19, *.1_gm, *.2, *.2-tower, *.23037, *.26, *.26498, *.27, *.2ceping, *.3, *.30, *.31, *.35, *.39-19980327, *.39-19980406, *.39-19980414, *.39-19980506, *.39-19980529, *.39-19980611, *.39-19980616, *.39-19980623, *.39-19980625, *.39-19980706, *.3ce-tp1, *.3ceconan, *.3cekicker, *.3cesweetandsour, *.4, *.4-temple, *.40, *.41, *.42, *.4315, *.5, *.5512, *.6, *.6-1-default, *.6-import, *.62, *.64, *.7, *.7-1-default, *.8, *.9, *.9-fix-ubuntu, *.92, *.93, *.94, *.95, *.96, *.98, *.99, *.: call of pripyat, *.: clear sky, *.: shadow of chernobyl, *.aix433, *.announce, *.autoinstall, *.browser, *.check_cache, *.com - 1nsane, *.com - a new beginning: final cut, *.com - advent rising, *.com - age of wonders, *.com - age of wonders 2: the wizard's throne, *.com - age of wonders: shadow magic, *.com - alien breed and tower assault, *.com - alone in the dark 1, *.com - alone in the dark 2, *.com - alone in the dark 3, *.com - alpha centauri, *.com - anachronox, *.com - another world: 15th anniversary edition, *.com - anvil of dawn, *.com - arcanum, *.com - arma: cold war assault, *.com - arx fatalis, *.com - avadon: the black fortress, *.com - avernum: the complete saga, *.com - back to the future the game, *.com - baldur's gate ii complete, *.com - baldur's gate: the original saga, *.com - balls of steel, *.com - betrayal at krondor pack, *.com - beyond divinity, *.com - beyond good and evil, *.com - bioforge, *.com - blackwell bundle, *.com - blade of darkness, *.com - blood 2: the chosen and the nightmare levels, *.com - bloodrayne, *.com - broken sword 1: shadow of the templars, *.com - 
broken sword 2: the smoking mirror, *.com - broken sword 3: the sleeping dragon, *.com - broken sword 4: the angel of death, *.com - caesar iii, *.com - cannon fodder, *.com - carmageddon max pack, *.com - catacombs pack, *.com - chronicles of riddick: assault on dark athena, *.com - clive barker's undying, *.com - commandos 2 and 3, *.com - commandos ammo pack, *.com - conflict: desert storm, *.com - crusader: no regret, *.com - crusader: no remorse, *.com - daikatana, *.com - dark fall: lights out, *.com - dark fall: the journal, *.com - dark reign and expansion, *.com - darklands, *.com - darkstone, *.com - deponia, *.com - descent 3, *.com - deus ex 2: invisible war, *.com - deus ex goty edition, *.com - disciples 2 gold, *.com - divine divinity, *.com - dragonsphere, *.com - dreamfall: the longest journey, *.com - duke nukem 1 and 2, *.com - duke nukem 3d atomic edition, *.com - duke nukem: manhattan project, *.com - dungeon keeper, *.com - dungeon keeper 2, *.com - dungeons and dragons: dragonshard, *.com - earthworm jim 1 and 2, *.com - enclave, *.com - evil genius, *.com - fallout, *.com - fallout 2, *.com - fallout tactics, *.com - far cry, *.com - far cry 2, *.com - fez, *.com - fez patch, *.com - flatout, *.com - forgotten realms: demon stone, *.com - freespace 2, *.com - gabriel knight 1: sins of the fathers, *.com - gabriel knight 2: the beast within, *.com - gabriel knight 3: blood of the sacred, of the damned, *.com - gemini rue, *.com - giants: citizen kabuto, *.com - gobliiins pack, *.com - gothic, *.com - gothic 2 gold, *.com - gothic 3, *.com - ground control 2, *.com - guilty gear x2 reload, *.com - heroes of might and magic 3 complete, *.com - heroes of might and magic 3 hd mod, *.com - hitman 2: silent assassin, *.com - hitman: codename 47, *.com - hotline miami, *.com - i have no mouth and i must scream, *.com - icewind dale complete, *.com - icewind dale ii complete, *.com - imperialism 2: the age of exploration, *.com - in cold blood, *.com 
- incoming and incoming forces, *.com - incredipede, *.com - iron storm, *.com - ishar compilation, *.com - jack orlando: a cinematic adventure, *.com - jagged alliance 2, *.com - kingpin: life of crime, *.com - knights and merchants: the peasants rebellion, *.com - lands of lore 1 and 2, *.com - lands of lore 3, *.com - legacy of kain: defiance, *.com - legacy of kain: soul reaver, *.com - legacy of kain: soul reaver 2, *.com - legend of grimrock, *.com - lionheart: legacy of the crusader, *.com - little big adventure, *.com - little big adventure 2, *.com - lure of the temptress, *.com - magic carpet, *.com - magic carpet 2: the netherworlds, *.com - magrunner: dark pulse, *.com - master of magic, *.com - master of orion 1 and 2, *.com - mdk, *.com - mdk 2, *.com - medal of honor: allied assault war chest, *.com - megarace, *.com - might and magic 6-pack, *.com - might and magic 7, *.com - might and magic 8, *.com - moonbase commander, *.com - myst uru complete chronicles, *.com - neighbours from hell compilation, *.com - neverwinter nights 2: complete, *.com - neverwinter nights diamond edition, *.com - nexus: the jupiter incident, *.com - normality, *.com - nox, *.com - oddworld: abe's exoddus, *.com - oddworld: abe's oddysee, *.com - one unit whole blood, *.com - original war, *.com - outcast, *.com - outcast: high-resolution patch, *.com - painkiller: black edition, *.com - papers, please, *.com - pathologic classic hd, *.com - perimeter, *.com - personal nightmare, *.com - pharaoh and cleopatra, *.com - pirates pack, *.com - planescape: torment, *.com - pod gold, *.com - populous 2, *.com - populous 3: the beginning, *.com - populous 3: the beginning - high-resolution patch, *.com - postal 2 complete, *.com - praetorians, *.com - prince of persia: the sands of time, *.com - pro pinball timeshock, *.com - psychonauts, *.com - puddle, *.com - quest for glory, *.com - rayman forever, *.com - rayman origins, *.com - realms of arkania 1 and 2, *.com - realms of 
arkania 3, *.com - realms of the haunting, *.com - redneck rampage collection, *.com - resonance, *.com - return to mysterious island, *.com - riven: the sequel to myst, *.com - rollercoaster tycoon 2, *.com - rollercoaster tycoon 3, *.com - runaway 3: a twist of fate, *.com - runaway: a road adventure, *.com - s2: silent storm gold edition, *.com - sacred gold, *.com - sacrifice, *.com - sam and max save the world, *.com - sam and max: beyond time and space, *.com - sam and max: the devils playhouse, *.com - sanitarium, *.com - screamer, *.com - screamer 2, *.com - sensible world of soccer 96-97, *.com - septerra core: legacy of the creator, *.com - seven kingdoms: ancient adversaries, *.com - shadow warrior complete, *.com - shadowgrounds, *.com - sherlock holmes: secret of the silver earring, *.com - shogo: mobile armor division, *.com - silver, *.com - simcity 2000 special edition, *.com - simon the sorcerer, *.com - simon the sorcerer 2, *.com - slipstream 5000, *.com - space quest 1, 2, 3, *.com - space quest 4, 5, 6, *.com - space rangers 2: reboot, *.com - splinter cell, *.com - spycraft: the great game, *.com - star wolves, *.com - starflight 1 and 2, *.com - startopia, *.com - still life, *.com - stonekeep, *.com - stronghold hd, *.com - syberia, *.com - syberia 2, *.com - syndicate, *.com - system shock 2, *.com - tales of monkey island, *.com - teenagent, *.com - temple of elemental evil, *.com - temple of elemental evil patch co8 modpack 7, *.com - tex murphy: the pandora directive, *.com - tex murphy: under a killing moon, *.com - the book of unwritten tales, *.com - the book of unwritten tales: the critter chronicles, *.com - the chaos engine remastered, *.com - the dark eye: chain of satinav, *.com - the feeble files, *.com - the incredible machine mega pack, *.com - the interstate 76 arsenal, *.com - the last express, *.com - the longest journey, *.com - the nations gold edition, *.com - the settlers 2: 10th anniversary, *.com - the whispered world 
special edition, *.com - the witcher enhanced edition directors cut, *.com - theme hospital, *.com - thief 2, *.com - thief 3: deadly shadows, *.com - thief gold, *.com - tom clancy's ghost recon, *.com - torchlight, *.com - torins passage, *.com - total annihilation: commander pack, *.com - total annihilation: kingdoms, *.com - tropico 3 gold edition, *.com - tropico reloaded, *.com - two worlds: epic edition, *.com - tyrian 2000, *.com - ufo: afterlight, *.com - ufo: aftermath, *.com - ultima 1, 2, 3, *.com - ultima 4, *.com - ultima 4, 5, 6, *.com - ultima 8 gold edition, *.com - ultima worlds of adventure 2: martian dreams, *.com - unreal 2, *.com - unreal gold, *.com - unreal tournament 2004 ece, *.com - unreal tournament goty, *.com - urban chaos, *.com - wallace and gromits grand adventures, *.com - warlords battlecry 3, *.com - wing commander 1 and 2, *.com - wing commander 3: heart of the tiger, *.com - wing commander: privateer, *.com - wizardry 8, *.com - worlds of ultima: the savage empire, *.com - xiii, *.com - zeus and poseidon, *.com - zork grand inquisitor, *.default, *.defs-example, *.digest-sha1, *.disable_highdpi, *.dynlist, *.example, *.first-stage, *.firstboot, *.gpl-3, *.gpl-except, *.gpl3-except, *.graphml, *.includecache, *.installer, *.isolate, *.json-array, *.json-array_pretty, *.lesserv2, *.lesserv3, *.lgpl-nogpl2, *.lgpl-only, *.lgpl3-comm, *.libpath, *.markdown, *.net edition, *.nothing, *.os2, *.profile, *.program, *.ps1, *.q42, *.qt-license-agreement, *.s390, *.second-stage, *.security-checksig, *.service, *.settings, *.sha256, *.sha256sum, *.solaris, *.sqlite-shm, *.sqlite-wal, *.template, *.toplevel, *.tracepoints, *.user_list, *.utf-8, *.valgrind, *.vbox-prev, *.web-page-replay, *.win32, *.windows, *.x11, *.x86, *.x86_64, *.xcscheme, *.xcsettings ```

Yikes; still 416 suffixes, and many of them really weird. Duh.

shundhammer commented 4 years ago

Just looking at the treemap also gives some hints: All the grey areas are either directories with a multitude of tiny files (so small that they are not rendered in the treemap because performance would severely suffer) or files in the Other category a.k.a. "we have no clue what all that stuff is".

shundhammer commented 4 years ago
  1. Would it be possible to list all the files under "No Extension" - it seems that the "Locate" and "size statistics" options for these files are disabled?

Yes, locating them is disabled for several reasons:

That was pretty much the trade-off to get that file type feature in the first place: Keep it simple and apply the concept only to those files that can reasonably be identified and put all the rest into the "great unknown" category simplistically named "No Extension" in the File Type window.

The ugly truth is it's not exactly all just "No Extension", it also includes "weird extensions". This tidbit of information is kind of held back from normal users because it would just confuse most of them with little gain for anyone.


$ find \( -not -iname '*.*' -o -iname '.*' \) -not -type d -fprintf noextension.csv '%s,%h,%f\n'

This yields 82,658 files. However according to QDirStat there are actually 83,487 files in my home directory.

The difference might (!) be because of the cruft files. They have extensions, albeit typically very weird ones. Many of them might simply be unknown to QDirStat's preconfigured MIME categories.
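To illustrate how such a discrepancy can arise, here is a minimal Python sketch. `matched_by_find` mimics only the name tests of the find expression quoted above (no dot anywhere in the name, or a leading dot). `CRUFT_SUFFIXES` is a hand-picked sample from the log listings earlier in this thread, and treating every name that ends in one of them as part of the "No Extension" category is an assumption for illustration, not QDirStat's actual code.

```python
# Name tests of the quoted find expression:
#   ( -not -iname '*.*' -o -iname '.*' )
def matched_by_find(name: str) -> bool:
    return "." not in name or name.startswith(".")

# Small sample taken from the removeCruft() log listings in this thread.
CRUFT_SUFFIXES = (".debian-source", ".sqlite-wal", ".0-96-generic")

def merged_as_cruft(name: str) -> bool:
    # Assumption: a name counts as cruft if it ends in a listed suffix.
    return name.lower().endswith(CRUFT_SUFFIXES)

# Hypothetical file names, for illustration only.
names = ["Makefile", ".bashrc", "places.sqlite-wal", "photo.jpg"]

# Files that would land in the merged category but that find never reports.
missed = [n for n in names if merged_as_cruft(n) and not matched_by_find(n)]
print(missed)
```

Under these assumptions, any file with a cruft suffix contains a dot and therefore fails the `-not -iname '*.*'` test, so it is counted on one side only.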

  2. Related to this, is it possible to make a way to quickly export/copy & paste all the information shown in the "Locate files by type" window for a given type (I guess some kind of text or csv format would be most convenient for including all the columns) - that way it could be saved to a text file or spreadsheet for further manipulation?

That window contains only the directories where they are. Is this what you want? Or would you rather like to have a complete list of all the files in that category?

  3. Finally, I note your comments about 'cruft' files. Is there a convenient way to do a similar analysis of these - perhaps give them all a separate category and again a way to see number of cruft files and total size wasted in cruft files per directory, which might be useful in trying to clean up some of them?

A very pedestrian approach to that would be grepping the QDirStat log for that line listing the extensions that were considered cruft, splitting it up like I did in my previous comments (a very simple sed line) and then invoke find with that.
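The grep / split / find steps could be sketched roughly like this in Python. The function names `cruft_patterns` and `find_args` are made up for this example. The log line format is taken from the `removeCruft()` lines shown earlier in the thread, and it is an assumption that it looks the same in other QDirStat versions.

```python
import re

# Text that precedes the pattern list in the removeCruft() log line.
MARKER = "suffixes to <NO SUFFIX>:"

def cruft_patterns(log_line: str) -> list[str]:
    """Extract the '*.suffix' patterns from one removeCruft() log line."""
    _, found, tail = log_line.partition(MARKER)
    if not found:
        return []
    # Drop a trailing ", ..." left by a truncated listing.
    tail = re.sub(r",\s*\.\.\.\s*$", "", tail.strip())
    # Split only where a new "*." pattern starts; suffixes may contain commas.
    parts = re.split(r",\s+(?=\*\.)", tail)
    return [p.strip() for p in parts if p.startswith("*.")]

def find_args(patterns: list[str], root: str = ".") -> list[str]:
    """Build: find ROOT ( -iname P1 -o -iname P2 ... ) -type f"""
    if not patterns:
        raise ValueError("no patterns given")
    args = ["find", root, "("]
    for i, pattern in enumerate(patterns):
        if i:
            args.append("-o")
        args += ["-iname", pattern]
    return args + [")", "-type", "f"]
```

Passing the resulting list to `subprocess.run()` avoids shell quoting problems with the odder suffixes (spaces, colons, apostrophes), because no shell is involved.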

shundhammer commented 4 years ago

You just did something really naughty: You raised my interest. :smiley:

Yes, indeed: What the hell are all those cruft files? How can we get more information about them?

But rather than just starting some creative scripting, QDirStat might have something built in that could be pressed into service.

You might or might not have read about the Packages view in QDirStat where it shows what software package each file belongs to. There is also its counterpart Unpackaged Files that shows the opposite: "What files do not belong to an installed software package?"

Sounds unrelated? Bear with me.

That view is built all around one feature that was really expensive and hard to implement: Ignoring files in the directory tree.

[Image: QDirStat-unpkg-usr-share-qt5]

In this case, it ignores files that are known to belong to an installed software package. All directories that only contain ignored files are shown dimmed (light grey) in the tree view, and the files are not displayed in the treemap, leaving only the unpackaged files, i.e. those that are not part of an installed package.

shundhammer commented 4 years ago

If we want to know more about cruft files or files with no filename extension, how about doing something similar for them? Rebuild the entire directory tree and ignore files that belong to a well-known MIME category?

That would leave the entire remaining treemap uncolored because color there means we know what a file is. But it would make it so much easier to spot cruft files that are worth taking care of, i.e. large ones.
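As a rough illustration of that idea (not of how QDirStat would implement it), the following Python sketch drops every file whose suffix belongs to a known category and lists the remainder largest first. `KNOWN_SUFFIXES` is a tiny made-up sample standing in for the configured MIME categories, and `unknown_files` is a hypothetical helper name.

```python
import os

# Hypothetical stand-in for the suffixes covered by the MIME categories.
KNOWN_SUFFIXES = frozenset({".jpg", ".png", ".mp3", ".cpp", ".h"})

def unknown_files(files, known=KNOWN_SUFFIXES):
    """files: iterable of (path, size_in_bytes).

    Returns the files whose suffix is not in `known`, largest first.
    Names without any suffix have the suffix "" and are therefore kept.
    """
    rest = [
        (path, size)
        for path, size in files
        if os.path.splitext(path)[1].lower() not in known
    ]
    return sorted(rest, key=lambda item: item[1], reverse=True)
```

Sorting by size puts the files that are worth a closer look at the top, which is the point of the proposed view.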

shundhammer commented 4 years ago

One major problem with that is that a large number of files without suffix are executables: Binaries or scripts in a plethora of different scripting languages.

To be of any use, that stuff would need to be filtered out as well.

This reopens that old discussion of using the file command or doing something similar which comes down to reading at least a portion of the file and analyzing with some heuristics what it might be.

This is an expensive operation; in the case of a root filesystem, that would mean reading those 33.94% of all files, 28338 in the above case. Yes, it's just a partial read of the first few blocks. Still, that's 28k open(), read() and close() syscalls each. That will take quite a while, no matter whether this is delegated to an external file command whose output is parsed or reimplemented internally (using libmagic, which does that for the file command).
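To make the cost concrete, this is roughly what a bare-bones content sniff looks like in Python. It is a deliberately tiny sketch that knows only two signatures (the ELF magic bytes and a `#!` shebang) and is no substitute for libmagic. The names `classify_header` and `sniff` are invented for this example.

```python
def classify_header(head: bytes) -> str:
    """Guess a coarse file type from the first bytes of a file."""
    if head.startswith(b"\x7fELF"):
        return "elf"        # ELF binaries start with 0x7f 'E' 'L' 'F'
    if head.startswith(b"#!"):
        return "script"     # interpreter line, e.g. #!/bin/sh
    return "unknown"

def sniff(path: str, nbytes: int = 4) -> str:
    """One open(), one short read() and one close() per file."""
    try:
        with open(path, "rb") as f:
            head = f.read(nbytes)
    except OSError:
        return "unreadable"
    return classify_header(head)
```

Even this minimal version has to touch every candidate file once, which is where the cost described above comes from.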

Hm. Not good.

shundhammer commented 4 years ago

Let me think about this for a while.

In the meantime, please elaborate some more on your use case. Is it more than just curiosity and exploring the unknown? This is where the ignored view would come in, and I am very much inclined to pursue that further.

Or is it a very specific use case? If so, please describe it.

elicoten commented 4 years ago

Thanks for your thoughts on this.

The log tells us a bit more about the cruft files:

That may prove to be very useful, at least for identifying them.

The ugly truth is it's not exactly all just "No Extension", it also includes "weird extensions". This tidbit of information is kind of held back from normal users because it would just confuse most of them with little gain for anyone.

Interesting, that might well explain the discrepancy.

In most cases, there are way too many of them in way too many different directories. You'd get a window with basically every directory in the tree listed, making it hard to navigate there, and performance of that Locate by Type window would become really poor.

I see what you mean, but it would still be useful to be able to find the directories with the largest number of cruft files, or probably more usefully, the largest amount of space taken up / wasted by them. Perhaps it could show the top 50 or 100 directories, similar to the way it does the top 20 for the file extensions. (Not wanting to get side-tracked, but I would find it useful to be able to configure that number for the uncategorised file types as well.)

That window contains only the directories where they are. Is this what you want? Or would you rather like to have a complete list of all the files in that category?

Directories (with the associated statistics - number of files and total size of those files) would be a good starting point. I wouldn't say no to being able to drill down to get/export/copy-paste the entire list of individual files as well, but if I had to choose files or directories I'd go with directories, not least because it's consistent with the UI. Another reason is that on my system I think the list of files would be too large to do anything useful with. Before playing around with QDirStat, I tried a find command just to write the name and size of every file on the system into an enormous text file in the hope of playing around with it in a spreadsheet (pivot tables, etc.), but there were way too many files for my spreadsheet to cope with, so I abandoned that idea and now use the spreadsheet only for smaller analyses on a smaller number of files.

A very pedestrian approach to that would be grepping the QDirStat log ... and then invoke find with that.

Yes that's exactly what I'm thinking I might try, though as previously discussed the result might not be that useful. I'm making the following numbers up, but a list of, say, 400 directories is easier to work with than a list of 80,000 files!

Yes, indeed: What the hell are all those cruft files? How can we get more information about them?

That is definitely a part of the question I'm trying to answer (see more detail below).

Rebuild the entire directory tree and ignore files that belong to a well-known MIME category?

This would definitely be useful - in fact, more useful than the "locate by type" option. If you did implement this, you could make the feature more discoverable, and hence useful to a wider number of people, by having a message box pop up when the user tries the "locate" option on the "No Extension" line in the "Locate files by type" dialog we were originally discussing. The message box could direct the user towards this feature, i.e. suggest that they re-scan the directory tree ignoring files of known MIME categories.

That would leave the entire remaining treemap uncolored

Maybe a different device (i.e. other than colours) could be employed to give an indication of file sizes, or perhaps various shades of grey or patterns... or maybe colours could be used here but given a different meaning, though I'm not sure if that's good UX practice. Hmm, these are just random thoughts; I'm not certain how useful they would be.

One major problem with that is that a large number of files without suffix are executables

That isn't necessarily a problem. After all, these files are still taking up space on the disk, and they are also files that cannot easily be categorised by file extension. If for some reason /usr/bin happens to be taking up an enormous amount of space (or, more likely, some kind of self-extracting package that is part script, part binary), I still want to know which files are the culprits. Having said that, perhaps directories that are likely to contain executables could be coloured or otherwise displayed differently in the treemap. Or could executables be identified from their file permissions (i.e. any file with the executable bit set)?
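A permission-based check of that kind needs only the file's mode, which a stat()-style call returns without reading any file contents. A minimal Python sketch follows; the helper names `has_exec_bit` and `is_executable_file` are invented here, and whether "any execute bit set" is the right criterion is an open assumption.

```python
import os
import stat

def has_exec_bit(mode: int) -> bool:
    """True for a regular file with the user, group or other execute bit set."""
    any_x = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return stat.S_ISREG(mode) and bool(mode & any_x)

def is_executable_file(path: str) -> bool:
    # lstat() does not follow symlinks, so a symlink itself is never reported.
    return has_exec_bit(os.lstat(path).st_mode)
```

Note that the execute bit says nothing about what the file actually contains; on filesystems mounted with permissive default modes, every file may appear executable.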

This reopens that old discussion of using the file command or doing something similar which comes down to reading at least a portion of the file and analyzing with some heuristics what it might be.

Clearly it would take too long to do this for every cruft file, but perhaps we could do this just for the largest individual files (e.g. the largest 100 cruft files; again, perhaps it would be good to be able to configure this, or to run it on demand, e.g. "categorise the next 50 uncategorised files" using the file command or similar). Then we might need another way to identify which directories have the largest total space occupied by cruft files, which would help at least pinpoint a directory that contains a very large number of tiny uncategorisable files. Then the user might be able to make a decision about them based on which directories are the biggest contributors. If I have 400 directories containing cruft files, but 20 of those directories are occupying 90% of the space, then obviously I want to focus on those 20 directories, and hence it would be useful to know which directories those are.
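The per-directory ranking described here can be sketched in a few lines of Python. The input is assumed to be a list of (path, size) pairs for the cruft files, however they were obtained; `top_directories` is a hypothetical helper, not an existing QDirStat feature.

```python
import os
from collections import defaultdict

def top_directories(files, n=20):
    """files: iterable of (path, size_in_bytes).

    Returns (top, share): `top` is a list of (directory, file_count,
    total_bytes) for the n directories with the most bytes, and `share`
    is the fraction of all bytes that those n directories account for.
    """
    per_dir = defaultdict(lambda: [0, 0])   # directory -> [count, bytes]
    for path, size in files:
        entry = per_dir[os.path.dirname(path)]
        entry[0] += 1
        entry[1] += size

    ranked = sorted(per_dir.items(), key=lambda kv: kv[1][1], reverse=True)
    total = sum(nbytes for _, (_, nbytes) in ranked)
    top = [(d, count, nbytes) for d, (count, nbytes) in ranked[:n]]
    share = sum(nbytes for _, _, nbytes in top) / total if total else 0.0
    return top, share
```

A `share` close to 0.9 for `n=20` would correspond to the "20 directories occupying 90% of the space" situation.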

In the meantime, please elaborate some more on your use case.

I have a 1.5Tb hard drive and my home directory is occupying almost 2/3 of that space. Within that is 40Gb of 'No Extension' files! That's quite a lot of space that's potentially unaccounted for.

I'm trying to work out:

If it turns out to be mostly static stuff like photos and videos, I might buy a second hard drive, copy them onto it and leave it in a remote location.

How many 'document'-type files (text/word processing, spreadsheets, etc) do I have, that are small but might change more frequently. These might be good candidates for a cloud storage service. But upload bandwidth is very limited (1Mbps) so I have to be very selective about what I store in the cloud if I don't want to be waiting many years.

I may find quite a lot of files that I don't need to back up at all, e.g. cache/binaries/system files.

I realise that's quite a tall order which is why I set out writing scripts to run find. But I think QDirStat can help to answer quite a large proportion of these questions, so in a nutshell, can I delete those 40Gb of files? Well even if I could I need to find them first to be able to delete them. If not, what is the nature of them? Do they need to be backed up? Can they be compressed to save space? Is that just something I'll have to put up with?

Thanks again for your thoughts and comments on this - it's all very much appreciated.

shundhammer commented 4 years ago

Before this goes off-track a bit too far: Those "cruft" files are not to be confused with "junk" files. If disk space is running low (or you are just being extra tidy), it makes perfect sense to get rid of "junk" files such as editor backup files (*.bak, *.auto, *~), core dumps etc.; they are redundant most of the time and only rarely useful.
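As a sketch of how such junk files could be listed from a shell, using only the three patterns named above: this assumes GNU find, and `list_junk` is a made-up name. Core dumps are left out because their file names depend on system configuration.

```shell
#!/bin/sh
# Sketch only: list editor backup files matching *.bak, *.auto or *~,
# as "bytes<TAB>path", largest first. Nothing is deleted.
# Assumes GNU find. Paths containing newlines are not handled.
list_junk() {
    dir=${1:-.}
    find "$dir" -type f \
        \( -name '*.bak' -o -name '*.auto' -o -name '*~' \) \
        -printf '%s\t%p\n' |
        sort -rn
}
```

The output is only a list to review; deciding what to remove stays with the user.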

All those weird filename extensions in the Details lists above are simply Linux developers being a bit too creative with naming their files. On MS-DOS and thus on all types of MS Windows, filename extensions have clear implications, so it's not advisable to misappropriate the concept for general separator characters in filenames. On Linux / Unix, however, it's little more than a convention (still, it tends to confuse a lot of tools if dots are used as just another character in filenames).

So, the basic problem here is not so much about those "cruft" files that use things that look like filename extensions, but are often just a regular part of the filename.

Those files are really not different from files without any extension: We simply don't know what they are. A completely unrelated question is what purpose they serve and if it's safe or advisable to delete them; just because a large directory tree has tons of .png or .mp4 files that doesn't mean that I know if they are useful to me, or if I could get rid of them. They might be part of a game that I like, and the game might no longer work when I remove them (or, more realistically, the game starter will simply download them again, and the disk space is consumed again).

The relevant information here is context:

What directory tree are those files in?

Is it something that I created manually? Then I should know what it is and if it's worth keeping or backing up.

Is it something that belongs to some piece of software that I use? Enter the Packages View.

Or is it a game that I downloaded, and my system's package manager doesn't know anything about it? Or an application that I built and installed manually? (make && sudo make install) Or one of those shrink-wrap-the-world package formats like "snap" or "flatpak"? Or a docker container? Or a virtual machine?

This all comes down to having context knowledge and making conscious decisions as a human. A tool like QDirStat can only deliver technical information; the user has to decide what to do with that information.

The design decision here for QDirStat is what amount of additional information can be made accessible to the user, and how useful that information might be; also, how to obtain and visualize it while maintaining good usability.

elicoten commented 4 years ago

Thanks for sharing further thoughts on this. Just to clarify, a lot of what I said in my previous message was to explain the background behind the feature request, e.g.

I set out writing scripts to run find. But I think QDirStat can help to answer quite a large proportion of these questions, so in a nutshell, can I delete those 40Gb of files? Well even if I could I need to find them first to be able to delete them. If not, what is the nature of them? Do they need to be backed up? Can they be compressed to save space? Is that just something I'll have to put up with?

These were meant as rhetorical questions rather than questions for you or QDirStat to answer directly.

To sum up: I think that the feature you previously suggested about showing a treemap which includes only files that do not belong to a well-known MIME category would be immensely useful in providing some of the contextual details necessary to begin thinking about some of these questions.

I think that suggestion of rebuilding the whole treemap for just the uncategorised files is even better than the "locate files by category" feature that I originally enquired about. And yes, by uncategorised files, I mean to include both files with no extension as well as those that have an extension which does not have a category.

shundhammer commented 4 years ago

Okay, so let me think about refining that thought further: Ignoring known MIME categories to get a tree and a treemap that shows only those files that we don't know anything about.

Side note: In the QDirStat classes, ignored files are put into a special place called attic.

For a first shot (that might be quite easy to implement - let's see), the leftover files will probably also contain executables and tons of scripts. But maybe there will be an option to also ignore files that are both executable and in one of the well-known system directories like /bin, /usr/bin, /lib, /usr/lib. That might put a whole lot of stuff out of the way.
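The filtering rule described here can be imitated with find to get a feel for what would be left over. This is a sketch under stated assumptions, not QDirStat code: it needs GNU find, the suffix list is a tiny hypothetical stand-in for the real MIME categories, and the names `KNOWN_SUFFIXES`, `SYSTEM_DIRS` and `leftover_files` are invented for the example.

```shell
#!/bin/sh
# Sketch only. Hypothetical stand-ins: the real category list in QDirStat
# is configurable and much longer than these few suffixes.
KNOWN_SUFFIXES=${KNOWN_SUFFIXES:-"jpg png mp4 txt pdf"}
SYSTEM_DIRS=${SYSTEM_DIRS:-"/bin /usr/bin /lib /usr/lib"}

# Print "bytes<TAB>path" for every regular file under the given root that
#   - does not end in one of the known suffixes (case-insensitive), and
#   - is not (executable AND located under one of the system directories).
# Largest files first. Paths containing newlines are not handled.
leftover_files() {
    root=${1:-.}
    set --                      # build the find expression in "$@"
    for s in $KNOWN_SUFFIXES; do
        set -- "$@" -not -iname "*.$s"
    done
    for d in $SYSTEM_DIRS; do
        # Drop a file only if it is BOTH under $d AND has an executable bit.
        set -- "$@" -not \( -path "$d/*" -perm /111 \)
    done
    find "$root" -type f "$@" -printf '%s\t%p\n' | sort -rn
}
```

Executables outside the listed system directories stay in the output, which matches the "both executable and in a well-known system directory" condition.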

elicoten commented 4 years ago

Thanks, that all sounds very useful

shundhammer commented 3 years ago

Moving to ideas document.