Export statistics results in CSV, JPG or PDF files

livmion commented 3 years ago

Dear Stefan,

I would like to know whether it would be possible to add a feature to export the statistics displayed in QDirStat to CSV, JPG or PDF file formats.

I am currently working for a research institute and we are looking for a user-friendly alternative to TreeSize on Linux. Your application is great and would satisfy all of our needs, but we also need to compile reports with statistics attached. For me it is not a big problem to use ls or du, but for my colleagues it is different, and they would surely appreciate such a solution.

Thank you in advance for your attention,

Best Regards

Alessio Paonessa

shundhammer commented 3 years ago

This depends on what the user is actually interested in.

Exporting to simple text format and to CSV is easy; but export exactly what? A tree view like in QDirStat contains a ton of information, but it becomes accessible only with most of it hidden; i.e. most branches of that tree are collapsed. If everything were expanded at once, the list would become looong, producing a gazillion lines in a report file; or dozens of pages in a PDF.

It's all very easy as long as all tree branches are neatly collapsed:

src-qdirstat-collapsed

But if all the branches are expanded so you can do any processing on the data with other tools, it becomes unwieldy to the point of being pretty unusable - look at the scroll bars to get an impression how long the whole thing becomes:

src-qdirstat-expand-to-level-5

...and this is only a pretty small tree with only 712 files and directories in total. Imagine what it looks like for nontrivial directory trees.

Even if only one tree level more than just the toplevel is expanded for all branches, it's already barely usable:

src-qdirstat-expand-to-level-2

It becomes usable only when manually opening only those branches that you are currently interested in:

src-qdirstat-expand-src

And this is where the power of a tree view lies: In hiding most of the information because it's typically irrelevant to the task at hand. Those collapsible and expandable trees are useful because they allow you to selectively expand and collapse individual branches, not dumping just everything on you.

But when exporting the tree to a file or to a printable document like a PDF, there is little choice other than just exporting everything, i.e. expanding all branches and handing over the responsibility to the user (or to the next software the user chooses to use).

shundhammer commented 3 years ago

Having said that, there already is an exporter tool: The qdirstat-cache-writer.

With the -l (long format) command line option, the result is really easy to parse because every line contains the complete path:

qdirstat-cache-writer -l ~/src/qdirstat /tmp/src-qdirstat.txt

head -n 30 /tmp/src-qdirstat.txt

[qdirstat 1.0 cache file]
# Generated by qdirstat-cache-writer
# Do not edit!
#
# Type  path            size    mtime           <optional fields>

D /work/home/sh/src/qdirstat    4096    0x615376c7
# Device: /dev/nvme0n1p5

F /work/home/sh/src/qdirstat/.qmake.stash   739 0x612abb77
F /work/home/sh/src/qdirstat/.gitignore 22  0x5fb9043d
F /work/home/sh/src/qdirstat/LICENSE    18092   0x6151907d
F /work/home/sh/src/qdirstat/qdirstat.pro.user  20401   0x6151907d
F /work/home/sh/src/qdirstat/qdirstat.pro   832 0x6151907d
F /work/home/sh/src/qdirstat/README.md  44777   0x615376c7
F /work/home/sh/src/qdirstat/Makefile   35198   0x61519093
D /work/home/sh/src/qdirstat/src    12288   0x615376da
F /work/home/sh/src/qdirstat/src/OpenDirDialog.cpp  9284    0x61209c5b
F /work/home/sh/src/qdirstat/src/ui_mime-category-config-page.h 12390   0x615376d1
F /work/home/sh/src/qdirstat/src/HeaderTweaker.h    4988    0x5fb9043d
F /work/home/sh/src/qdirstat/src/SizeColDelegate.cpp    6789    0x61004835
F /work/home/sh/src/qdirstat/src/SystemFileChecker.h    1380    0x5fb9043d
F /work/home/sh/src/qdirstat/src/ui_filesystems-window.h    5155    0x612abb7a
F /work/home/sh/src/qdirstat/src/PkgQuery.cpp   4632    0x5fb9043d
F /work/home/sh/src/qdirstat/src/UnreadableDirsWindow.cpp   6967    0x610fa3d0
F /work/home/sh/src/qdirstat/src/file-type-stats-window.ui  3007    0x5fb9043d
F /work/home/sh/src/qdirstat/src/PercentileStats.cpp    4394    0x5fb9043d
F /work/home/sh/src/qdirstat/src/History.h  5064    0x6103d78e
F /work/home/sh/src/qdirstat/src/Cleanup.h  14285   0x6151907d

The file format is specified here:

https://github.com/shundhammer/qdirstat/blob/master/doc/cache-file-format.txt
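For readers who need this long-format output as CSV, the conversion can be sketched in a few lines of Python. This is only a minimal sketch and not part of QDirStat: the names `parse_cache_lines` and `write_csv` are made up for illustration. It assumes whitespace-separated fields and plain byte counts as in the sample above, assumes that paths are URL-escaped (please check the format specification linked above), and ignores any optional fields.

```python
import csv
import io
from datetime import datetime, timezone
from urllib.parse import unquote


def parse_cache_lines(lines):
    """Yield (type, path, size, mtime) tuples from long-format cache lines.

    Assumption: fields are whitespace-separated and paths are URL-escaped,
    so a path never contains a literal blank. Optional fields are ignored.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and the "[qdirstat ... cache file]" header
        if not line or line.startswith("#") or line.startswith("["):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        kind, path, size, mtime = fields[:4]
        # Base 0 accepts both hex ("0x615376c7") and plain decimal numbers
        yield kind, unquote(path), int(size), int(mtime, 0)


def write_csv(records, out):
    """Write the records as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["type", "path", "size_bytes", "mtime_utc"])
    for kind, path, size, mtime in records:
        stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
        writer.writerow([kind, path, size, stamp.strftime("%Y-%m-%d %H:%M:%S")])


sample = """\
[qdirstat 1.0 cache file]
# Generated by qdirstat-cache-writer

D /work/home/sh/src/qdirstat\t4096\t0x615376c7
F /work/home/sh/src/qdirstat/LICENSE\t18092\t0x6151907d
"""

buffer = io.StringIO()
write_csv(parse_cache_lines(sample.splitlines()), buffer)
print(buffer.getvalue())
```

For a real file, the same two functions could be fed with `open("/tmp/src-qdirstat.txt")` and an output file opened with `newline=""`, as the Python `csv` module recommends.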

Notice that this contains only the plain data, no aggregated / accumulated values; i.e. no sums per directory branch, no oldest / newest timestamp per branch etc.
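The missing sums per branch can be computed after the fact. The following is a minimal sketch under stated assumptions, not QDirStat's own algorithm: `accumulate` is a hypothetical helper that takes already parsed (type, path, size, mtime) tuples, adds each entry's size to all of its ancestor directories, and counts a directory's own size toward its own branch.

```python
from collections import defaultdict
import posixpath


def accumulate(records):
    """Sum sizes and track the newest mtime for every directory branch.

    `records` is an iterable of (type, path, size, mtime) tuples where
    type is "D" for a directory or "F" for a file. Totals are propagated
    all the way up to "/", including directories above the scanned root.
    """
    totals = defaultdict(int)
    newest = defaultdict(int)
    for kind, path, size, mtime in records:
        # A directory counts toward its own branch; a file only toward parents
        branch = path if kind == "D" else posixpath.dirname(path)
        while True:
            totals[branch] += size
            newest[branch] = max(newest[branch], mtime)
            parent = posixpath.dirname(branch)
            if parent == branch:  # reached the top ("/" or "")
                break
            branch = parent
    return totals, newest


# Made-up example data, not taken from a real scan
records = [
    ("D", "/src", 4096, 100),
    ("F", "/src/a.txt", 10, 200),
    ("D", "/src/sub", 4096, 150),
    ("F", "/src/sub/b.txt", 20, 300),
]

totals, newest = accumulate(records)
for directory in sorted(totals):
    print(directory, totals[directory], newest[directory])
```

Whether a directory's own size should be part of its total is a design choice; it is included here so that the sum of a branch matches the sum of all entries below and including it.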

shundhammer commented 3 years ago

As for exporting to a pixel image format, the pedestrian way is to create a screenshot of what you are currently seeing; all Linux desktops support that in some way, either screenshotting the whole screen or just the current window, depending on the key combination. It's usually the PrintScreen key with or without any of the Alt, Ctrl etc. modifier keys.

If a user only wants the treemap (the colored rectangles in the bottom part), there is also a simple solution: Just drag the divider separating the tree view from the treemap all the way up, and the tree view disappears; and then hit PrintScreen:

treemap

(Hitting F9 and F9 again restores the usual layout with both views visible again)

This is also the best possible resolution for the treemap since this rendering is strictly pixel-based; exporting to a vector image format like SVG would not improve it in any way.

shundhammer commented 3 years ago

So, what is the use case?

Wishing for CSV implies further processing in a spreadsheet like MS Excel or LibreOffice Calc. But spreadsheets are matrix-oriented; they fail miserably with everything hierarchical, i.e. tree-based. You'd have to do special tricks that always make assumptions about the deepest nesting level of a tree, and that invariably results in kludges; this approach is very limiting.
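To illustrate the kind of trick meant here: one common workaround is to spread each path over one spreadsheet column per tree level. The sketch below is only an illustration (the function name `paths_to_level_rows` is made up); it shows that the number of columns is dictated by the deepest path in the input, so the table layout changes as soon as a deeper directory appears.

```python
def paths_to_level_rows(paths):
    """Spread each path over one column per tree level.

    Returns (header, rows). Shorter paths are padded with empty cells so
    that every row has as many cells as the deepest path has levels.
    """
    split = [p.strip("/").split("/") for p in paths]
    depth = max((len(parts) for parts in split), default=0)
    header = [f"level_{i + 1}" for i in range(depth)]
    rows = [parts + [""] * (depth - len(parts)) for parts in split]
    return header, rows


header, rows = paths_to_level_rows(["/src", "/src/doc/readme.txt"])
print(header)
for row in rows:
    print(row)
```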

What kind of processing are you thinking of? Please name some concrete examples so we have a basis for discussion.

livmion commented 3 years ago

Hi Stefan,

first of all thank you for your answer.

Unfortunately I am not in charge and I cannot choose whether something makes sense or not. I just have to follow the rules, and our data management protocol requires a very long list of all files in CSV or PDF format before we archive them in a long-term repository. In any case, I think it is a good idea to have an emergency list in case something goes wrong.

The list actually looks like this in CSV:

TreeSize Professional Bericht, 01.10.2020  15:05
  H:\  auf  [PS-05319]
Laufwerk: H:\      Größe: 1,8 TB      Belegt: 251,9 GB      Frei: 1,6 TB     

Name;Absoluter Pfad;Größe;Belegt;Dateien;Verzeichnisse;Prozent (Belegt);Letzte Änderung;Letzter Zugriff;Besitzer;Typ;Berechtigungen;Geerbte Berechtigungen;Eigene Berechtigungen;Autor;SHA256-Prüfsumme
"2020_10";"H:\2020_10\";180,9 GB;220,2 GB;91.503;1.876;87,4 %;30.09.2020;30.09.2020;"Jeder";"Ordner";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"";""
"DFG_Gelehrtenbriefe";"H:\2020_10\DFG_Gelehrtenbriefe\";66,2 GB;67,9 GB;5.192;704;30,8 %;30.09.2020;30.09.2020;"Jeder";"Ordner";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"";""
"LOTTO_28";"H:\2020_10\DFG_Gelehrtenbriefe\LOTTO_28\";45,3 GB;46,4 GB;3.472;449;68,4 %;30.09.2020;30.09.2020;"Jeder";"Ordner";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"";""
"TIFF_JPG";"H:\2020_10\DFG_Gelehrtenbriefe\LOTTO_28\TIFF_JPG\";45,3 GB;46,4 GB;3.472;448;100,0 %;30.09.2020;30.09.2020;"Jeder";"Ordner";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"";""
"A-II-RomA-BraE-001";"H:\2020_10\DFG_Gelehrtenbriefe\LOTTO_28\TIFF_JPG\A-II-RomA-BraE-001\";614,4 MB;626,5 MB;44;0;1,3 %;23.06.2020;30.09.2020;"Jeder";"Ordner";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"Jeder: Vollzugriff";"";""

Of course it is more manageable in Calc. What we really need in it is the file name, file path, size, percentage, file type and possibly creation and modification dates. Your command looks great and it is perfectly organised, but for a person who does not have any command line knowledge, it is a bit scary. A great solution would be to have a simple menu entry like: export → CSV.

Usually we attach two other files: a PDF file with all aggregated data for file type and format; a histogram where all data are shown in age classes, e.g. 1 year old, 6 months old etc., and how much space they are using. You can find two examples attached here: TreeSize_Balkanendyagramm_2020_10.pdf; TreeSize_Dateitypen_2020_10.pdf.

I have already seen that you have similar data displayed inside the software. It is not so important to have an identical graphical output, but it would be important to attach similar data, where information is organised like in the attached files.

Do you think it would be possible to export something like this from your software?

Thank you again for your help,

Best Regards

Alessio Paonessa

shundhammer commented 3 years ago

Both types of information are available, albeit in a slightly different form:

File Age Statistics (F4)

file-age

File Type Statistics (F3)

file-types

Caveat

Of course, those are screenshots; i.e.

- They are limited by the screen size; if there is more content than fits on the screen, part of it would be scrolled out of scope.
- There is no way to process any of that any further with any scripts or a spreadsheet application like Excel or LibreOffice Calc.
- It is obvious that they are screenshots since they contain window manager borders and buttons.

Furthermore, the "other" file types are limited to the top 20; any more that don't belong to one of the configured MIME categories (which you can customize, however) are omitted.
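A complete list without the top 20 limit can be derived outside of QDirStat from a plain list of paths and sizes, for example one parsed from a cache file. The following sketch is a deliberate simplification and not how QDirStat categorizes files: instead of the configurable MIME categories, it groups by the last filename suffix only, and `suffix_stats` is a made-up name.

```python
from collections import Counter
import posixpath


def suffix_stats(files):
    """Return [(suffix, file_count, total_bytes)], largest total first.

    `files` is an iterable of (path, size) pairs. Files without a suffix,
    including dot files like ".gitignore", are grouped under "(none)".
    """
    counts = Counter()
    sizes = Counter()
    for path, size in files:
        suffix = posixpath.splitext(path)[1].lower() or "(none)"
        counts[suffix] += 1
        sizes[suffix] += size
    return [(s, counts[s], sizes[s]) for s, _ in sizes.most_common()]


# Paths shortened for the example; sizes taken from the listing earlier
# in this thread
files = [
    ("/q/README.md", 44777),
    ("/q/Makefile", 35198),
    ("/q/src/History.h", 5064),
    ("/q/src/Cleanup.h", 14285),
    ("/q/.gitignore", 22),
]

for suffix, count, total in suffix_stats(files):
    print(f"{suffix}\t{count}\t{total}")
```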

livmion commented 3 years ago

Exactly, that is my problem: if the list is too long, I will not get a complete result. Is there any way to work around these limitations, i.e. to have an exportable complete list and graphical outputs of the two statistical analyses?

Thank you again

Best Regards

Alessio

shundhammer commented 3 years ago

Let me think about it.

If there is a generic solution that is not insanely complicated, and if there is a reasonable way to make it fit nicely into the GUI without adding confusion for the average user, I am all for it. But I want to avoid cluttering each of those view windows with buttons that are rarely needed; each additional button adds to the complexity of the user interface.

Maybe (just maybe) it's time for a hamburger button / menu in those dialogs to put those actions in; that would make it reasonably discoverable, yet not too obtrusive. This might take actions to expand or collapse all (toplevel?) items and an "Export" submenu with options to export as plain text or CSV.

My initial idea is to make it export the view that you are currently seeing, just the same way as items are currently expanded or collapsed. It would use the existing data models and do that in a generic way; so this could be added to any of the existing dialogs that use a tree view.

Caveat: So far, I don't make any promises beyond giving it some serious thought. ;-)

livmion commented 3 years ago

Your proposal already seems to me a great improvement; thank you again for thinking about it. These features would allow us to switch to a completely open source system in our archiving pipeline.

Unfortunately I am no programmer, only a digital humanist, and I cannot help you with the hard work. Let me know if I can support you in other ways.

shundhammer commented 3 years ago

Verdict: No

I did some experiments, and I took an in-depth look into the code; and the result is that I have to disappoint you: No, this can't be reasonably done.

Why Not?

QDirStat is a GUI-centric application. While the data are carefully kept separate from the presentation, it is in fact a whole new presentation that you are asking for: In file format, no matter if (well-formatted) plain text or CSV.

QDirStat uses Qt classes for the presentation part; they already have a considerable abstraction layer to keep the logic layers apart. At the core, they use a QAbstractItemModel which is the base class for QDirStat's DirTreeModel that in turn uses a DirTree in-memory representation of the relevant data.

Responsibilities are split between those model classes and Qt's view classes (a QTreeView-derived class in this case), and many things are abstracted so the application doesn't have to deal with all the gory details; such as which columns are visible and which are not, the order of those columns (you can rearrange them interactively), the scroll positions in both dimensions, which tree branches are expanded and which ones are collapsed, the sort order (by which column and ascending vs. descending).

While it is possible with enough trickery to break all those software abstraction layers, doing so is really violating the abstraction levels; and this is asking for trouble because some of those things are officially accessible from the outside (i.e. they are part of a documented API), but some are not (i.e. exploiting undocumented Qt features).

As a result, attempting to replicate all that so an exported file looks very much like what you see on the screen is incredibly hard, and it might easily break between different (even minor!) versions of the Qt libs.

Yet, trying to mimic the on-screen presentation would be the only reasonably useful way to export data: You would get the columns that you see on the screen in the order that you see on the screen, with the tree branches expanded that you see expanded on the screen. The only difference would be that you would no longer be limited to the screen size, so the exported file would contain everything that you could see if you had an infinitely large screen.

The only reasonable alternative would be to simply export everything, always resulting in a huge file.

Also, the export formats would have inherent limitations:

- CSV is matrix-oriented, just like a spreadsheet. It does not have any concept of hierarchies a.k.a. tree levels a.k.a. indentation. So this would only ever work as a kludge:
  - For the file tree, it would simply ignore all tree levels / indentation, and the consumer of that file would have to take care of reconstructing the tree by the complete path name which would be required to be in one of the first columns. Yikes. Try doing that in Excel or LibreOffice Calc! AFAICS this is next to impossible or completely impossible.
  - Adding empty cells at the left as a placeholder for indentation levels would quickly overwhelm any automated processing in a spreadsheet; that is too much abstraction for applications like Excel or LibreOffice Calc.
- Plain text / pretty text could use blanks for indentation; e.g., 2, 3 or 4 blanks per indentation level. That would make the tree at least somewhat recognizable.
- Column width would be a problem for any text format (not for CSV). The nice auto-adjusting columns that the tree widget has would have to be replicated; and in case some columns get excessively wide due to very wide content, it would have to do something intelligent. Yikes.
- Sorting is handled in large parts by the tree widget; the data models just provide a comparison method for two items and for each column. For an export feature, this would have to be replicated.
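To make the plain text points above concrete, here is a minimal sketch of what such an exporter would have to replicate even in the simplest case: sorting, indentation by tree level, and column widths that adjust to the content. It is a standalone illustration with made-up names (`render_tree`) that works on a flat list of paths and sizes; it does not use Qt or any QDirStat classes, and it indents by absolute path depth.

```python
def render_tree(entries, indent=2):
    """Render (path, size) pairs as an indented two-column text table.

    Entries are sorted by their path components so that children follow
    their parent directly. The name column is widened to fit the longest
    indented name; sizes are right-aligned.
    """
    rows = []
    by_components = sorted(entries, key=lambda e: e[0].strip("/").split("/"))
    for path, size in by_components:
        parts = path.strip("/").split("/")
        name = " " * (indent * (len(parts) - 1)) + parts[-1]
        rows.append((name, str(size)))
    name_width = max((len(name) for name, _ in rows), default=0)
    size_width = max((len(size) for _, size in rows), default=0)
    return "\n".join(
        f"{name:<{name_width}}  {size:>{size_width}}" for name, size in rows
    )


# Made-up entries with short paths to keep the output narrow
entries = [("/src/main.cpp", 9284), ("/src", 12288), ("/README.md", 44777)]
print(render_tree(entries))
```

Even this toy version has to make its own decisions about sort order and column width, and it does nothing yet about excessively wide names, hidden or reordered columns, or collapsed branches.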

So, all things considered, this would result in a lot of custom-written code that would in large parts duplicate functionality that is otherwise done by Qt in the Qt widgets and data model classes; that would pretty much defeat the purpose of using a well-tested and well-maintained library like Qt.

So this feature would result in a lot of very seldom-used code that is also not very well-tested and used only by a very small number of users; a sure recipe for bit rot.

It's also not very aligned with the original purpose of QDirStat and with the vision behind it: A highly interactive tool that gives you up-to-date data so that you can act upon that information immediately; in many cases using the built-in cleanup methods to delete files or directories, to compress directories etc. plus whatever custom cleanup actions users may configure themselves.

QDirStat gives you a snapshot of filesystem information as it was at the moment of reading; but a moment later, that information may already be obsolete because filesystems can (and do!) change all the time. It's all very volatile, just a snapshot in time.

Exporting such a snapshot of information is a very exotic use case. I do acknowledge your specific use case, but please understand that this is a fringe case; only a very small number of users would benefit from such a feature.

Yet, this feature would require considerable code with considerable duplication of functionality, as mentioned above; and since only very few users would ever use it, it would also be code that can be considered pretty dead code (not completely dead, but also not very alive), and any change in the (Qt or system) environment would not be noticed quickly; so it would be poorly maintained code. And that is what bit rot is all about: Code that once worked, but all the time keeps getting more and more hiccups and flaws, up to a point where it is more of a burden than a benefit; for anyone, users as well as developers.

Dead or dying code adds to a software project's technical debt, and knowingly accumulating more technical debt is always a bad idea.

So, sorry, but no, I am convinced that it's not a good idea to add this.

shundhammer commented 3 years ago

Having said all that, taking such a snapshot in time and loading it again later is exactly what the Write to Cache File and Read Cache File operations in the File menu do (and also the qdirstat-cache-writer script); but you load it with QDirStat and all its features, so all views (File Type Statistics, File Size Statistics, File Age Statistics) are available, and you are not limited by what a spreadsheet program can do with the data.

Yes, I know, that is little consolation in your case where you have to provide the data in a format that is required by higher authority. Still, I wanted to mention it for others reading this thread.
