imsnif / diskonaut

Terminal disk space navigator 🔭

Feature: Detect hard links #27

Open dbramucci opened 4 years ago

dbramucci commented 4 years ago

Running the following commands

mkdir tmp
ls -hs ./tmp
wget --quiet https://www.gutenberg.org/files/1661/1661-0.txt --output-document=./tmp/sherlock.txt
ln ./tmp/sherlock.txt ./tmp/sherlock-link.txt
ls -hs ./tmp

will produce a 600 KiB file sherlock.txt and a hard link sherlock-link.txt pointing to the same data. diskonaut currently treats these as 2 different files, but deleting either one will not free any disk space (beyond a little metadata). This is similar to the problem in issue #26, where the expectation that deleting x G of data yields x G of more usable space proves incorrect.

This problem is typical of hard links (see ln's behavior) but it would be cool if diskonaut could deal with them nicely.

UI idea

Put upper and lower bounds on disk space before running the slower process of working out the details of hard-linked files. By checking for files with a link count of 1, you can identify files that are not hard linked. If you assume that all hard-linked files are also used outside the current folder (this assumption could perhaps be relaxed for the root directory), then you get a lower bound on the space freed upon deletion. If you assume that all hard links are contained within the directory, you get an upper bound on the space freed upon deletion (this is the current behavior of diskonaut).

1G < ?G < 2.4G
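
For illustration, a rough sketch (not diskonaut code; the function name and the directory walk are just for this example) of how such bounds could be computed from the link count alone:

import os

def freeable_bounds(directory):
    """Lower/upper bounds on the space freed by deleting `directory`.

    Lower bound: assume every file with st_nlink > 1 also has a link
    outside the directory, so deleting it frees nothing.
    Upper bound: assume all links live inside the directory, so its full
    size is freed (this matches diskonaut's current behavior).
    """
    lower = upper = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            upper += st.st_size
            if st.st_nlink == 1:   # definitely not hard linked
                lower += st.st_size
    return lower, upper

For the tmp example above this would report roughly 0 < ? < 1.2M, since both files have a link count of 2.
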
imsnif commented 4 years ago

Very true. I've actually been bothered by this issue but didn't get around to dealing with it. Thanks for bringing it up!

IMO the best solution would be something where we don't have to assume whether the hard links come from within the scanned folder or without (I feel that might make things complicated and create some edge cases we might not have considered).

What do you think of: giving a UI indication that the file/folder is a hard link (or has a hard link to it), and "doing the right thing" when the user deletes it? (delete all occurrences of it in the ui and only indicate the actual amount of disk space freed by it).

As for detecting hard links: is it a very performance-heavy process? I haven't tested this, but since we're getting the file metadata anyway, can't we keep some state of "which files point to this inode" in order to know this? Or am I missing something?

dbramucci commented 4 years ago

If I recall correctly,

  1. Detecting if a hard-link exists should be cheap

    Just check if the reference count is greater than 1.

  2. Finding all files among a collection that are hard links to each other is manageable

    Almost O(n); see Union-Find/Disjoint-Set over the inode ids. This is the case most similar to ours, I think (see the sketch after this list).

  3. Finding all hard links to a file can be very expensive, because you may need to traverse the entire filesystem to find them all. That is, "reversing" hard links is a global operation over the filesystem, not a local one.
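
As a sketch of point 2 (illustrative Python, not diskonaut code), a plain map keyed by (device, inode) is enough for our case and avoids a general union-find, since two paths are hard links to the same data exactly when they share a device and an inode:

import os
from collections import defaultdict

def group_hard_links(directory):
    """Map (device, inode) -> paths under `directory` with st_nlink > 1."""
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if st.st_nlink > 1:            # the cheap check from point 1
                groups[(st.st_dev, st.st_ino)].append(path)
    return groups

Point 3 is the expensive direction: even with this table you only see the links that live under the scanned directory; the remaining st_nlink - len(paths) names could be anywhere on the filesystem.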

Here I believe the worst case is if we have 2 large directories; suppose that x1 and x2 are hard links to the same data, and y and z are distinct files, with each file 1 MiB in size.

tmp
├── D1
│   ├── x1
│   └── y
└── D2
    ├── x2
    └── z

Then, as we start computing the size of D1, we will observe that x1 is a hard-linked file. We know that there are 2 files that link to x1's data, but we can't find the other one yet because it would be absurdly slow to do so.

We now look at tmp/D1/y and add 1MiB to the size of D1 and tmp because it is obvious that there are no hard links to y (refcount of 1).

Given that no other hard link to x1 appeared in D1, we should state D1's size as 1 MiB of freeable space (assuming the delete-centric point of view).

Moving on to D2, we check out x2 and see that it has a hard link. We look to see if it matches any previously seen hard links, and it matches x1's inode.

Now we look at z and see it is hard link free. We are done with D2 and we should compute that it contains 1MiB of freeable space (if we are trying to free space).

Now the potentially expensive part is here, but I don't know off the top of my head how expensive this is.

D1 is worth 1 MiB and D2 is worth 1 MiB, but tmp, which is only D1 + D2, is actually worth 3 MiB. Why? If we delete it, both hard links x1/x2 will be deleted, freeing the 1 MiB of disk space shared between the two.

This leads to the counterintuitive result that tmp is 3 MiB of freeable space and not the 1 MiB + 1 MiB we would expect from D1 and D2.

If we take the other perspective (the space we are using if we don't break the links in a copy) we get the opposite effect. D1 and D2 both contain 2MiB of data but, due to sharing, tmp only contains 3MiB of data. But, I'm not sure how many people care about this perspective.
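
Spelling out the arithmetic for the tree above (sizes in MiB; x1 and x2 share one inode):

# Delete-centric view ("space freed if deleted"):
#   D1  -> y only                     = 1 MiB
#   D2  -> z only                     = 1 MiB
#   tmp -> y + z + shared x data      = 3 MiB   (both links live inside tmp)
#
# Copy-centric view ("space used if the links were broken by a copy"):
#   D1  -> x1 + y                     = 2 MiB
#   D2  -> x2 + z                     = 2 MiB
#   tmp -> x + y + z (x counted once) = 3 MiB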

For our purposes here, I think the asymptotically optimal data structure would be something like
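
(Rough sketch; the names are just illustrative.)

from dataclasses import dataclass

@dataclass
class InodeEntry:
    size: int          # size of the underlying data, counted once
    links_seen: int    # links to this inode seen so far in this subtree
    links_total: int   # st_nlink: links that exist filesystem-wide

# per directory: a dict {inode: InodeEntry} for data not yet fully
# accounted for, plus a running total of the sizes that already are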

Then, (with Python-esque dictionary literals) the process would look like
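
(Again a rough sketch rather than exact code; scan and the fields above are illustrative.)

import os

def scan(directory):
    """Return (freeable, pending) for `directory`.

    freeable counts every inode whose links all live inside this subtree;
    pending maps inode -> InodeEntry for data not yet fully accounted for.
    """
    freeable, pending = 0, {}
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            sub_freeable, sub_pending = scan(entry.path)
            freeable += sub_freeable
            for ino, e in sub_pending.items():       # merge the child's table
                if ino in pending:
                    pending[ino].links_seen += e.links_seen
                else:
                    pending[ino] = e
        elif entry.is_file(follow_symlinks=False):
            st = entry.stat(follow_symlinks=False)
            # (an st_nlink == 1 file could skip the dict entirely, per point 1 below)
            e = pending.setdefault(
                st.st_ino, InodeEntry(st.st_size, 0, st.st_nlink))
            e.links_seen += 1
    # strip inodes whose every link lives in this subtree (point 2 below);
    # a file with st_nlink == 1 is stripped as soon as it is seen here
    for ino in [i for i, e in pending.items()
                if e.links_seen == e.links_total]:
        freeable += pending.pop(ino).size
    return freeable, pending

For the tmp example, D1 and D2 each return 1 MiB with x's inode still pending, and tmp itself sees both links, so it reports 3 MiB.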

Some points about the above algorithm

  1. Non-hard-linked files can be treated just like hard-linked files; you just lose some performance.
  2. Once you've accounted for all occurrences of an inode, you don't need to pass that inode to the parent directory (i.e. D1 and D2 could've stripped inode2 and inode3 out and then we would just add the size of D1 and D2 to compensate for the loss of details)

    Of course, there's room here to experiment with multiple approaches around how much dictionary manipulation gets done and when it gets done.

  3. The algorithm as described above allocates a ton of data structures, 1 per directory. You may be able to get away with far fewer allocations (say, one per level of the filesystem's height, by passing around a Vec of already-allocated but currently unused dictionaries and using something like split_first_mut to give the next layer a dictionary to borrow).
  4. It remains to be seen how much data should be persisted to compensate for the user traversing their files. (Space vs. time tradeoff here.)
  5. Other features for Hard Links (helping the user spot related directories/files) may benefit greatly from less optimized versions of the above algorithm.

The worst case scenario, once fully accounted-for inodes are pruned out, would likely be a collection of hard-linked files many generations apart from their linked counterparts. If the height of the tree is h and the number of distinct files is n, then the time complexity would be O(hn), which is dangerously close to quadratic (but hopefully this case would be obscure). Visually, it looks like

tmp
├── A1
│   └── B1
│       └── C1
│           └── D1
│               └── E1
│                   └── F1
│                       └── G1
│                           └── H1
│                               ├── w1
│                               ├── x1
│                               ├── y1
│                               └── z1
└── A2
    └── B2
        └── C2
            └── D2
                └── E2
                    └── F2
                        └── G2
                            └── H2
                                ├── w2
                                ├── x2
                                ├── y2
                                └── z2
imsnif commented 4 years ago

Hey @dbramucci - thanks again for the super detailed explanation. I think I understand your concerns better now. I'm a little hesitant to go forward with the algorithm you mentioned, but maybe it's because there are some cases I haven't considered. So I'll ask it this way:

If an inode is a unique global identifier of a hard link (as in: "I am a hard link if the inode of the file I point to has more than 1 hard link pointed at it"), why can't we use a global hashmap whose keys are inodes? We update it every time we add a file, and silently ignore any file whose inode is already in this table.
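
A rough sketch of what I mean (illustrative only, not actual diskonaut code):

import os

def total_size(directory, seen_inodes=None):
    """Sum file sizes, counting each inode only the first time it is seen."""
    if seen_inodes is None:
        seen_inodes = set()              # shared across the whole scan
    total = 0
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            total += total_size(entry.path, seen_inodes)
        elif entry.is_file(follow_symlinks=False):
            st = entry.stat(follow_symlinks=False)
            if st.st_ino not in seen_inodes:   # silently skip repeats
                seen_inodes.add(st.st_ino)
                total += st.st_size
    return total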

We couple this with a UI change, swapping (x = Small Files) with (x = Small files / links).

Then, we are focused on "space that would be freed upon deletion", and the only edge case we don't deal with very nicely is if we have a hard link that leads outside our scanned folder (I think we can live with that).

I think this is essentially the same user-facing behaviour we have right now with soft-links (seeing as they take up no space, so we don't show them).

Please, do tell me if I'm not taking something into consideration.

What do you think?

dbramucci commented 4 years ago

Just to make sure that we aren't talking past each other over a misconception

To my understanding, when you make a hard link from a to b, you can no longer identify which is the "original" file and which is "just a link". As far as the file system is concerned, they are both equal files. This is different from symlinks, where it is clear that a is a symlink and b is a file.

This means that in my original example with sherlock.txt, both sherlock.txt and sherlock-link.txt are files that have a hard link somewhere else. You can't say that sherlock.txt is the original and sherlock-link.txt is a hard link just by looking at the filesystem. Thus, whenever I say "hard link", I intend for both sherlock.txt and sherlock-link.txt to be treated as "hard link"s.
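
(You can see this directly; for example in Python, both names report the same inode and a link count of 2, and nothing marks either one as the original:)

import os

a = os.stat("./tmp/sherlock.txt")
b = os.stat("./tmp/sherlock-link.txt")
print(a.st_ino == b.st_ino)    # True: both names point at the same inode
print(a.st_nlink, b.st_nlink)  # 2 2: the filesystem treats them alike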

If an inode is a unique global identifier of a hard link (as in: "I am a hard link if the inode of the file I point to has more than 1 hard link pointed at it"), why can't we use a global hashmap whose keys are inodes? We update it every time we add a file, and silently ignore any file whose inode is already in this table.

  1. Global implies a shared mutable resource (although there are immutable alternatives), which could limit parallel performance.
  2. It sounds like you would accumulate files as you go so we would ignore x2 in the second example I provided. Why is x2 treated as a link but x1 is treated as a file?
  3. Would it make sense to a user to see
|-------------|
|             |
|             |
|     D1 2M   |
|             |
|             |
|-------------|
|             |
|     D2 1M   |
|             |
|-------------|

Even though the two folders are identical in every way except for file names. Especially if the file traversal were non-deterministic (e.g. parallelism), D1 and D2 could swap sizes without any files changing between them.

It would make more sense to use a ChainMap (Python example linked because I know it off the top of my head) and say that all files are links if their parent directories have already seen a hard link to that same file, because it would improve consistency; but I feel like it wouldn't match up to any mental model or goal a user would have.

If we treat all hard links like symlinks are treated today (i.e. we exclude them from all counts, and per the discussion at the start of this issue, both sherlock.txt and sherlock-link.txt would be ignored), we reverse today's problem of over-estimating freeable space and instead would under-estimate it.

Consider the Sherlock example. Suppose our user uses 2 programs, one that only accepts files ending in .txt and one assuming stories end in .story. Initially the 2 files are distinct and 1.2M of space is used to store them. But one day the user realizes that both files are read-only and identical, and therefore hard links the 2 files together to save 600K of space.

With today's diskonaut, we would incorrectly say that they could save 1.2M of space by deleting that directory because we currently treat the 2 files separately.

With the "all hard links are like symlinks and won't get counted" diskonaut, we would say that there are 0B of space to free here, which is weird, because if we delete the directory (I'm assuming we are looking at tmp here, not sherlock.txt and sherlock.story individually) we would get 600K back.

If we use the version you propose, it would work until we zoomed in to tmp, and then sherlock.txt would look like it could free space but sherlock.story wouldn't. Or it could break when a new folder of all text files gets made and new hard links are added, because who knows whether tmp or globaltxt would get the credit for each unique inode. This hard-to-predict behavior would probably be described by users as "diskonaut gets really flakey when you add (distant) hard links to files".

I also think that hard links may frequently lead outside the user's folder (e.g. a bunch of users might have a hard link in their home folder to one global virtual machine image) and the behavior could be weird for each user (the details depend on the implementation).

As for edge cases with hard links: I believe hard links have significantly fewer edge cases than symlinks do in practice, because modern systems normally don't let you hard link directories. This eliminates issues like looping directory structures (which require cycle-detecting algorithms to solve). Of course, I think there are some systems that do allow hard linking directories, but that probably deserves an issue to itself and would be out of scope here.

soft-links (seeing as they take up no space, so we don't show them).

Minor detail, but (even ignoring metadata) soft links generally require some space (on my system it appears to be 4K per link):

~/tmp> touch test.txt
~/tmp> ln -s test.txt link
~/tmp> ls
link@  test.txt
~/tmp> ls -hs
total 4.0K
4.0K link@     0 test.txt

and yes, it takes more space than an empty text file. I can't find it, but I know the nix package manager had an issue where the space savings for hard links over soft-links were considered significant enough to merit a preference for hard links.

imsnif commented 4 years ago

Just to make sure that we aren't talking past each other over a misconception

To my understanding, when you make a hard link from a to b, you can no longer identify which is the "original" file and which is "just a link". As far as the file system is concerned, they are both equal files.

Okay, that is actually a misconception that I've had. Thank you for clearing this up, I think now I understand where you're coming from.

Before talking about the algorithm and its implementation, how would you think it best to present this to the user in a simple and immediately comprehensible way? Like with the D1+D2 example, what would you do? (I think you mentioned the lower/higher bound approach above, but I didn't understand how we would see this on screen)

The reason I'm asking is that I feel if we don't find a really simple way, we'd be hurting a lot of users for the sake of an edge case.

dbramucci commented 4 years ago

Given that the current behavior of diskonaut is to start by assuming a file is empty and growing it until it reaches the full size, and that the most likely use case for diskonaut is to free space on a drive, I believe the most natural extension is to start by assuming all hard links would free no space (i.e. they are effectively 0 bytes) and just track how many we see.

As we loop through the directory, we sum up all of the non-linked file sizes until we get to the end, where we can see whether deleting the directory would free any linked data (i.e. all hard links to an inode are contained within that directory); if so, we add that space to the space for that directory.

Users would not perceive any difference in behavior compared to today, except for smaller (and more accurate) file sizes being reported. (A perceptive user may notice that hard links are always the last thing to be added to the tally by observing the rate of growth of the sum, but that's beside the point.)

The most confusing aspect for your average user would be that diskonaut would disagree with the sizes that other utilities (like du and file managers) report. sherlock.txt, sherlock-link.txt and sherlock.story would each appear to be 0 bytes in diskonaut, but du and their file manager would show 600K.

This would be confusing to anyone who doesn't understand or recognize the relevance of hard links here. Luckily, said users would be unlikely to be creating many hard links in the first place. Additionally, we could add an indication in the box that there are hard links involved like

(really rough sketch)

|-------------------------|
|      sherlock.txt 0k    |
|  hard link to 600k data |
|-------------------------|
|     sherlock.story 0k   |
|  hard link to 600k data |
|-------------------------|

or HH and add hard links to the legend (distinct but similar to small files xx).

As for the more complicated interface I suggested at the beginning, I think it makes more sense when adding in the impact of compression, sparse files(?) and file de-duplication (a feature I haven't filed an issue for yet), which, especially when taken together, can make it very slow to get the total real disk usage of a directory. With these features (and technically hard links, but I think they are fast enough on their own) it can be faster to over-estimate things, present an estimate to the user, and then scale it back as a slower process corrects the over-estimation. In this case, I would like to see that I'm looking at an estimate and to understand what the bounds are. Given that this is a fairly advanced use case, I think it would be reasonable to add a command line flag of some sort that power users could enable to get the more complicated interface. Beginners could just ignore the extra flag.

Something like -b / --bounds which, when enabled, would change the sizes to be shown as

8G < ?, 9G < ?, ... until the entire directory is traversed, then a process of refinement:

   9G < ? < 20G
   9G < ? < 18G
   9G < ? < 14G
  10G < ? < 14G
  12G < ? < 13G
      12.3G

But, if you don't call diskonaut -b ./ then you get the simpler interface seen today.

dbramucci commented 4 years ago

As an aside, it would be nice if there were some good way to show that D1 and D2 together cover all hard links to some inode, but it is hard to think of a clean UI for that. Likewise, it would be nice (but much less important) to see that D1 and D2 share any hard links in common at all, and nice to see the trivial (and probably rare) case of hard-linked files in the same directory illustrated.

imsnif commented 4 years ago

I'm going to suggest some ideas for the UI first, because I feel once we know what we'd like to show the user, it would be easier for us to think how to do it, and thus know what we should implement and what we won't be needing. At least, that's how it's easiest for me to work. :)

So regarding UI: How about something like this (again, rough sketch):

|-------------------------|
|   (H) sherlock.txt 0k   |
|-------------------------|
|   (H) sherlock.story 0k |
|-------------------------|
(...)
                               (H = hard link, x = Small files)

And then have the legend only appear if we actually have hard links on screen.

Then, in folders that include hard links, we'll have what we have now, plus an additional line under it (assuming there is space for it) reading: (incl. 1M of hard links shared with other folders). This is of course assuming the hard links are only shared with other folders and not fully contained within that folder.

Personally, I prefer not to have too many flags and options, but rather to try and "do the right thing for everyone" as much as possible. So in my eyes this is a good compromise between giving this information to users who need it, and not confusing those who don't. What do you think?

dbramucci commented 4 years ago

Something to be aware of: any attempt to graphically represent hard links will have to deal with the fact that the standard mental model of a filesystem is a tree, whereas hard links turn the filesystem into a type of non-tree directed acyclic graph. This means that most methods of visualizing trees will necessarily drop information about the hard links, and I think it is handy to be cognizant of what information is being lost.

On the flip side, most filesystems use hard links sparingly, and therefore trees form a close approximation of said filesystems; trees also offer cleaner, less complicated and less cluttered visualizations than generalized DAGs (directed acyclic graphs) do.

At first glance, your proposal demonstrates the following

Pros

Cons

Of course, we shouldn't let the perfect be the enemy of the good. I, personally, wouldn't mind your proposed solution as the current answer to the problem. It is basically the current interface, but better.

dbramucci commented 4 years ago

A handy capability would be to highlight a hard link and press some key (say ?) to list off all locations that link to that same file.

That way you could highlight sherlock.txt, press ? and see

--------------------------------
Hard Links: ./sherlock.txt 600K
--------------------------------
* ./old-folder/sherlock.txt
* /weird-folder/stories/sherlock-collection.txt

Likewise, there are analogous popups that could be designed for folders containing hard links.

dbramucci commented 4 years ago

I'm trying to find the right terminology to distinguish between

Once I have figured out some terminology, I can describe what we are looking at in terms of graph theory, along with some relevant terminology that can be googled to find similar situations, algorithms and visualizations.