imsnif / diskonaut

Terminal disk space navigator 🔭

Feature: Detect hard links #27

Open dbramucci opened 4 years ago

dbramucci commented 4 years ago

Running the following commands

mkdir tmp
ls -hs ./tmp
wget --quiet https://www.gutenberg.org/files/1661/1661-0.txt --output-document=./tmp/sherlock.txt
ln ./tmp/sherlock.txt ./tmp/sherlock-link.txt
ls -hs ./tmp

will produce a 600 KiB file sherlock.txt and a hard link sherlock-link.txt pointing to the same data. diskonaut currently treats these as 2 different files, but deleting either one will not free any disk space (beyond a little metadata). This is similar to the problem in issue #26, where the expectation that deleting x G of data yields x G of more usable space proves incorrect.

This problem is typical of hard links (see ln's behavior) but it would be cool if diskonaut could deal with them nicely.

UI idea

Put upper and lower bounds on disk space before running the slower process of working out the details of hard-linked files. By checking for files with a link count of 1, you can identify files that are not hard linked. If you assume that all hard-linked files are also used outside the current folder (this assumption could perhaps be relaxed for the root directory), then you get a lower bound on the space freed upon deletion. If you assume that all hard links are contained within the directory, you get an upper bound on the space freed upon deletion (this is the current behavior of diskonaut).

1G < ?G < 2.4G
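
For illustration, a rough sketch (not diskonaut code; the function name and the directory walk are just for this example) of how such bounds could be computed from the link count alone:

import os

def freeable_bounds(directory):
    """Lower/upper bounds on the space freed by deleting `directory`.

    Lower bound: assume every file with st_nlink > 1 also has a link
    outside the directory, so deleting it frees nothing.
    Upper bound: assume all links live inside the directory, so its full
    size is freed (this matches diskonaut's current behavior).
    """
    lower = upper = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            upper += st.st_size
            if st.st_nlink == 1:   # definitely not hard linked
                lower += st.st_size
    return lower, upper

For the tmp example above this would report roughly 0 < ? < 1.2M, since both files have a link count of 2.
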
imsnif commented 4 years ago

Very true. I've actually been bothered by this issue but didn't get around to dealing with it. Thanks for bringing it up!

IMO the best solution would be something where we don't have to assume whether the hard links come from within the scanned folder or without (I feel that might make things complicated and create some edge cases we might not have considered).

What do you think of: giving a UI indication that the file/folder is a hard link (or has a hard link to it), and "doing the right thing" when the user deletes it? (delete all occurrences of it in the ui and only indicate the actual amount of disk space freed by it).

As for detecting hard links: is it a very performance-heavy process? I haven't tested this, but since we're getting the file metadata anyway, can't we keep some state of "which files point to this inode" in order to know this? Or am I missing something?

dbramucci commented 4 years ago

If I recall correctly,

  1. Detecting if a hard-link exists should be cheap

    Just check if the reference count is greater than 1.

  2. Finding all files among a collection that are hard links to each other is manageable

    Almost O(n); see Union-Find/Disjoint-Set over the inode ids. This is the case most similar to ours, I think (see the sketch after this list).

  3. Finding all hard links to a file can be very expensive, because you may need to traverse the entire filesystem to find them all. That is, "reversing" hard links is a global operation over the filesystem, not a local one.
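
As a sketch of point 2 (illustrative Python, not diskonaut code), a plain map keyed by (device, inode) is enough for our case and avoids a general union-find, since two paths are hard links to the same data exactly when they share a device and an inode:

import os
from collections import defaultdict

def group_hard_links(directory):
    """Map (device, inode) -> paths under `directory` with st_nlink > 1."""
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if st.st_nlink > 1:            # the cheap check from point 1
                groups[(st.st_dev, st.st_ino)].append(path)
    return groups

Point 3 is the expensive direction: even with this table you only see the links that live under the scanned directory; the remaining st_nlink - len(paths) names could be anywhere on the filesystem.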

Here I believe the worst case is if we have 2 large directories; suppose that x1 and x2 are hard links to the same data, and y and z are distinct files, with each file 1 MiB in size.

tmp
├── D1
│   ├── x1
│   └── y
└── D2
    ├── x2
    └── z

Then, as we start computing the size of D1, we will observe that x1 is a hard-linked file. We know that there are 2 files that link to x1's data, but we can't find the other one yet because it would be absurdly slow to do so.

We now look at tmp/D1/y and add 1MiB to the size of D1 and tmp because it is obvious that there are no hard links to y (refcount of 1).

Given that no other hard link to x1 appeared in D1, we should state D1's size as 1 MiB of freeable space (assuming the delete-centric point of view).

Moving on to D2, we check out x2 and see that it has a hard link. We look to see if it matches any previously seen hard links, and it matches x1's inode.

Now we look at z and see it is hard link free. We are done with D2 and we should compute that it contains 1MiB of freeable space (if we are trying to free space).

Now the potentially expensive part is here, but I don't know off the top of my head how expensive this is.

D1 is worth 1 MiB and D2 is worth 1 MiB, but tmp, which is only D1 + D2, is actually worth 3 MiB. Why? If we delete it, both hard links x1/x2 will be deleted, freeing the 1 MiB of disk space shared between the two.

This leads to the counterintuitive result that tmp is 3 MiB of freeable space and not the 1 MiB + 1 MiB we would expect from D1 and D2.

If we take the other perspective (the space we are using if we don't break the links in a copy) we get the opposite effect. D1 and D2 both contain 2MiB of data but, due to sharing, tmp only contains 3MiB of data. But, I'm not sure how many people care about this perspective.
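
Spelling out the arithmetic for the tree above (sizes in MiB; x1 and x2 share one inode):

# Delete-centric view ("space freed if deleted"):
#   D1  -> y only                     = 1 MiB
#   D2  -> z only                     = 1 MiB
#   tmp -> y + z + shared x data      = 3 MiB   (both links live inside tmp)
#
# Copy-centric view ("space used if the links were broken by a copy"):
#   D1  -> x1 + y                     = 2 MiB
#   D2  -> x2 + z                     = 2 MiB
#   tmp -> x + y + z (x counted once) = 3 MiB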

For our purposes here, I think the asymptotically optimal data structure would be something like
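
(Rough sketch; the names are just illustrative.)

from dataclasses import dataclass

@dataclass
class InodeEntry:
    size: int          # size of the underlying data, counted once
    links_seen: int    # links to this inode seen so far in this subtree
    links_total: int   # st_nlink: links that exist filesystem-wide

# per directory: a dict {inode: InodeEntry} for data not yet fully
# accounted for, plus a running total of the sizes that already are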

Then, (with Python-esque dictionary literals) the process would look like
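
(Again a rough sketch rather than exact code; scan and the fields above are illustrative.)

import os

def scan(directory):
    """Return (freeable, pending) for `directory`.

    freeable counts every inode whose links all live inside this subtree;
    pending maps inode -> InodeEntry for data not yet fully accounted for.
    """
    freeable, pending = 0, {}
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            sub_freeable, sub_pending = scan(entry.path)
            freeable += sub_freeable
            for ino, e in sub_pending.items():       # merge the child's table
                if ino in pending:
                    pending[ino].links_seen += e.links_seen
                else:
                    pending[ino] = e
        elif entry.is_file(follow_symlinks=False):
            st = entry.stat(follow_symlinks=False)
            # (an st_nlink == 1 file could skip the dict entirely, per point 1 below)
            e = pending.setdefault(
                st.st_ino, InodeEntry(st.st_size, 0, st.st_nlink))
            e.links_seen += 1
    # strip inodes whose every link lives in this subtree (point 2 below);
    # a file with st_nlink == 1 is stripped as soon as it is seen here
    for ino in [i for i, e in pending.items()
                if e.links_seen == e.links_total]:
        freeable += pending.pop(ino).size
    return freeable, pending

For the tmp example, D1 and D2 each return 1 MiB with x's inode still pending, and tmp itself sees both links, so it reports 3 MiB.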

Some points about the above algorithm

  1. Non-hard-linked files can be treated just like hard-linked files; you just lose some performance.
  2. Once you've accounted for all occurrences of an inode, you don't need to pass that inode to the parent directory (i.e. D1 and D2 could've stripped inode2 and inode3 out and then we would just add the size of D1 and D2 to compensate for the loss of details)

    Of course, there's room here to experiment with multiple approaches around how much dictionary manipulation gets done and when it gets done.

  3. The algorithm as described above allocates a ton of data structures, 1 per directory. You may be able to get away with far fewer allocations (say, one per level of the filesystem's height, by passing around a Vec of already-allocated but currently unused dictionaries and using something like split_first_mut to give the next layer a dictionary to borrow).
  4. It remains to be seen how much data should be persisted to compensate for the user traversing their files. (Space vs. time tradeoff here.)
  5. Other features for Hard Links (helping the user spot related directories/files) may benefit greatly from less optimized versions of the above algorithm.

The worst case scenario, once fully accounted-for inodes are pruned out, would likely be a collection of hard-linked files many generations apart from their linked counterparts. If the height of the tree is h and the number of distinct files is n, then the time complexity would be O(hn), which is dangerously close to quadratic (but hopefully this case would be obscure). Visually, it looks like

tmp
├── A1
│   └── B1
│       └── C1
│           └── D1
│               └── E1
│                   └── F1
│                       └── G1
│                           └── H1
│                               ├── w1
│                               ├── x1
│                               ├── y1
│                               └── z1
└── A2
    └── B2
        └── C2
            └── D2
                └── E2
                    └── F2
                        └── G2
                            └── H2
                                ├── w2
                                ├── x2
                                ├── y2
                                └── z2
imsnif commented 4 years ago

Hey @dbramucci - thanks again for the super detailed explanation. I think I understand your concerns better now. I'm a little hesitant to go forward with the algorithm you mentioned, but maybe it's because there are some cases I haven't considered. So I'll ask it this way:

If an inode is a unique global identifier of a hard link (as in: "I am a hard link if the inode of the file I point to has more than 1 hard link pointed at it"), why can't we use a global hashmap whose keys are inodes? We update it every time we add a file, and silently ignore any file whose inode is already in this table.
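
A rough sketch of what I mean (illustrative only, not actual diskonaut code):

import os

def total_size(directory, seen_inodes=None):
    """Sum file sizes, counting each inode only the first time it is seen."""
    if seen_inodes is None:
        seen_inodes = set()              # shared across the whole scan
    total = 0
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            total += total_size(entry.path, seen_inodes)
        elif entry.is_file(follow_symlinks=False):
            st = entry.stat(follow_symlinks=False)
            if st.st_ino not in seen_inodes:   # silently skip repeats
                seen_inodes.add(st.st_ino)
                total += st.st_size
    return total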

We couple this with a UI change, swapping (x = Small Files) with (x = Small files / links).

Then, we are focused on "space that would be freed upon deletion", and the only edge case we don't deal with very nicely is if we have a hard link that leads outside our scanned folder (I think we can live with that).

I think this is essentially the same user-facing behaviour we have right now with soft-links (seeing as they take up no space, so we don't show them).

Please, do tell me if I'm not taking something into consideration.

What do you think?

dbramucci commented 4 years ago

Just to make sure that we aren't talking past each other over a misconception

To my understanding, when you make a hard link from a to b, you can no longer identify which is the "original" file and which is "just a link". As far as the file system is concerned, they are both equal files. This is different from symlinks, where it is clear that a is a symlink and b is a file.

This means that in my original example with sherlock.txt, both sherlock.txt and sherlock-link.txt are files that have a hard link somewhere else. You can't say that sherlock.txt is the original and sherlock-link.txt is a hard link just by looking at the filesystem. Thus, whenever I say "hard link", I intend for both sherlock.txt and sherlock-link.txt to be treated as "hard link"s.
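
(You can see this directly; for example in Python, both names report the same inode and a link count of 2, and nothing marks either one as the original:)

import os

a = os.stat("./tmp/sherlock.txt")
b = os.stat("./tmp/sherlock-link.txt")
print(a.st_ino == b.st_ino)    # True: both names point at the same inode
print(a.st_nlink, b.st_nlink)  # 2 2: the filesystem treats them alike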

If an inode is a unique global identifier of a hard link (as in: "I am a hard link if the inode of the file I point to has more than 1 hard link pointed at it"), why can't we use a global hashmap whose keys are inodes? We update it every time we add a file, and silently ignore any file whose inode is already in this table.

  1. Global implies a shared mutable resource (although there are immutable alternatives), which could limit parallel performance.
  2. It sounds like you would accumulate files as you go so we would ignore x2 in the second example I provided. Why is x2 treated as a link but x1 is treated as a file?
  3. Would it make sense to a user to see
|-------------|
|             |
|             |
|     D1 2M   |
|             |
|             |
|-------------|
|             |
|     D2 1M   |
|             |
|-------------|

Even though the two folders are identical in every way except for file names. Especially if the file traversal were non-deterministic (e.g. parallelism), D1 and D2 could swap sizes without any files changing between them.

It would make more sense to use a ChainMap (Python example linked because I know it off the top of my head) and say that all files are links if their parent directories have already seen a hard link to that same file, because it would improve consistency; but I feel like it wouldn't match up to any mental model or goal a user would have.

If we treat all hard links like symlinks are treated today (i.e. we exclude them from all counts, and per the discussion at the start of this issue, both sherlock.txt and sherlock-link.txt would be ignored), we reverse today's problem of over-estimating freeable space and instead would under-estimate it.

Consider the Sherlock example. Suppose our user uses 2 programs, one that only accepts files ending in .txt and one assuming stories end in .story. Initially the 2 files are distinct and 1.2M of space is used to store them. But one day the user realizes that both files are read-only and identical, and therefore hard links the 2 files together to save 600K of space.

With today's diskonaut, we would incorrectly say that they could save 1.2M of space by deleting that directory because we currently treat the 2 files separately.

With the "all hard links are like symlinks and won't get counted" diskonaut, we would say that there are 0B of space to free here, which is weird, because if we delete the directory (I'm assuming we are looking at tmp here, not sherlock.txt and sherlock.story individually) we would get 600K back.

If we use the version you propose, it would work until we zoomed in to tmp, and then sherlock.txt would look like it could free space but sherlock.story wouldn't. Or it could break when a new folder of all text files gets made and new hard links are added, because who knows whether tmp or globaltxt would get the credit for each unique inode. This hard-to-predict behavior would probably be described by users as "diskonaut gets really flakey when you add (distant) hard links to files".

I also think that hard links may frequently lead outside the user's folder (e.g. a bunch of users might have a hard link in their home folder to one global virtual machine image) and the behavior could be weird for each user (the details depend on the implementation).

As for edge cases with hard links: I believe hard links have significantly fewer edge cases than symlinks do in practice, because modern systems normally don't let you hard link directories. This eliminates issues like looping directory structures (which require cycle-detecting algorithms to solve). Of course, I think there are some systems that do allow hard linking directories, but that probably deserves an issue to itself and would be out of scope here.

soft-links (seeing as they take up no space, so we don't show them).

Minor detail, but (even ignoring metadata) soft links generally require some space (on my system it appears to be 4K per link):

~/tmp> touch test.txt
~/tmp> ln -s test.txt link
~/tmp> ls
link@  test.txt
~/tmp> ls -hs
total 4.0K
4.0K link@     0 test.txt

and yes, it takes more space than an empty text file. I can't find it, but I know the nix package manager had an issue where the space savings for hard links over soft-links were considered significant enough to merit a preference for hard links.

imsnif commented 4 years ago

Just to make sure that we aren't talking past each other over a misconception

To my understanding, when you make a hard link from a to b, you can no longer identify which is the "original" file and which is "just a link". As far as the file system is concerned, they are both equal files.

Okay, that is actually a misconception that I've had. Thank you for clearing this up, I think now I understand where you're coming from.

Before talking about the algorithm and its implementation, how would you think it best to present this to the user in a simple and immediately comprehensible way? Like with the D1+D2 example, what would you do? (I think you mentioned the lower/higher bound approach above, but I didn't understand how we would see this on screen)

The reason I'm asking is that I feel if we don't find a really simple way, we'd be hurting a lot of users for the sake of an edge case.

dbramucci commented 4 years ago

Given that the current behavior of diskonaut is to start by assuming a file is empty and growing it until it reaches the full size, and that the most likely use case for diskonaut is to free space on a drive, I believe the most natural extension is to start by assuming all hard links would free no space (i.e. they are effectively 0 bytes) and just track how many we see.

As we loop through the directory, we sum up all of the non-linked file sizes until we get to the end, where we can see whether deleting the directory would free any linked data (i.e. all hard links to an inode are contained within that directory); if so, we add that space to the space for that directory.

Users would not perceive any difference in behavior compared to today, except for smaller (and more accurate) file sizes being reported. (A perceptive user may notice that hard links are always the last thing to be added to the tally by observing the rate of growth of the sum, but that's beside the point.)

The most confusing aspect for your average user would be that diskonaut would disagree with the sizes that other utilities (like du and file managers) report. sherlock.txt, sherlock-link.txt and sherlock.story would each appear to be 0 bytes in diskonaut, but du and their file manager would show 600K.

This would be confusing to anyone who doesn't understand or recognize the relevance of hard links here. Luckily, said users would be unlikely to be creating many hard links in the first place. Additionally, we could add an indication in the box that there are hard links involved like

(really rough sketch)

|-------------------------|
|      sherlock.txt 0k    |
|  hard link to 600k data |
|-------------------------|
|     sherlock.story 0k   |
|  hard link to 600k data |
|-------------------------|

or HH and add hard links to the legend (distinct but similar to small files xx).

As for the more complicated interface I suggested at the beginning, I think it makes more sense when adding in the impact of compression, sparse files(?) and file de-duplication (a feature I haven't filed an issue for yet), which, especially when taken together, can make it very slow to get the total real disk usage of a directory. With these features (and technically hard links, but I think they are fast enough on their own) it can be faster to over-estimate things, present an estimate to the user, and then scale it back as a slower process corrects the over-estimation. In this case, I would like to see that I'm looking at an estimate and to understand what the bounds are. Given that this is a fairly advanced use case, I think it would be reasonable to add a command line flag of some sort that power users could enable to get the more complicated interface. Beginners could just ignore the extra flag.

Something like -b / --bounds which, when enabled, would change the sizes to be shown as

8G < ?, 9G < ?, ... until the entire directory is traversed, then a process of refinement:

   9G < ? < 20G
   9G < ? < 18G
   9G < ? < 14G
  10G < ? < 14G
  12G < ? < 13G
      12.3G

But, if you don't call diskonaut -b ./ then you get the simpler interface seen today.

dbramucci commented 4 years ago

As an aside, it would be nice if there were some good way to show that D1 and D2 together cover all hard links to some inode, but it is hard to think of a clean UI for that. Likewise, it would be nice (but much less important) to see that D1 and D2 share any hard links in common at all, and nice to see the trivial (and probably rare) case of hard-linked files in the same directory illustrated.

imsnif commented 4 years ago

I'm going to suggest some ideas for the UI first, because I feel once we know what we'd like to show the user, it would be easier for us to think how to do it, and thus know what we should implement and what we won't be needing. At least, that's how it's easiest for me to work. :)

So regarding UI: How about something like this (again, rough sketch):

|-------------------------|
|   (H) sherlock.txt 0k   |
|-------------------------|
|   (H) sherlock.story 0k |
|-------------------------|
(...)
                               (H = hard link, x = Small files)

And then have the legend only appear if we actually have hard links on screen.

Then, in folders that include hard links, we'll have what we have now, plus an additional line under it (assuming there is space for it) reading: (incl. 1M of hard links shared with other folders). This is of course assuming the hard links are only shared with other folders and not fully contained within that folder.

Personally, I prefer not to have too many flags and options, but rather to try and "do the right thing for everyone" as much as possible. So in my eyes this is a good compromise between giving this information to users who need it, and not confusing those who don't. What do you think?

dbramucci commented 4 years ago

Something to be aware of: any attempt to graphically represent hard links will have to deal with the fact that the standard mental model of a filesystem is a tree, whereas hard links turn the filesystem into a type of non-tree directed acyclic graph. This means that most methods of visualizing trees will necessarily drop information about the hard links, and I think it is handy to be cognizant of what information is being lost.

On the flip side, most filesystems use hard links sparingly, and therefore trees form a close approximation of said filesystems; trees also offer cleaner, less complicated and less cluttered visualizations than generalized DAGs (directed acyclic graphs) do.

At first glance, your proposal demonstrates the following

Pros

Cons

Of course, we shouldn't let the perfect be the enemy of the good. I, personally, wouldn't mind your proposed solution as the current answer to the problem. It is basically the current interface, but better.

dbramucci commented 4 years ago

A handy capability would be to highlight a hard link and press some key (say ?) to list off all locations that link to that same file.

That way you could highlight sherlock.txt, press ? and see

--------------------------------
Hard Links: ./sherlock.txt 600K
--------------------------------
* ./old-folder/sherlock.txt
* /weird-folder/stories/sherlock-collection.txt

Likewise, there are analogous popups that could be designed for folders containing hard links.

dbramucci commented 4 years ago

I'm trying to find the right terminology to distinguish between

Once I have figured out some terminology, I can describe what we are looking at in terms of graph theory, along with some relevant terminology that can be googled to find similar situations, algorithms and visualizations.