selkhateeb / hardlink

A simple command-line utility that implements hardlinks on Mac OS X

`rm -rf` on hardlink deletes original files #34

Open ORESoftware opened 7 years ago

ORESoftware commented 7 years ago

It appears that

rm -rf on hard link deletes original files

This is dangerous. On Linux, if you rm the hard link, the original file is still intact.

Can you confirm / deny this behavior with your lib on MacOS?

BenjaminHCCarr commented 7 years ago

@ORESoftware this is the intended/desired behavior on UNIX

All hardlinks point to the same inode, and so to the same spot on the disk. See: http://www.farhadsaberi.com/linux_freebsd/2010/12/hard-link-soft-symbolic-links.html and https://www.freebsd.org/cgi/man.cgi?query=ln

If your linux distro is deviating from this it is not following the UNIX standard.

Hardlinks and symlinks act differently. If you create a symlink, e.g. ln -s $source $target, and then rm $target, you will still have $source, because the symlink is just a movable pointer.

Often with symlinks, if you delete $source you will end up with "dead" $target symlinks lying around.
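For anyone who wants to see the symlink behavior for themselves, here is a minimal sketch that runs in a throwaway temp directory. The file names (`source.txt`, `pointer.txt`) are made up for illustration.

```shell
#!/bin/sh
# Sketch: removing a symlink leaves its target alone, while removing the
# target leaves a dangling ("dead") symlink behind.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'data\n' > source.txt
ln -s source.txt pointer.txt

rm pointer.txt                      # removes only the symlink
if [ -f source.txt ]; then source_survives=yes; else source_survives=no; fi

ln -s source.txt pointer.txt        # recreate the symlink
rm source.txt                       # now remove the target instead

# -L tests the link itself; -e follows the link and fails once the target is gone.
if [ -L pointer.txt ] && [ ! -e pointer.txt ]; then dangling=yes; else dangling=no; fi

echo "source survives removing the symlink: $source_survives"
echo "symlink is dangling after removing the source: $dangling"

cd /
rm -rf -- "$workdir"                # clean up the temp directory
```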


So yes, I can confirm that deleting a hardlink deletes the inode and thus the original data. This is the desired behavior, though.

ORESoftware commented 7 years ago

@BenjaminHCCarr @selkhateeb is there a way to 'remove/undo the hardlink' without deleting the original files?

Do you know if hln will work on Linux? Or just MacOS?

mhelvens commented 7 years ago

@BenjaminHCCarr: Unix standard? I don't think that's true. I don't know about FreeBSD, but on both my Linux box and my MacBook, deleting one of the references of a hard link (created with ln) leaves the others intact. Deleting the last reference deletes the inode. It uses reference counting.

I'd love to get that functionality here too.
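The reference counting described above for regular files can be checked with a short sketch like the one below. It only covers files hard-linked with plain `ln` in a temp directory, not the hardlinked directories this repo creates, and the file names are made up.

```shell
#!/bin/sh
# Sketch: a file with two names keeps its data when one name is removed.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'hello\n' > original.txt
ln original.txt alias.txt           # second name for the same inode

# The link count is the second column of `ls -l` output.
links_before=$(ls -l original.txt | awk '{print $2}')

rm alias.txt                        # drops one name, not the data
links_after=$(ls -l original.txt | awk '{print $2}')
content=$(cat original.txt)

echo "links before rm: $links_before, after rm: $links_after"
echo "content still readable: $content"

cd /
rm -rf -- "$workdir"                # clean up the temp directory
```

On a typical local filesystem the link count should drop by one while the content stays readable through the remaining name.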

Swivelgames commented 9 months ago

This is a bit of a necropost, but I wanted to put this out there since this is coming up in Google searches.

rm -rf is working as expected. For links, you want unlink.

@ORESoftware The idiomatic way would be to use unlink, however I'm not sure if that applies to how this repo achieves hardlinked directories, and there are protections in place that try to prevent you from unlinking directories. It is expected that rm -rf will delete the directory and its contents, by nature of how the command works. -r works by first specifically purging files recursively until the directory is empty, and then deleting the directory from the filesystem.


@mhelvens That holds for files, but not for the contents of a directory when the directory itself is unlinked without recursing into it. The behavior that @BenjaminHCCarr is describing is exactly correct, actually. And that can be disastrous on a larger scale, which is why hardlinked directories are generally discouraged.

Deep-dive into why Hardlinked Directories are difficult and dangerous

## Directories are just files

In Unix, every file or directory is a "hardlink". In fact, in the actual physical data structure in `ext`-based filesystems, even directories are just files, and their contents are just a map of `filenames` to their respective `inodes`:

```
foo,143927
bar,127694
```

This would represent a directory containing two hardlinks: `foo` and `bar`. Either `foo` or `bar` could be a real file (in the traditional sense) or another directory. In either case, they're treated the same.

The `inode` contains the metadata for the file, including the type of file (whether it's a true file or a directory), the permissions (or mode), the location of the data blocks on the drive where the contents are stored, and a reference counter that counts how many directory entries reference the inode.

`unlink` effectively just **deletes the entry from the directory** it's in and decrements the reference counter (in fact, when trying to find references after writing this, I found that this is explicitly [how IBM describes the `unlink` command](https://www.ibm.com/docs/en/zos/2.1.0?topic=functions-unlink-remove-directory-entry)).

So:

```
unlink foo
```

Would result in:

```diff
- foo,143927
  bar,127694
```

The process then checks to see if the reference counter is `0` for that specific `inode` (in this example `143927`). If it is `0`, then we can assume that no other directories are pointing to it, and then the blocks on the drive that it points to are freed for use by new files.

## Why Hardlinked Directories are so difficult

> So, if we have a hardlinked directory, we don't want to recursively delete it and its subfiles. We simply want to remove the `pointer` to that directory in that particular location.

In fact, one of the reasons why Hardlinked Directories are avoided is because of the potential for lost space that they introduce.
For instance, theoretically, we could `unlink` the last pointer to a directory. That space on the drive that contains the list of `filename to inode` references itself would be "freed", but all of the files within the directory might be stuck on the drive forever and never freed.

### `-r` to the rescue

This is why we have `-r` for `rm`. Because, in order to avoid the headaches described above, we need to explicitly delete each individual file before we unlink the directory itself. In fact, `unlink` doesn't work on directories, but only because the command itself is very simple and isn't built for that recursion, so it explicitly forbids it. That extra code would make the process inefficient and dangerous, especially if it's not something that we explicitly wanted to do.

Otherwise, the filesystem might not realize that those files are no longer being referenced by the directory, because we didn't actually touch those files. We didn't explicitly delete the individual references to them, we just deleted the last list that contained all of those references. With that gone for good, those `inodes` are orphaned forever without a host directory. That would be a nightmare, and our large drive would quickly run out of space, and there'd be no telling why.

### Even though people do, we shouldn't even `rm` a softlinked directory

The data loss @ORESoftware experienced is actually one of the reasons it is recommended to _avoid using `rm`_ on softlinked directories in general, and to use `unlink` on them instead. Getting into the practice of using `rm` on directory links is dangerous. Instead, `rm` is a more destructive and capable version of `unlink` that we only want to use if we explicitly want to purge a directory's contents from the drive.
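As a small illustration of that recommendation, here is a sketch using an ordinary symlink to a directory (not one of this repo's hardlinked directories). The names `real_dir` and `link_dir` are hypothetical.

```shell
#!/bin/sh
# Sketch: `unlink` on a symlinked directory removes only the link entry.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

mkdir real_dir
printf 'keep me\n' > real_dir/file.txt
ln -s real_dir link_dir

unlink link_dir                     # no trailing slash: operate on the link itself

if [ -L link_dir ]; then link_gone=no; else link_gone=yes; fi
if [ -f real_dir/file.txt ]; then target_kept=yes; else target_kept=no; fi

echo "link removed: $link_gone"
echo "target contents kept: $target_kept"

cd /
rm -rf -- "$workdir"                # clean up the temp directory
```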
## Further Reading

It's important to understand that **unix filesystems don't distinguish between the "original" file/directory and the hardlink.** Because of this, we could technically create a `hardlink` to a file/directory, and then `unlink` the original location and the data would still exist.

In fact, the `move` operation works exactly like this when the source and destination are on the same filesystem. It doesn't explicitly _move_ the data. It simply adds the reference to the new directory, and then removes it from the old directory. So `mv` on `foo/bar` to `quux/bar` does the following:

```
$ mv foo/bar quux/
```

```diff
@@ quux/
+ bar,134789
@@ foo/
- bar,134789
```

That's why move operations are so fast on unix systems! 🙂

I find all of this super fascinating, so I thought I'd share for those who didn't realize.
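One way to check the claim about `mv` is to compare inode numbers before and after the move. This is only a sketch: it assumes both directories live on the same filesystem (true here, since they share one temp directory), and it reuses the `foo`, `bar`, and `quux` names from the example above.

```shell
#!/bin/sh
# Sketch: moving a file within one filesystem keeps the same inode.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

mkdir foo quux
printf 'payload\n' > foo/bar

# `ls -i` prints the inode number in the first column.
inode_before=$(ls -i foo/bar | awk '{print $1}')
mv foo/bar quux/
inode_after=$(ls -i quux/bar | awk '{print $1}')

echo "inode before mv: $inode_before"
echo "inode after mv:  $inode_after"

cd /
rm -rf -- "$workdir"                # clean up the temp directory
```

If the two numbers match, only the directory entries changed and the data blocks were never copied.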

When a directory is unlinked, the references inside of that directory aren't checked. The only requirement is that the directory is empty, because doing this type of recursive checking would be way too performance intensive. So, instead, unlink simply refuses to unlink something if it's a directory. And rm refuses to remove a directory without -r (or -d, which only removes empty directories).
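These refusals are easy to observe with the standard tools. The sketch below uses a made-up directory name (`box`) in a temp directory and records whether each command succeeded. The exact error messages differ between Linux and macOS, so they are discarded.

```shell
#!/bin/sh
# Sketch: unlink, rmdir and plain rm all refuse a non-empty directory.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

mkdir box
printf 'x\n' > box/item.txt

if unlink box 2>/dev/null; then unlink_result=removed; else unlink_result=refused; fi
if rmdir box 2>/dev/null; then rmdir_result=removed; else rmdir_result=refused; fi
if rm box 2>/dev/null; then rm_result=removed; else rm_result=refused; fi

# Once the directory is empty, rmdir is allowed to drop it.
rm box/item.txt
rmdir box
if [ -e box ]; then final=present; else final=gone; fi

echo "unlink: $unlink_result, rmdir: $rmdir_result, rm: $rm_result, after emptying: $final"

cd /
rm -rf -- "$workdir"                # clean up the temp directory
```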

This is explicitly to prevent orphaned inodes. They aren't completely avoidable, which is why we have fsck. But imagine having to run fsck on an in-use filesystem any time you deleted a directory.

That's actually the origin of lost+found. Dangling/orphaned inodes are put in lost+found if their hardlink was destroyed but their inode and data weren't cleaned up. Instead of purging the inode and its data, the assumption is that whatever happened wasn't supposed to, so fsck creates a hardlink to the inode and places it in lost+found so that it can either be recovered or permanently destroyed with rm -rf.


Path forward

The only realistic path forward for this would be a custom wrapper for unlink that performs the white-glove checks to make sure that things are in order: