ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Make IPFS reflink aware, dedup file storage between IPFS and user downloaded files #8201

Open Jorropo opened 3 years ago

Jorropo commented 3 years ago

TL;DR

Use reflinks from the Linux kernel to store IPFS's block blobs and files useful to the user (such as ones downloaded with ipfs get) on the same backing data blocks, deduping them.

What are reflinks, in too much detail

IPFS's on-disk datastores all need to store files twice: once in the datastore, and once on the file system where the file is useful to you. This is annoying and requires both expensive copies (not for filestore, more on that later) and double storage. However, there is a solution: copy_file_range (godoc). (There are various other syscalls doing more or less the same thing; this one is just near perfect for this use case.)

This simply performs an in-kernel copy from one file descriptor to another (with some offset and the capacity to append). However, if the file system allows it, this doesn't make a copy: it makes a reflink (a copy-on-write copy of the file). In practice the file isn't copied, only its entry in the file system (inode) is, so both files now share the same data blocks. Unlike with a hardlink, though, if either of those files is modified, instead of overwriting the data of both files a new copy is made and the modified file's inode is changed to point to it.
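For a feel of the mechanics, here is a minimal sketch of calling copy_file_range from Go via golang.org/x/sys/unix (Linux only; the file names are made up). On a reflink-capable file system the kernel satisfies the copy by sharing extents rather than duplicating bytes:

```go
// Minimal sketch: copy (and, on btrfs/XFS, reflink) src into dst using
// copy_file_range. Assumes Linux and golang.org/x/sys/unix.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	src, err := os.Open("big-file.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("big-file-copy.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	info, err := src.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// The kernel may copy less than asked, so loop; it advances the
	// offsets through the pointers we pass in.
	var srcOff, dstOff int64
	remaining := info.Size()
	for remaining > 0 {
		n, err := unix.CopyFileRange(int(src.Fd()), &srcOff, int(dst.Fd()), &dstOff, int(remaining), 0)
		if err != nil {
			log.Fatal(err)
		}
		if n == 0 {
			break // nothing left to copy
		}
		remaining -= int64(n)
	}
}
```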

This allows us to store multiple copies in the file system (such as one for the datastore and X for the user) with the space usage of one.

Some file systems such as btrfs are even capable of building DAGs (more precisely, B+trees) for files, so if you modify 1 byte somewhere in a 1 GB file, it's probable that only the block containing that byte is copied, and both files still share the storage for the nearly 1 GB unmodified part. Apart from btrfs, other well-known reflink-capable file systems are XFS and ZFS.

Practically, how IPFS could make this work

The modified datastore option

Like datastore.GCDatastore, create a datastore.ReflinkDatastore that IPFS would attempt to type-assert to during construction. The API could look like this:

```go
package datastore

import "os"

type ReflinkType uint

const (
	// Reflink impossible: attempting a reflink would fail or have no
	// advantage compared to calling `Get` and copying the result.
	ReflinkType_NoReflink ReflinkType = iota
	// No reflink, however a fast copy (such as an in-kernel one) is possible.
	ReflinkType_FastCopy
	// Reflinking, so deduping the data on the storage medium.
	ReflinkType_Reflink
)

type ReflinkDatastore interface {
	Datastore

	// GetReflink reflinks (or copies) the block stored under key into
	// file at offset, without seeking file.
	GetReflink(key Key, file *os.File, offset int64) error
	// PutReflink stores size bytes of file, starting at offset, under key.
	PutReflink(key Key, file *os.File, offset int64, size int) error

	// CanReflink tests file and reports which kind of copy is possible.
	CanReflink(file *os.File) ReflinkType
}
```

About GetReflink and PutReflink: the *os.File is just there to fetch the file descriptor; it could maybe be replaced by an uintptr, but I like *os.File more as it makes the type of the API explicit. offset and size are used because from one single file you could create multiple blocks, or append multiple blocks into one file. offset is the offset into the file. size is the size of the copy; on GetReflink it is implied by the size of the block under key. Implementations must not seek file, as these methods could be called concurrently.

It should be obvious how flat-fs could be made compatible with this using copy_file_range.
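For illustration, here is a rough sketch (not flatfs's real code) of what GetReflink could look like for a flatfs-style store where each block lives in its own file; diskPath is a hypothetical helper mapping a key to its on-disk shard path:

```go
// Hypothetical sketch, not actual flatfs code: each block is assumed to
// live in its own file under root, and diskPath maps a key to that file.
package flatfs

import (
	"os"
	"path/filepath"

	ds "github.com/ipfs/go-datastore"
	"golang.org/x/sys/unix"
)

type Datastore struct{ root string }

func (fs *Datastore) diskPath(key ds.Key) string {
	return filepath.Join(fs.root, key.String()+".data")
}

// GetReflink reflinks (or, failing that, kernel-copies) the block for key
// into file at offset, without seeking either file.
func (fs *Datastore) GetReflink(key ds.Key, file *os.File, offset int64) error {
	src, err := os.Open(fs.diskPath(key))
	if err != nil {
		return err
	}
	defer src.Close()

	info, err := src.Stat()
	if err != nil {
		return err
	}

	var srcOff int64
	remaining := info.Size()
	for remaining > 0 {
		n, err := unix.CopyFileRange(int(src.Fd()), &srcOff, int(file.Fd()), &offset, int(remaining), 0)
		if err != nil {
			return err
		}
		if n == 0 {
			break // source ended early
		}
		remaining -= int64(n)
	}
	return nil
}
```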

CanReflink tests the file and returns the appropriate value (see the comments in the enum).
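One possible heuristic for CanReflink, continuing the hypothetical flatfs sketch above (and assuming the ReflinkType constants are in scope); a real implementation might instead just attempt a small copy_file_range and inspect the error:

```go
// Heuristic sketch: fstatfs the destination and whitelist file systems
// known to support reflinks.
func (fs *Datastore) CanReflink(file *os.File) ReflinkType {
	var st unix.Statfs_t
	if err := unix.Fstatfs(int(file.Fd()), &st); err != nil {
		return ReflinkType_NoReflink
	}
	switch st.Type {
	case unix.BTRFS_SUPER_MAGIC, unix.XFS_SUPER_MAGIC:
		// Reflink-capable; copy_file_range will share extents.
		return ReflinkType_Reflink
	default:
		// copy_file_range still works as an in-kernel copy on most
		// file systems, just without any deduplication.
		return ReflinkType_FastCopy
	}
}
```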

Then, while getting a file, we would first test whether we are on the same machine as the API caller (as obviously nothing can be done if we are not on the same file system). If so, open the target file, download it, and walk the block graph. Test whether it makes sense to use the reflink API (otherwise fall back to the current implementation). While walking the graph, just call GetReflink for each block, passing the correct offsets.

The reverse applies while adding (just send the correct size + offset).
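Under these assumptions, the get path could look roughly like the sketch below, where leaves and sizes are hypothetical stand-ins for what a real UnixFS DAG traversal would supply (each leaf block's key and length, in file order):

```go
package datastore

import (
	"errors"
	"os"
)

// reflinkGet sketches the get path described above: after the graph is
// downloaded, walk the file's leaf blocks in order and reflink each one
// into dst at its running offset.
func reflinkGet(rds ReflinkDatastore, leaves []Key, sizes []int64, dst *os.File) error {
	if rds.CanReflink(dst) == ReflinkType_NoReflink {
		return errors.New("reflink not possible, fall back to the regular get path")
	}
	var offset int64
	for i, key := range leaves {
		if err := rds.GetReflink(key, dst, offset); err != nil {
			return err
		}
		offset += sizes[i]
	}
	return nil
}
```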

Cross file-system

Reflinks don't work across file systems (most of the time: btrfs can reflink from an ext2/3/4 partition if it is read-only, which is used when converting ext partitions to btrfs; we can ignore this). So, to avoid falling back to a copy when multiple file systems are in play, we would need to support some kind of multi-datastore system (or a datastore capable of storing in multiple paths), allowing us to have a datastore folder on each file system and to pick the reflink-compatible one.
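A cheap way to pick the datastore folder on the right file system is to compare device IDs, as in this Linux-only sketch (sameFilesystem is a made-up helper name):

```go
// Sketch: decide whether a reflink between file and a datastore rooted at
// path could work at all, by checking that they live on the same device.
package reflinkutil

import (
	"os"

	"golang.org/x/sys/unix"
)

func sameFilesystem(file *os.File, path string) (bool, error) {
	var a, b unix.Stat_t
	if err := unix.Fstat(int(file.Fd()), &a); err != nil {
		return false, err
	}
	if err := unix.Stat(path, &b); err != nil {
		return false, err
	}
	// Same st_dev means same mounted file system, so a reflink can work.
	return a.Dev == b.Dev, nil
}
```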

The main issue is how we would handle data management. Let's assume something quite simple: we have a main datastore and multiple flat-fs (one for each file system); the main datastore stores every block that belongs nowhere in particular, plus an index of blocks stored in the flat-fs of another file system. While getting a file, you would look up the main datastore to see if the block is already in the flat-fs of the destination file system. If it is, just reflink it on top of the target file as usual and continue with the other blocks. If it is not, first copy it into that flat-fs, then reflink it, and continue.

However, if the user then removes or fully modifies the file they just downloaded, we keep a reference to it for nothing, preventing space from being freed. Ideally we would need some kind of purgeable reflink that the OS could remove to free up space if needed, but I don't think Linux has that kind of thing yet.

The main drawbacks

About filestore

This is in some ways better than filestore, as the original files added through reflink would be modifiable, movable, etc., without clobbering IPFS's datastore.

Edits:

On my way to an implementation, a few issues cropped up.

aschmahmann commented 3 years ago

👍 Taking advantage of copy-on-write support to create a better Filestore makes sense to me (this was also suggested in https://github.com/ipfs/go-ipfs/issues/7557).

I tend to think that support for reflinks should be something the user explicitly opts into (e.g. via a config flag); otherwise docs/UX become basically impossible and you end up with something like "you can save space by using the Filestore on Linux and you're free to change your existing files, but on Windows everything will break if you modify your files".

As mentioned in the above issue, deduping storage during ipfs get is going to require more plumbing changes than ipfs add (see https://github.com/ipfs/go-ipfs/issues/3981 for some background and previous ideas); just making a new datastore (even one that replaces the filestore) is insufficient.

Overall, copy-on-write is generally great for data that is mostly immutable; looking forward to hearing how your experimentation goes.

Jorropo commented 2 years ago

Update: I have created a Go proposal which would make implementing this far easier: golang/go#52383

calumapplepie commented 8 months ago

Be careful not to overthink it; I for one would be fine with the system temporarily using more disk space if it means long-term savings. If we know where all the files are in the datastore and file system, can we not simply load them in normally and then, later and separately, use the IOCTLs a la jdupes to remove the duplicates?
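For concreteness, a sketch of that offline-dedupe idea using golang.org/x/sys/unix (the FIDEDUPERANGE ioctl is what jdupes uses); the paths are made up, and large files would need to be processed in chunks since the kernel caps the per-call length:

```go
// Sketch: ask the kernel to dedupe dstPath against srcPath via the
// FIDEDUPERANGE ioctl. The kernel verifies both ranges hold identical
// bytes before sharing their extents, so this is safe to run "later and
// separately" as suggested above.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func dedupe(srcPath, dstPath string) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(dstPath, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer dst.Close()

	info, err := src.Stat()
	if err != nil {
		return err
	}

	arg := unix.FileDedupeRange{
		Src_offset: 0,
		Src_length: uint64(info.Size()),
		Info:       []unix.FileDedupeRangeInfo{{Dest_fd: int64(dst.Fd()), Dest_offset: 0}},
	}
	if err := unix.IoctlFileDedupeRange(int(src.Fd()), &arg); err != nil {
		return err
	}
	if arg.Info[0].Status == unix.FILE_DEDUPE_RANGE_DIFFERS {
		return fmt.Errorf("contents differ, nothing deduped")
	}
	fmt.Printf("deduped %d bytes\n", arg.Info[0].Bytes_deduped)
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: dedupe <datastore-block> <user-file>")
		os.Exit(1)
	}
	if err := dedupe(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```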

Jorropo commented 8 months ago

@calumapplepie that sounds harder to me than doing it right.

The biggest issue here is that our datastore returns []byte. If we could make it return an io.Reader, then we could use type assertions like io.Copy does and call WriteTo or ReadFrom, after which Go takes care of the deduping for us, because *os.File implements io.ReaderFrom using copy_file_range.
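A sketch of why that would be enough: with regular files on both ends, plain io.Copy already takes the fast path, since it type-asserts its arguments against io.ReaderFrom/io.WriterTo and *os.File's ReadFrom uses copy_file_range on Linux (the file names here are made up):

```go
// Sketch: if the datastore handed back an io.Reader backed by an *os.File,
// io.Copy would move no bytes through userspace: dst.ReadFrom(src) ends up
// in copy_file_range, which reflinks on capable file systems.
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	src, err := os.Open("block.data") // pretend this came from the datastore
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("output.data")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// No userspace buffer involved on Linux with Go >= 1.15.
	if _, err := io.Copy(dst, src); err != nil {
		log.Fatal(err)
	}
}
```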