Open Jorropo opened 3 years ago
👍 Taking advantage of copy-on-write support to create a better Filestore makes sense to me (this was also suggested in https://github.com/ipfs/go-ipfs/issues/7557).
I tend to think that support for reflinks should be something the user explicitly opts into (e.g. via config flag) otherwise docs/UX become basically impossible and you end up with something like "you can save space by using the Filestore on Linux and you're free to change your existing files, but on Windows everything will break if you modify your files".
As mentioned in the above issue, deduping storage during `ipfs get` is going to require more plumbing changes than `ipfs add` (see https://github.com/ipfs/go-ipfs/issues/3981 for some background and previous ideas); just making a new datastore (even one that replaces the filestore) is insufficient.
Overall copy-on-write is generally great for data that is mostly immutable, looking forward to hearing how your experimentation goes.
Update: I have created a Go proposal which would make implementing this far easier: golang/go#52383
Be careful not to overthink it; I for one would be fine with the system temporarily using more disk space if it means long-term savings. If we know where all the files are in the datastore and file system, can we not simply load them in normally and then, later and separately, use the IOCTLs a la jdupes to remove the duplicates?
@calumapplepie that sounds harder to me than doing it right.
The biggest issue here is that our datastore returns `[]byte`. If we could make it return an `io.Reader`, then we could use a type assertion like `io.Copy` does and call `WriteTo` (from `io.WriterTo`) or `ReadFrom` (from `io.ReaderFrom`), after which Go takes care of deduping for us, because `*os.File` implements `io.ReaderFrom` by using `copy_file_range`.
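As a rough illustration of the idea in this comment (a sketch, not how the go-ipfs datastore works today): if a block could be opened as an `io.Reader` backed by an `*os.File`, a plain `io.Copy` would be enough, because `io.Copy` checks for `io.WriterTo` on the source and `io.ReaderFrom` on the destination before falling back to a buffer loop. `openBlock`, `exportBlock` and `roundTrip` are hypothetical names, and whether the kernel actually shares blocks or just copies them depends on the platform, the Go version and the file system.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// openBlock is a hypothetical datastore accessor that returns an io.Reader
// (here backed by an *os.File) instead of a []byte.
func openBlock(path string) (io.ReadCloser, error) {
	return os.Open(path)
}

// exportBlock writes one block to dst at dst's current position. io.Copy
// first checks whether src implements io.WriterTo or dst implements
// io.ReaderFrom and delegates to them; with two *os.File values on Linux,
// Go can then ask the kernel to do the copy instead of moving the bytes
// through user space.
func exportBlock(dst *os.File, blockPath string) (int64, error) {
	src, err := openBlock(blockPath)
	if err != nil {
		return 0, err
	}
	defer src.Close()
	return io.Copy(dst, src)
}

// roundTrip stores data as a "block" in a temporary directory, exports it to
// a second file and returns what ended up in that file.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "reflink-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	blockPath := filepath.Join(dir, "block")
	if err := os.WriteFile(blockPath, []byte(data), 0o600); err != nil {
		return "", err
	}
	dst, err := os.Create(filepath.Join(dir, "out"))
	if err != nil {
		return "", err
	}
	defer dst.Close()
	if _, err := exportBlock(dst, blockPath); err != nil {
		return "", err
	}
	out, err := os.ReadFile(dst.Name())
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := roundTrip("hello block")
	fmt.Println(out, err)
}
```

The point of the design is that the caller never mentions `copy_file_range`: the fast path is picked by the type assertion inside `io.Copy`, and everything still works (as a normal copy) when it is not available.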
TL;DR
Use reflinks from the Linux kernel to store IPFS's blobs for blocks and the files useful to the user (such as those downloaded with `ipfs get`) using the same backing data blocks, deduping them.
What are reflinks, in too much detail
IPFS's on-disk datastores all need to store files twice: once in the datastore, and once on the file system where the file is useful to you. This is annoying, requires expensive copies (not for filestore, more on that later) and doubles storage use. However, there is a solution!
`copy_file_range` (godoc) (there are various other syscalls doing more or less the same thing; this one is just near perfect for this use case).
This performs an in-kernel copy from one file descriptor to another (with offsets and the ability to append). However, if the file system allows it, this doesn't make a copy; it makes a reflink (a copy-on-write copy of the file). In practice the file's data isn't copied, only its entry in the file system (the inode) is, so both files now share the same data blocks. Unlike with a hardlink, if either of those files is modified, the data of both files is not overwritten; instead a new copy is made and the modified file's inode is changed to point to it.
This allows us to store multiple copies in the file system (such as one for the datastore and X for the user) with the space usage of one.
Some file systems such as btrfs are even capable of building DAGs (more precisely, B+trees) for the files, so if you modify 1 byte somewhere in a 1GB file, it's probable that only the block containing this byte is copied, and the nearly 1GB unmodified part is still shared by both files. Apart from btrfs, other well-known reflink-capable file systems are XFS and ZFS.
Practically how could IPFS make it work
The modified datastore option
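One possible shape for the interface this section goes on to describe, as a sketch only. The method names `GetReflink`, `PutReflink` and `CanReflink` come from the discussion; the exact signatures, the `ReflinkSupport` values, the string keys and the `copyStore` toy implementation are assumptions made for illustration, and `copyStore` copies bytes with `ReadAt`/`WriteAt` rather than creating real reflinks.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ReflinkSupport is a hypothetical result type for CanReflink.
type ReflinkSupport int

const (
	ReflinkUnknown     ReflinkSupport = iota // cannot tell without trying
	ReflinkUnsupported                       // only a plain copy is possible
	ReflinkSupported                         // storage can be shared with the file
)

// ReflinkingDatastore sketches the optional interface. Keys are plain
// strings to keep the example self-contained. Implementations must not seek
// file, since the methods may be called concurrently.
type ReflinkingDatastore interface {
	// GetReflink writes the block stored under key into file at offset.
	GetReflink(key string, file *os.File, offset int64) error
	// PutReflink stores size bytes of file, starting at offset, under key.
	PutReflink(key string, file *os.File, offset, size int64) error
	// CanReflink reports whether file could share storage with the datastore.
	CanReflink(file *os.File) ReflinkSupport
}

// copyStore is a toy directory-backed implementation that only copies bytes;
// a real one would ask the kernel to share the underlying blocks.
type copyStore struct{ dir string }

func (s copyStore) GetReflink(key string, file *os.File, offset int64) error {
	data, err := os.ReadFile(filepath.Join(s.dir, key))
	if err != nil {
		return err
	}
	_, err = file.WriteAt(data, offset) // positional write, no seek
	return err
}

func (s copyStore) PutReflink(key string, file *os.File, offset, size int64) error {
	if size < 0 {
		return fmt.Errorf("negative size %d", size)
	}
	buf := make([]byte, size)
	if _, err := file.ReadAt(buf, offset); err != nil { // positional read, no seek
		return err
	}
	return os.WriteFile(filepath.Join(s.dir, key), buf, 0o600)
}

func (s copyStore) CanReflink(*os.File) ReflinkSupport { return ReflinkUnsupported }

// demo splits "helloworld" into two 5-byte blocks, then rebuilds a file with
// the blocks in the opposite order.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "reflink-ds")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	var ds ReflinkingDatastore = copyStore{dir: dir}

	srcPath := filepath.Join(dir, "src.bin")
	if err := os.WriteFile(srcPath, []byte("helloworld"), 0o600); err != nil {
		return "", err
	}
	src, err := os.Open(srcPath)
	if err != nil {
		return "", err
	}
	defer src.Close()
	dst, err := os.Create(filepath.Join(dir, "dst.bin"))
	if err != nil {
		return "", err
	}
	defer dst.Close()

	steps := []func() error{
		func() error { return ds.PutReflink("a", src, 0, 5) },
		func() error { return ds.PutReflink("b", src, 5, 5) },
		func() error { return ds.GetReflink("b", dst, 0) },
		func() error { return ds.GetReflink("a", dst, 5) },
	}
	for _, step := range steps {
		if err := step(); err != nil {
			return "", err
		}
	}
	out, err := os.ReadFile(dst.Name())
	return string(out), err
}

func main() {
	fmt.Println(demo())
}
```

Passing explicit offsets and using only positional reads and writes is what lets several blocks share one file without any caller touching the file's seek position.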
Like `datastore.GCDatastore`, create a `datastore.ReflinkingDatastore` that IPFS would attempt to cast to during construction. The API could be built around three methods, `GetReflink`, `PutReflink` and `CanReflink`.
About `GetReflink` and `PutReflink`: the `*os.File` is just there to fetch the file descriptor. It could maybe be replaced by a `uintptr`, but I like `*os.File` more as this makes the type of the API explicit. `offset` and `size` are used because from one single file you could create multiple blocks, or append multiple blocks into one file. `offset` is the offset into the `file`. `size` is the size of the copy; on `GetReflink` this is implied by the size of the key. Implementations must not seek `file`, as this could be called concurrently.
It should be obvious how flat-fs could be made compatible with this using `copy_file_range`.
`CanReflink` tests the file and returns the appropriate value (see the comments in the enum).
Then, while getting a file, we would first test whether we are on the same machine as the API caller (as obviously nothing can be done if we are not on the same file system). If so, open the target file, download it, and walk the block graph. Test whether it makes sense to use the reflink API (otherwise fall back to the current implementation). While walking the graph, for each block just call
`GetReflink`, passing the correct offsets.
The reverse applies while adding (just send the correct `size` + `offset`).
Cross file-system
Reflinks don't work across file systems (most of the time: btrfs can reflink from an ext2/3/4 partition if it is read-only, which is used when converting ext partitions into btrfs, but we can ignore this). So, to avoid falling back to a copy across multiple file systems, we would need to support some kind of multiple-datastore system (or a datastore capable of storing in multiple paths), allowing a datastore folder in each file system and picking the reflink-compatible one.
The main issue is how we would handle data management. Let's assume something quite simple: we have a main datastore and multiple flat-fs (one for each file system); the main datastore stores every block that belongs nowhere else, plus an index of the blocks held by the flat-fs on each other file system. So while getting a file, you would look up in the main datastore whether the block is already in the flat-fs of the destination file system. If it is, just reflink it onto the target file as usual and continue with the other blocks. If it is not, first copy it into that flat-fs, then reflink it and continue. However, if the user then removes or fully modifies the file they just downloaded, we are keeping a reference to it for nothing, which prevents freeing up space. Ideally we would need some kind of purgeable reflink that the OS could remove to free up space if needed, but I don't think Linux has this kind of thing yet.
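To make "picking the reflink-compatible one" a bit more concrete, here is a minimal sketch that chooses, among several datastore directories, one that lives on the same device as the target directory. `deviceID`, `pickDatastore` and `demo` are hypothetical helpers, the example assumes a Unix-like system where `os.FileInfo.Sys()` yields a `*syscall.Stat_t`, and comparing device IDs is only a heuristic: being on the same device does not guarantee that the file system supports reflinks.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// deviceID returns the ID of the device that holds path (st_dev).
func deviceID(path string) (uint64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("no stat data for %s", path)
	}
	return uint64(st.Dev), nil
}

// pickDatastore returns the first datastore directory that is on the same
// device as targetDir, or "" if there is none. Directories that cannot be
// inspected are skipped.
func pickDatastore(targetDir string, stores []string) string {
	want, err := deviceID(targetDir)
	if err != nil {
		return ""
	}
	for _, s := range stores {
		if got, err := deviceID(s); err == nil && got == want {
			return s
		}
	}
	return ""
}

// demo creates a temporary directory and checks that it is picked over a
// store path that does not exist.
func demo() (picked, want string, err error) {
	want, err = os.MkdirTemp("", "reflink-store")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(want)
	picked = pickDatastore(want, []string{"/hypothetical/missing/store", want})
	return picked, want, nil
}

func main() {
	picked, want, err := demo()
	fmt.Println(picked == want, err)
}
```

When no store matches, the caller would be in the "first copy it into that flat-fs" case described above, or would have to fall back to a plain copy.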
The main drawbacks
`copy_file_range` is only stable since `5.3` (it was introduced much earlier, but it was a buggy, inconsistent API, and it's generally advised to avoid it prior to `5.3`); this means Debian 11 (not even marked stable yet) or Ubuntu 19.10 (20.04 for LTS), for example. There are other syscalls, less suited to the task, that could still make it work though. I think this is not an issue: at first we can use `uname` to enable it depending on `x >= 5.3` while we figure out the architecture and then, if someone with an old kernel wants it, they could try to make it work with older kernels.
About filestore
This is in some ways better than filestore, as the original files added through reflink would be modifiable, movable, ... without clobbering IPFS's datastore.
Edits:
On my way to an implementation, a few issues cropped up.
`copy_file_range` is probably not the best solution. It's great that it always works (when supported), but it gives very little control over what the kernel does: if you want a reflink but the kernel can't make one, it will just copy internally (which is better than a manual copy, but not great when you want to fall back to another solution). There is `ioctl_ficlonerange` (godoc), which errors on failure instead (in the end this doesn't matter, as this is an implementation detail); `ioctl_ficlonerange` is also way older and should be supported on way more kernels.
There is no `canReflink` syscall; the only way to know is to try and maybe fail, so the `CanReflink` call seems impractical for now.
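The "try and maybe fail" approach can be sketched as follows: attempt the `FICLONERANGE` ioctl and, on any error, fall back to an ordinary positional copy. This is a Linux-only sketch under stated assumptions: the request number is hard-coded to the value used on x86-64 and arm64, and `tryReflink`, `cloneOrCopy` and `demo` are hypothetical names, not part of any IPFS API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
	"unsafe"
)

// ficloneRange is the FICLONERANGE request number on x86-64 and arm64 Linux
// (assumption: other architectures may use a different value).
const ficloneRange = 0x4020940d

// fileCloneRange mirrors struct file_clone_range from <linux/fs.h>.
type fileCloneRange struct {
	SrcFd      int64
	SrcOffset  uint64
	SrcLength  uint64
	DestOffset uint64
}

// tryReflink asks the kernel to make dst share length bytes with src. It
// fails, instead of silently copying, when a reflink is not possible.
func tryReflink(dst, src *os.File, srcOff, length, dstOff int64) error {
	arg := fileCloneRange{int64(src.Fd()), uint64(srcOff), uint64(length), uint64(dstOff)}
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, dst.Fd(), ficloneRange, uintptr(unsafe.Pointer(&arg)))
	if errno != 0 {
		return errno
	}
	return nil
}

// cloneOrCopy tries a reflink first and falls back to a plain copy using
// ReadAt/WriteAt, so neither file's seek position is touched. The boolean
// reports whether the reflink succeeded.
func cloneOrCopy(dst, src *os.File, srcOff, length, dstOff int64) (bool, error) {
	if err := tryReflink(dst, src, srcOff, length, dstOff); err == nil {
		return true, nil
	}
	buf := make([]byte, length)
	if _, err := src.ReadAt(buf, srcOff); err != nil {
		return false, err
	}
	_, err := dst.WriteAt(buf, dstOff)
	return false, err
}

// demo writes data to a temporary file, clones or copies it into a second
// file and returns the second file's content.
func demo(data string) (string, error) {
	dir, err := os.MkdirTemp("", "ficlonerange")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	srcPath := filepath.Join(dir, "src")
	if err := os.WriteFile(srcPath, []byte(data), 0o600); err != nil {
		return "", err
	}
	src, err := os.Open(srcPath)
	if err != nil {
		return "", err
	}
	defer src.Close()
	dst, err := os.Create(filepath.Join(dir, "dst"))
	if err != nil {
		return "", err
	}
	defer dst.Close()
	if _, err := cloneOrCopy(dst, src, 0, int64(len(data)), 0); err != nil {
		return "", err
	}
	out, err := os.ReadFile(dst.Name())
	return string(out), err
}

func main() {
	fmt.Println(demo("some block data"))
}
```

Because the ioctl reports failure explicitly, the caller keeps control over the fallback, which is the property `copy_file_range` lacks according to the edit above.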