Closed: anarcat closed this issue 7 years ago.
Yep, this can be implemented as either (a) a different repo altogether, or (b) just a different datastore. It should certainly be an advanced feature, as moving or modifying the original file at all would render the objects useless, so users should definitely know what they're doing.
note it is impossible for ipfs to monitor changes constantly, as it may be shut down when the user modifies the files. this sort of thing requires an explicit intention to use it this way. An intermediate point might be to give ipfs a set of directories to watch/scan and make available locally. this may be cpu intensive (may require lots of hashing on each startup, etc).
the way git-annex deals with this is by moving the file to a hidden directory (.git/annex/objects/[hashtree]/[hash]), making it readonly and symlinking the original file.
it's freaking annoying to have all those symlinks there, but at least there's only one copy of the file.
ipfs could track files the same way a media player tracks its media collection:
Hello, my name is vitzli and I'm a dataholic.
I am just a user, and I operate several private/LAN repositories for open source projects. That gives me the ability to update my operating systems and speed up VM deployment when the Internet connection doesn't work very well. Right now the Debian repository is approximately 173 GB and 126,000 files; the Debian images are about 120 GB and I share them over BitTorrent (I'm using jigdo-lite to build them from the repository and download the difference between the current repo and the required template from the mirror). While I prefer to use public and official torrent trackers, some projects, like FreeBSD, do not offer torrent files, so I get those over private trackers.
The same debian/centos images are there too, and I don't mind sharing them for some sweet ratio. There is no need for them to be writable, so I keep them owned by root with 644 permissions. Unfortunately, people combine several images into one torrent, which breaks the infohash (and the DHT swarms get separated too), so I have to keep two copies of the iso images (I symlink/hardlink them to cope with that). As far as I understand, this won't be an issue with ipfs, but I would really like to keep those files there (as files on a read-only root:644 partition, not symlinks/hardlinks; potentially, they could be mounted over the local network). If ipfs could be used to clone/copy/provide a CDN for the Internet Archive and Archive Team, the problems would be similar. Here is my list of demands, dealbreakers and thoughts for an ipfs addref command (or whatever it may be called):
get and addref tasks - this seems excessive, but somebody may ask for it.

Here's a disk usage plot when adding a large (~3.4 GiB) file:
[1059][rubiojr@octox] ./df-monitor.sh ipfs add ~/mnt/octomac/Downloads/VMware-VIMSetup-all-5.5.0-1623099-20140201-update01.iso
1.51 GB / 3.34 GB [===============================>--------------------------------------] 45.13 % 5m6s
Killed
~12 GiB used while adding the file. Halfway through I needed to kill ipfs because I was running out of space.
Somewhat related: is there a way to clean up the partially added stuff after killing ipfs add?
UPDATE: it seems that ipfs repo gc helps a bit with the cleanup, but does not recover all the space.
A couple of extra notes about the disk usage: after ipfs init && ipfs daemon, an ipfs add, and running ipfs repo gc, the file is added correctly using only the required disk space:

[1029][rubiojr@octox] du -sh ~/.go-ipfs/datastore/
3,4G /home/rubiojr/.go-ipfs/datastore/
Anyway, I've heard you guys are working on a new repo backend, so I just added this for the sake of completeness.
@rubiojr the disk space is being consumed by the eventlogs, which is on my short list for removing from ipfs. check ~/.go-ipfs/logs
@whyrusleeping not in this case apparently:
[~/.go-ipfs]
[1106][rubiojr@octox] du -h --max-depth 1
12G ./datastore
5,8M ./logs
12G .
[~/.go-ipfs/datastore]
[1109][rubiojr@octox] ls -latrS *.ldb|wc -l
6280
[~/.go-ipfs/datastore]
[1112][rubiojr@octox] ls -latrSh *.ldb|tail -n5
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr 8 22:59 000650.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr 8 23:00 002678.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr 8 23:02 005705.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr 8 23:01 004332.ldb
-rw-r--r-- 1 rubiojr rubiojr 3,8M abr 8 23:00 001662.ldb
6280 ldb files averaging 3.8 MB each. This is while adding a 1.7 GiB file and killing the process before ipfs add finishes. This was the first ipfs add after running ipfs daemon -init.
The leveldb files did not average 3.8 MiB each, some of them were smaller in fact. My bad.
wow. That sucks. But should be fixed quite soon, i just finished the migration tool to move block storage out of leveldb.
since this is a highly requested feature, can we get some proposals of how it would work with the present fsrepo?
My proposal would be a shallow repo that acts like an index of torrent files, where it thinks it can serve a block until it tries to open the file from the underlying file system.
I'm not sure how to manage chunking. Saving (hash)->(file path, offset) should be fine, I guess?
Saving (hash)->(file path, offset) should be fine
Something like (hash)->(file path, mtime, offset) would help checking if the file was changed.
something like (hash)->(path, offset, length) is what we would need, and rehash the data upon read to ensure the hash matches.
piecing it together with the repo is trickier. maybe it can be a special datastore that stores this index info in the flatfs, but delegates looking up the blocks on disk. something like
// in shallowfs
// stores things only under dataRoot. dataRoot could be `/`.
// stores paths, offsets, and a hash in metadataDS.
func New(dataRoot string, metadataDS ds.Datastore) { ... }
// use
fds := flatfs.New(...)
sfs := shallowfs.New("/", fds)
would be cool if linux supported symlinks to segments of a file...
Perhaps separating out the indexing operation (updating the hash->file-segment map) from actually adding files to the repo might work? The indexing could be done mostly separately from ipfs, and you'd be able to manually control what needs to be (re-)indexed. The blockstore then checks if the block has been indexed already (or passes through to the regular datastore otherwise).
Copy-on-write filesystems with native deduplication can be relevant here. For example https://btrfs.wiki.kernel.org
Copying files adds only a little metadata; data extents are shared. I can use it with big torrents: I can edit files while still being a good citizen and seeding the originals. Additional disk space usage is only the size of the edits.
symlinks to segments of a file are just files sharing extents.
On adding a file that is already in the datastore you could trigger deduplication and save some space!
I am sure there are a lot of other more or less obvious ideas, and some crazier ones, like using union mounts (unionfs/aufs) with ipfs as a read-only fs and a read-write fs mounted over it for network live distro installation, or going together with the other VM stuff floating around here.
@striepan indeed! this all sounds good.
If anyone wants to look into making an fs-repo implementation patch, this could come sooner. (right now this is lower prio than other important protocol things.)
I agree with @striepan; I even believe that copy-on-write filesystems are the solution to this problem. What needs to be done in ipfs, though, is to make sure that the right modern API (kernel ioctl) is used for the copy to be efficient. Probably go-ipfs just uses the native Go API for copying, so we should eventually benefit from Go supporting recent Linux kernels, right? Can anybody here give a definite status report on that?
What would happen on Windows? (Are there any copy-on-write filesystems on Windows?)
I think Windows would be working as it is now.
What would BitTorrent clients do in such situations? Do they check only modification time and filesize of shared files after they are restarted?
@Mithgol Depending on the client, but most of them have a session-db with binary files describing active torrents and their state. These are just descriptors and much smaller, so there is no duplication. Some torrent clients take a while to start up if you have many torrent files active, suggesting a quick check of meta data for every file, but no complete hashing.
I can't speak for all clients, but i believe transmission does a "quick" check (lstat, so size, mtime...) and rehashes if inconsistencies are found. the user can also re-trigger a rehash.
it is something like this i had in mind here, originally.
in my humble opinion, delegating this to a deduplicating filesystem just avoids the problem altogether: it doesn't fix the problem in ipfs, and assumes it will be fixed in the underlying filesystem.
there is a cost to such deduplication in the filesystem: it means checksums and so a performance and complexity cost which, in BTRFS, currently implies some reliability concerns, at least as far as production systems are concerned for me.
i hope and believe that ipfs can find better solutions for itself.
in my humble opinion, delegating this to a deduplicating filesystem just avoids the problem altogether: it doesn't fix the problem in ipfs, and assumes it will be fixed in the underlying filesystem.
I feel like this is important given that IPFS may be run on platforms that do not have de-duplicating file systems, or on hardware that can't maintain a de-duplication feature due to hardware constraints (low memory, etc.). I feel it's likely that a system with storage constraints (one that would benefit from this feature) would probably also have memory limits, but that's just a broad assumption; I'm thinking of SoC systems.
*nix systems have ZFS and BTRFS but to my knowledge Windows doesn't have any kind of standard/stable filesystem that has copy-on-write support, even ReFS does not support de-duplication itself but relies on external tools to handle it. I think relying on third party filesystems via Dokan may not be the best option either if IPFS can just handle this directly and on all platforms it runs on.
It would be unfortunate to reimplement a feature like this if the underlying filesystem is handling it already, but IPFS is in itself a filesystem, so it should probably also have such a feature.
Perhaps it's worth looking into Direct Connect clients and how they handle their hash lists, I believe it's similar to this
transmission does a "quick" check (lstat, so size, mtime...) and rehashes if inconsistencies are found
I know of one that (optionally) stores metadata inside an NTFS stream, so things like the hash, last hash time, size, mod time, etc. are stored with the file and can be read even if the file is moved around or modified. It does not modify the file itself, though, so it doesn't compromise integrity and doesn't rely on file names/paths. It's useful for keeping track of files even if the user messes around with them a lot: if the file is in /a/ and gets hashed, then moves from /a/ to /b/, it doesn't have to be rehashed; the client checks to see if that data stream is there, does checks on it, and knows the file hasn't changed, so it doesn't have to process it again, which can save a lot of time and processing depending on file size and overall file count. Likewise, if the file remains in the same path with the same name but was modified, there will be a modtime discrepancy, so it will trigger a rehash.
I don't know if other filesystems have a similar method of appending sibling data like that, or if there's a more portable solution that does the same thing but I feel like it's worth mentioning.
I might take a stab at this by creating a repo or datastore that simply uses files already in the filesystem. To me this seems an important step toward getting a large amount of data into IPFS. Disk space is cheap but not free.
The best docs I can find are here https://github.com/ipfs/specs/tree/master/repo, if there is anything better please let me know.
Is there a way to have multiple repos or datastores with a single IPFS daemon? I am thinking one could be designated as the primary, where all cached data goes, and all the others secondary, requiring explicit actions to move data in or delete data from them. For the purposes of this issue I think a read-only repo or datastore will be sufficient. It can read a set of directories (or files) to make available via a config file, and can reread the file from time to time to pick up changes.
I'd like to work on that too.
You can see here how it is currently organized. There is:
On top of that Blockstore, a DagService allows working at a higher level on the MerkleDag.
We could create a chain of responsibility that fetches blocks either from the regular Blockstore or from a new blockstore that would provide blocks directly from regular on-disk files.
The meta-data that needs to be kept around could be stored either in the Datastore as key/value pairs or as a DAG as the Pinner does.
In issue #2053, jbenet said "by extending IPLD to allow just raw data edge nodes, we can make #875 easy to implement". What is the status of using raw data edge nodes?
To avoid duplicating files, two features are necessary: publishing files to IPFS without duplication, and saving files from IPFS without duplication.
The second of those features will be hard to achieve, as files might be sharded in many different ways, which means that some smart system of storing the wrapping data and a pointer to the raw data would have to be created.
Also, I don't see those features being feasible on non-CoW filesystems.
Is there an issue to discuss extending IPLD to allow just raw data edge nodes? This would be a basic part of the implementation.
I am still trying to understand IPFS internals. However, I believe that by using hard links it should be possible to both add and retrieve files stored on the same file system without duplication. It will require a new datastore implementation that stores the Data component of the unixfs_pb.Data object separately.
The idea is that when adding a file the path of the file is passed down to the Datastore instead of (or possible along with) the files contents. The datastore will change the permission on the file to be readonly to prevent accidental modification, and then create a hard link instead of copying the file contents. A similar process will be used when retrieving files.
Even though the permissions of the file will be set to readonly the file could still in theory be modified, but I assume there will be some sort of validation done at a higher level; after all there is nothing stopping anybody from modifying the file in the repo right now.
Creating hard links will not be the default mode of operation so it will be assumed that when this option is chosen the user knows what they are doing.
Although more fragile once the data content is stored as a separate file it should even be possible to use symbolic links. If symbolic links are used the links will always be to a file within the repo and the object should likely be pinned.
It should be possible to use this method even for large objects that have been split into several separate objects, by pointing to part of a file. The large file is hard linked as normal, and then several separate objects are created that point to a part of the hard-linked file. Some extra bookkeeping will likely be needed to know when it is safe to unlink the hard-linked copy in the repo, but this should still be possible. Naturally there may be some unnecessary duplication within the repo, but it will save space overall on the file system.
Unless some of the core developers think this is a really bad idea, I am going to try to implement this to see how feasible it really is with the current code.
In my case, hard links to files on the same volume won't work. I need the option to not duplicate files added to IPFS precisely because I have a lot of large files across many connected volumes. In any case, I don't understand your idea to use either Unix symbolic or hard links. How can it link to just a 250KB slice of another file? I think you need to put a stub for the 250K block in the repo that has the path of the original large file plus a length and offset.
@jefft0
How can it link to just a 250KB slice of another file? I think you need to put a stub for the 250K block in the repo that has the path of the original large file plus a length and offset.
That is basically what I intend to do. Sorry if that was not clear.
You simply should not duplicate the data in the database, it's that simple. You have to work with the real data, without duplicating it in the database. Using hardlinks is a bad idea.
What do you think of jbenet's issue #2053 suggestion to use "raw data edge nodes" so that the 250K block is just the slice? Otherwise, when serving the block you would have to prepend and append CBOR bytes to make it an IPLD object. (I made a pull request.)
In my opinion, 'write' requirement for the large file filesystem violates principle of least authority, be it a hardlink/reflink or git-annex-style symlinks. Also, there is absolutely no guarantee that .ipfs directory and file storage are going to be on the same filesystem. And more likely than not it's going to be a remote (Samba, NFS, Ceph, GlusterFS, Lustre, what else?), read-only filesystem (not files owned by root with 644 permissions).
And if it's going to be done properly, it has to be read-only from the serving side (for example, 'read only = yes' on a Samba share, or an R/O snapshot on zfs/btrfs); otherwise catastrophic data loss will happen in case of a vulnerability in the ipfs daemon, or even a simple bugged rm -rf {dir}/{path}/* line in an ipfs-unrelated script.
I agree it is difficult and a security concern to write files fetched from IPFS without duplicating. My only use case is to publish: some option like ipfs add --no-copy BigFile.mp4. When someone else gets the file, it's OK with me if IPFS stores it in chunks in their repo.
@kevina Go ahead and give it a shot, if you need any help or have any questions definitely let me know (can always message me in irc).
Some things that might help out:
With our current setup, you're going to have to do the adds without the daemon online. When the daemon is running, the cli streams file data to the daemon for chunking and dag generation; to retain the files on disk without copying, you will need to change the importer to use a patched dagservice/blockstore that allows you to store filepath:offset pairs instead of blocks of data.
@Kubuxu
Second one of those features will be hard to achieve as files might be sharded in many different ways which mean that some smart system of storing the wrapping data and pointer to raw data would have to be created.
Let's elaborate it further.
Imagine that the first of these features (publishing files to IPFS without duplicating) is implemented in the form ipfs add --no-copy filename.ext.
Then the second of these features (saving files from IPFS without duplicating) could probably be based on the first. It would be basically the same as running three commands in the following order:
(Or would it not?)
@whyrusleeping can you elaborate some on this:
With our current setup, youre going to have to do the adds without the daemon online.
I am having a hard time seeing how this will simplify things.
@Mithgol not really: ipfs cat $HASH | ipfs add can give you a different result hash than the input. It is due to changes in encoding, sharding and so on.
This will be especially visible when IPLD is implemented, as there will be two encodings active in the network.
@Kubuxu
Are you implying that I can't ipfs get a file, then clear the cache, then ipfs add it and deliver it? Because of the newer IPFS hash (that might be different), my file might not be discovered by the people using an older IPFS hash and the previously created hyperlinks to it?
If that is so, then that should be seen (and treated) literally as a part of the issue currently titled “avoid duplicating files added to ipfs”.
If two files are the same (content-wise), then their distribution and storage should be united in IPFS.
Otherwise storage efforts and distribution efforts are doubled and wasted. Also, elements of the (so-called) Permanent Web are suddenly not really permanent: when they are lost, they're designed never to be found again, because even if someone somewhere discovers such a lost file in an offline archive and decides to upload it to the Permanent Web, it is likely to yield a different IPFS hash, and thus an old hyperlink (which references the original IPFS hash) is still doomed to remain broken forever.
If encodings, shardings, IPLD and maybe a dozen other inner principles make it inevitable for the same files to have different IPFS hashes, then maybe yet another DHT should be added to the system: it would map IPFS hashes, for example, to cryptographic hashes (and vice versa), and then some subsystem would be able to deduplicate the distribution and storage of the same files and would allow lost files to reappear in the network after uploading.
However, while this problem should be seen (and treated) literally as a part of the issue currently titled “avoid duplicating files added to ipfs”, this issue is still about deduplicating on disk. It probably is not wise to broaden its discussion here.
I've decided to open yet another issue (ipfs/notes#126) to discuss the advantages (or maybe the necessity) of each file having only one address determined by its content.
@kevina you will need to perform the adds without the daemon running because the daemon and the client aren't necessarily on the same machine. If I try to 'zero-copy add' a file client-side and tell the daemon about it, the daemon has no idea what file I'm talking about, and has no reasonable way to reference that file.
Just FYI: I am making good progress on this. The first implementation will basically implement the --no-copy option suggested by @jefft0.
You can find my code at https://github.com/kevina/go-ipfs/tree/issue-875-wip. Expect lots of forced updates on this branch.
Sorry for all this noise. It seems GitHub keeps commits around forever, even after a forced update. I will avoid using issue mentions in most of the commits to avoid this problem. :)
The code is now available at https://github.com/ipfs-filestore/go-ipfs/tree/kevina/filestore and is being discussed in pull request #2634.
Because this is a major change that might be too big for a single pull request I decided to maintain this as a separate fork while I work through the API issues with whyrusleeping.
I have created a README for the new filestore, available here: https://github.com/ipfs-filestore/go-ipfs/blob/kevina/filestore/filestore/README.md. Some notes on my fork are available here: https://github.com/ipfs-filestore/go-ipfs/wiki.
At this point I could use testers.
it would be very useful to have files that are passed through ipfs add not copied into the datastore. for example here, i added a 3.2GB file, which meant the disk usage for that file now doubled! Basically, it would be nice if the space usage for adding files would be O(1) instead of O(n), where n is the file size...