Using ipfs for decentralized file synchronization

sahib commented 9 years ago

Hi,

we're two M.Sc. CS students from Germany. For our master's thesis project (and more!) we'd like to write a decentralized file synchronization tool. In the course of our research, we luckily found IPFS, which might be an ideal base to act in the background, distributing and storing files. After some days of looking around, there are still some questions left, and we thought they would be better answered by someone with more insight.

For context, here's a very rough description of our planned project. If you want to skip right to the questions, see below for details. We plan to write a decentralized software alternative to centralized file synchronization tools like Dropbox. A bit like a mix of syncthing (easy to use, but still has a central keyserver), git-annex (cool, but too complicated/manual for daily use) and plain git (robust and reasonably easy/secure).

Our architecture would look a bit like the following: every ,,repository'' has a hidden directory with config files and an IPFS datastore. The user can view the repository by e.g. mounting a FUSE filesystem somewhere. This filesystem would fetch the data from IPFS (similar to what the ipfs fuse module already does) and add files that were moved into it to IPFS. We would basically do the housekeeping of those hashes and provide git-like convenience for them. Some special ,,repository nodes'' might also just pin old files, so the node can serve as a ,,backup/archive'' node for others. Other repository node types might exist too. The language of choice is Go, which is another big plus for IPFS. A possible --help of the tool might look like this:

https://gist.github.com/sahib/a9281b30649de4e62afb

Our remaining questions follow. Feel free to point us to the documentation or source if we missed something obvious:

Is it possible to run more than one IPFS repository on the same host? The commandline tool does not seem to support that (yet?), but are there any technical issues on running more than one IPFS node on a computer (with different datastores and configs)?
A nice-to-have feature would be partial/resumable downloads. If you transfer big files over a crappy connection, it would be nice to resume the download at the point where the connection got lost (like rsync does). We didn't find a definite answer here, feel free to point us to the docs. (I guess you could do this yourself by splitting the files into chunks and feeding only those to IPFS. Our software would then remember the chunk order to reconstruct a file. Not a pretty solution though.)
Is the data transferred to other peers in an encrypted way? If so, what's the used encryption standard and architecture? Again, a pointer to the relevant source/docs is totally fine.
In our software, file synchronization should only happen between trusted peers. From our understanding, by using IPFS you can always get a file if you have its hash and are connected to the swarm. How would it be possible to create a sub-net of IPFS peers, where only trusted peers are allowed to communicate with each other?
The ,,trust'' part could be done by using XMPP/OTR in our software and verify the identity of a peer over this sidechannel, before allowing a connection to the IPFS swarm. It is very important for us to ensure that no unauthorized parties are allowed to access the data (or metadata) stored in a repository.

To sum up, the general idea is to build a tool/structure on top of IPFS which is secure enough for ,,business'' demands but also as easy to use as e.g. Dropbox.

Thanks in advance for any answers. Please tell us if the above-mentioned might be a bad idea or what you probably would do in a different way.

Best regards, Christopher & Christoph

cloutier commented 9 years ago

IPFS already uses chunks by default. For example, this picture is stored like this underneath:
You don't need as many nodes as you have repositories. You could use something like one IPNS hash (notice the N) per repo.
I think encrypting everything (with only trusted peers having the key) would be simpler than a subnet.
Have you taken a look at BitTorrent Sync? Their ability to seed for others without knowing the content is interesting. This feature could be a way to make a business around a free/libre software project.

I'm really not the most knowledgeable about IPFS around here, but I hope I helped a bit :smile:

sahib commented 9 years ago

Hello @cloutier,

thanks for your answers, they were indeed helpful.

Ah, I worded my question a bit ambiguously. I was aware that files actually get stored as blocks in the merkledag, but was not sure whether transmitted blocks were cached even if the whole "file" was not transmitted (i.e. for rsync-like resume-download functionality).
Thanks that's helpful, but still, is it possible to use several repositories? I think, that would look a bit cleaner in my usecase, since all data would reside in a single folder.
I was afraid of that answer :smile:. This would require us to decrypt every file on reading or alternatively store files twice. Probably a mix of both (caching often used files) would be a possible way.
No, I did not look at that yet. If we encrypt everything and only exchange those files via IPFS, that would apply to us too, wouldn't it? Maybe I missed your point here.

ion1 commented 9 years ago

Your project sounds promising.

By the time go-ipfs supports on-disk encryption, it will also support multiple IPNS keys.

go-ipfs supports chunking files by a rolling hash with --chunker=rabin and that will become the default in the future. This results in automatic rsync-like behavior when downloading a file while already having some of its contents. At the moment, the default chunking only results in efficient updates if they only consist of additions or changes to the end of the file or just happen to match the chunk boundaries in the middle of a file.
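To make the rsync-like effect concrete, here is a minimal Go sketch of content-defined chunking. It is not the Rabin chunker from go-ipfs: it assumes a much simpler rolling sum over a 16-byte window, and the names `chunk`, `newChunks`, `window` and `mask` are made up for this illustration. The point it tries to show is that chunk boundaries depend only on nearby bytes, so an insertion at the front of a file changes only the first few chunks.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	window = 16   // number of trailing bytes covered by the rolling sum
	mask   = 0x3f // cut when the low 6 bits of the sum are zero
)

// chunk splits data at content-defined boundaries. A boundary is placed
// after byte i whenever the sum of the last `window` bytes has its low
// bits cleared, so the decision depends only on local content.
func chunk(data []byte) [][]byte {
	var chunks [][]byte
	var sum uint32
	start := 0
	for i, b := range data {
		sum += uint32(b)
		if i >= window {
			sum -= uint32(data[i-window]) // slide the window forward
		}
		if sum&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// newChunks counts the chunks of b whose hash does not occur among the
// chunks of a, i.e. the chunks a peer holding a would still need.
func newChunks(a, b []byte) (fresh, total int) {
	seen := map[[32]byte]bool{}
	for _, c := range chunk(a) {
		seen[sha256.Sum256(c)] = true
	}
	for _, c := range chunk(b) {
		total++
		if !seen[sha256.Sum256(c)] {
			fresh++
		}
	}
	return fresh, total
}

func main() {
	rng := rand.New(rand.NewSource(1))
	old := make([]byte, 4096)
	rng.Read(old)
	edited := append([]byte{'X'}, old...) // insert one byte at the front
	fresh, total := newChunks(old, edited)
	fmt.Printf("%d of %d chunks must be transferred after the edit\n", fresh, total)
}
```

With fixed-size chunks the same one-byte insertion would shift every boundary and change every chunk, which is the limitation of the current default described above.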

If you interrupt a download, everything downloaded so far will be cached. You can resume efficiently before it is garbage-collected.

I think I have seen @jbenet say go-ipfs is going to support only letting certain peers download certain objects from you, but my memory is hazy on that. It would be an additional protection in addition to encryption (in case the encryption turns out to be flawed or something).

The XMPP/OTR functionality could be useful for a number of projects. I’m envisioning a tool that lets you send and receive a message saying “I am <_output of ipfs id_>” signed with your node’s private key using XMPP, email, QR codes on mobile displays or whatever, and use existing ways to contact friends to establish cryptographic trust between IPFS nodes. Of course, the standard problem of MitM attacks applies here. OTR protects against it with the socialist millionaire protocol.
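A rough Go sketch of the signed "I am ..." message might look like the following. It assumes Ed25519 keys from the standard library as a stand-in for the node's real key pair, and the peer ID string is a hypothetical placeholder; `newIdentity`, `announce` and `verify` are names invented here, not part of go-ipfs.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// newIdentity creates a fresh key pair standing in for a node identity.
func newIdentity() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// announce builds the statement "I am <peerID>" and signs it.
func announce(priv ed25519.PrivateKey, peerID string) (msg, sig []byte) {
	msg = []byte("I am " + peerID)
	return msg, ed25519.Sign(priv, msg)
}

// verify reports whether sig is a valid signature of msg under pub.
func verify(pub ed25519.PublicKey, msg, sig []byte) bool {
	return ed25519.Verify(pub, msg, sig)
}

func main() {
	pub, priv := newIdentity()
	msg, sig := announce(priv, "QmHypotheticalPeerID") // placeholder ID
	fmt.Println("genuine message verifies:", verify(pub, msg, sig))

	msg[0] ^= 1 // flip one bit to simulate tampering in transit
	fmt.Println("tampered message verifies:", verify(pub, msg, sig))
}
```

The signature only proves that the sender holds the private key. Whether that key really belongs to your friend still has to be checked over the side channel, which is where the MitM concern comes in.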

sahib commented 9 years ago

Thanks @ion1, that was helpful too. Especially the prospect of rolling hashes sounds good. I think we're going with ipfs now and see what we can do with it.

All of the above questions are resolved, I guess. The one about "multiple ipfs repos" was obvious enough to figure out myself:

$ IPFS_PATH=/tmp/repo ipfs init    # Create a repo at /tmp/repo
$ IPFS_PATH=/tmp/repo ipfs daemon  # Works fine too apparently.
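Since the tool itself is planned in Go, the same trick can be sketched from Go by setting IPFS_PATH per child process. This is only an illustration: `ipfsCommand` and the repo paths are hypothetical, the commands are prepared but not started, and it assumes an `ipfs` binary on the PATH if you do run them.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// ipfsCommand prepares (but does not start) an ipfs invocation that is
// bound to the repository at repoPath via the IPFS_PATH variable.
func ipfsCommand(repoPath string, args ...string) *exec.Cmd {
	cmd := exec.Command("ipfs", args...)
	// Appended last on purpose: os/exec uses the last value when a key
	// appears more than once in Env, so this overrides an inherited one.
	cmd.Env = append(os.Environ(), "IPFS_PATH="+repoPath)
	return cmd
}

func main() {
	for _, repo := range []string{"/tmp/repo-a", "/tmp/repo-b"} {
		cmd := ipfsCommand(repo, "init")
		fmt.Println(cmd.Env[len(cmd.Env)-1], cmd.Args)
		// cmd.Run() would actually create the repository.
	}
}
```

Note that two daemons running at the same time also need different API, gateway and swarm ports in their respective configs, otherwise the second one cannot bind its listeners.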

Closing this now, we'll open a new, more concrete issue over at go-ipfs if we run into problems.

Thanks again.

jbenet commented 9 years ago

Hello @sahib (Christopher and Christoph),

Our architecture would look a bit like the following, every ,,repository'' has a hidden directory with config files and an IPFS datastore. The user can view the repository by e.g. mounting a FUSE filesystem somewhere. This filesystem would fetch the data from IPFS (similar to what the ipfs fuse module already does) and add files that were moved into it to IPFS.

You might consider keeping the config files themselves in IPFS. After dev 0.4.0 we have distinct data stores based on whether content should be served or not. Later on, we'll have capability-based access.

We would basically do the housekeeping of those hashes and provide git-like convenience for them. Some special ,,repository nodes'' might also just pin old files, so the node can serve as a ,,backup/archive'' node for others. Other repository node types might exist too.

Yep. We try to enable these types to be made by composing commands, but some won't be fully possible natively (i.e. without another protocol) until we have pub/sub in. Once that happens, should be able to make all these as simple scripts.

The language of choice is Go, which is another big plus for IPFS. A possible --help of the tool might look like this:

https://gist.github.com/sahib/a9281b30649de4e62afb

So much of this looks like stuff we want to have as part of go-ipfs, some planned and specced out, some not yet -- you should consider making your implementation into something we can merge into go-ipfs itself.

Is it possible to run more than one IPFS repository on the same host? The commandline tool does not seem to support that (yet?), but are there any technical issues on running more than one IPFS node on a computer (with different datastores and configs)?

Yes, try setting the $IPFS_PATH variable.

We should definitely improve the docs on this. Any suggestions on where?

A nice-to-have feature would be partial/resumable downloads. If you transfer big files over a crappy connection, it would be nice to resume the download at the point where the connection got lost (like rsync does). We didn't find a definite answer here, feel free to point us to the docs. (I guess you could do this yourself by splitting the files into chunks and feeding only those to IPFS. Our software would then remember the chunk order to reconstruct a file. Not a pretty solution though.)

As @ion1 mentioned, IPFS already chunks everything for you -- using a variety of chunkers, which you can configure or even implement yourself.

So yes, you get "resumable downloads" at the chunk level.
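As a rough illustration of what "resumable at the chunk level" means, here is a small Go sketch. It does not use any IPFS API: `blockStore`, `fetchFile`, `hashOf` and `errLost` are made-up names, the local cache is a plain map, and the network is simulated by a callback that fails after a few blocks. Blocks are addressed by their SHA-256 hash, so a retry only asks for what is still missing.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

type hash [32]byte

// blockStore is a local cache of blocks keyed by their content hash.
type blockStore map[hash][]byte

var errLost = errors.New("connection lost")

func hashOf(b []byte) hash { return hash(sha256.Sum256(b)) }

// fetchFile makes sure every block in want is present in local and calls
// fetch only for blocks that are missing. It returns how many blocks
// were actually transferred in this attempt.
func fetchFile(local blockStore, want []hash, fetch func(hash) ([]byte, error)) (int, error) {
	fetched := 0
	for _, h := range want {
		if _, ok := local[h]; ok {
			continue // already cached by an earlier, interrupted attempt
		}
		data, err := fetch(h)
		if err != nil {
			return fetched, err
		}
		if hashOf(data) != h {
			return fetched, fmt.Errorf("block %x failed verification", h[:4])
		}
		local[h] = data
		fetched++
	}
	return fetched, nil
}

func main() {
	remote := blockStore{}
	var want []hash
	for i := 0; i < 5; i++ {
		b := []byte(fmt.Sprintf("block %d", i))
		remote[hashOf(b)] = b
		want = append(want, hashOf(b))
	}

	budget := 2 // the simulated connection drops after two blocks
	fetch := func(h hash) ([]byte, error) {
		if budget == 0 {
			return nil, errLost
		}
		budget--
		return remote[h], nil
	}

	local := blockStore{}
	n, err := fetchFile(local, want, fetch)
	fmt.Println("first attempt:", n, "blocks, error:", err)

	budget = 10 // connection is back
	n, err = fetchFile(local, want, fetch)
	fmt.Println("after resume:", n, "blocks, error:", err)
}
```

The per-attempt counter is also roughly the kind of number a torrent-like progress view would need.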

It would be good to have a nice torrent-like interface to visualize some of this progress.

Is the data transferred to other peers in an encrypted way? If so, what's the used encryption standard and architecture? Again, a pointer to the relevant source/docs is totally fine.

All comm is p2p encrypted (security warning: not audited yet) using a TLS-like protocol. It will be swapped for {TLS, CurveCP, MinimaLT} in the future.

All data can be encrypted at rest. For now, that means encrypting it before it is added to ipfs. In the future, ipfs will have facilities to do this directly.
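Encrypting before adding could be sketched in Go roughly as follows. This assumes AES-GCM from the standard library and a key shared among the trusted peers; `seal`, `open` and `newGCM` are hypothetical helper names, and how the sealed bytes are then handed to ipfs is left out.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD; key must be 16, 24 or 32 bytes long.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce to the result,
// so the output is self-contained and can be stored as one blob.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the data was modified or the key is wrong.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, body := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, body, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := seal(key, []byte("contents of a synced file"))
	if err != nil {
		panic(err)
	}
	plain, err := open(key, sealed)
	fmt.Printf("%d sealed bytes, decrypted: %q, error: %v\n", len(sealed), plain, err)
}
```

One trade-off of the random nonce: sealing the same file twice gives different bytes and therefore different IPFS hashes, so content-level deduplication across encrypted copies is lost.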

In our software, file synchronization should only happen between trusted peers. From our understanding, by using IPFS you can always get a file if you have its hash and are connected to the swarm. How would it be possible to create a sub-net of IPFS peers, where only trusted peers are allowed to communicate with each other?

Follow this discussion: https://github.com/ipfs/go-ipfs/issues/1633

It's actually quite easy to implement what you want -- you just need to add a Conn wrapper that XORs everything using a shared key and a nonce.
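A minimal Go sketch of such a Conn wrapper is shown below. It is not go-ipfs code: it assumes the XOR keystream comes from AES-CTR keyed with the shared secret, with each side sending a random 16-byte nonce before its first payload. The names `xorConn`, `newXORConn` and `roundTrip` are invented for this example, and the wrapper gives confidentiality only (no authentication, not safe for concurrent writers).

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
	"net"
)

// xorConn wraps a net.Conn and XORs all traffic with a keystream derived
// from a pre-shared key and a per-direction random nonce.
type xorConn struct {
	net.Conn
	block cipher.Block
	enc   cipher.Stream // keystream for outgoing bytes
	dec   cipher.Stream // keystream for incoming bytes
}

func newXORConn(c net.Conn, key []byte) (*xorConn, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return &xorConn{Conn: c, block: block}, nil
}

func (c *xorConn) Write(p []byte) (int, error) {
	if c.enc == nil { // first write: pick a nonce and send it in the clear
		nonce := make([]byte, aes.BlockSize)
		if _, err := rand.Read(nonce); err != nil {
			return 0, err
		}
		if _, err := c.Conn.Write(nonce); err != nil {
			return 0, err
		}
		c.enc = cipher.NewCTR(c.block, nonce)
	}
	buf := make([]byte, len(p))
	c.enc.XORKeyStream(buf, p)
	return c.Conn.Write(buf)
}

func (c *xorConn) Read(p []byte) (int, error) {
	if c.dec == nil { // first read: learn the peer's nonce
		nonce := make([]byte, aes.BlockSize)
		if _, err := io.ReadFull(c.Conn, nonce); err != nil {
			return 0, err
		}
		c.dec = cipher.NewCTR(c.block, nonce)
	}
	n, err := c.Conn.Read(p)
	c.dec.XORKeyStream(p[:n], p[:n])
	return n, err
}

// roundTrip sends msg through a pair of wrapped in-memory connections.
func roundTrip(key, msg []byte) ([]byte, error) {
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()
	sender, err := newXORConn(a, key)
	if err != nil {
		return nil, err
	}
	receiver, err := newXORConn(b, key)
	if err != nil {
		return nil, err
	}
	go sender.Write(msg) // net.Pipe is unbuffered, so write concurrently
	out := make([]byte, len(msg))
	_, err = io.ReadFull(receiver, out)
	return out, err
}

func main() {
	key := make([]byte, 32) // all-zero demo key; use a real shared secret
	out, err := roundTrip(key, []byte("hello swarm"))
	fmt.Printf("%q %v\n", out, err)
}
```

A peer without the shared key can still open a connection, but everything it reads or sends decodes to noise, which is what keeps it out of the sub-net.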

The ,,trust'' part could be done by using XMPP/OTR in our software and verify the identity of a peer over this sidechannel, before allowing a connection to the IPFS swarm. It is very important for us to ensure that no unauthorized parties are allowed to access the data (or metadata) stored in a repository.

Yep that works too. You may not need the XMPP overhead-- but you might want to use XMPP as a libp2p discovery protocol. (cc @diasdavid)

To sum up, the general idea is to build a tool/structure on top of IPFS which is secure enough for ,,business'' demands but also as easy to use as e.g. Dropbox.

Yeah we want to get to this too and would love to support your efforts. I'd request that you consider contributing directly to go-ipfs since much of what you want we want too.

And thanks to @cloutier and @ion1 for answering these and more.

sahib commented 9 years ago

Hi @jbenet, thanks for the detailed answer.

You might consider keeping the config files themselves in IPFS. after dev 0.4.0 we have distinct data stores based on whether content should be served or not. Later on, we'll have capability based access.

The config file itself should probably stay outside, since it should be editable by the user in case something goes wrong - also in there are the connection details to the IPFS datastore.

Yep. We try to enable these types to be made by composing commands, but some won't be fully possible natively (i.e. without another protocol) until we have pub/sub in. Once that happens, should be able to make all these as simple scripts.

Pub/Sub would work for us over presence messages over XMPP (See also below).

As @ion1 mentioned, IPFS already chunks everything for you -- using a variety of chunkers, which you can configure or even implement yourself. So yes, you get "resumable downloads" at the chunk level. It would be good to have a nice torrent-like interface to visualize some of this progress.

Awesome! I agree, an API to get progress statistics would be nice.

All comm is p2p encrypted (security warning: not audited yet) using a TLS-like protocol. it will be swapped for {TLS, CurveCP, MinimaLT} in the future. All data can be encrypted at rest. for now, before being added to ipfs. in the future, ipfs will have facilities to do this directly.

That's reassuring to hear! We will go for doing the encryption on top of IPFS for now (basically acting like an encfs layer over IPFS).

Yep that works too. You may not need the XMPP overhead-- but you might want to use XMPP as a libp2p discovery protocol. (cc @diasdavid)

XMPP might give us a few nice extras like a builtin roster (i.e. a buddy list), Pub/Sub and easy authentication. And last but not least, an existing infrastructure. The drawback is the ugly XML at its heart, but that's just a small drop of bitterness. We'll look into libp2p for discovery, that would be nice for brig discover.

Yeah we want to get to this too and would love to support your efforts. I'd request that you consider contributing directly to go-ipfs since much of what you want we want too.

We would be very happy to contribute back to IPFS. For now, we will develop in our own codebase for a variety of reasons: since it's a master's thesis, a self-contained piece of software would be desirable. Apart from that, some of our concepts might not be fully compatible (XMPP, focus on physical files only, auto sync policies...).

Still, we will try to make shareable things like the encryption layer, commit handling and fuse as general as possible, so getting the features back to IPFS if necessary will be easy. From our side we'll also try to favour libraries that ipfs already uses (like multihash).

jbenet commented 9 years ago

Great! All sounds good :)

rschulman commented 8 years ago

If XMPP isn't quite right, consider Matrix as an alternative. Lighter weight (just JSON over https) and it might do what you're looking for.

Wizek commented 7 years ago

Hi @sahib,

Have you made progress with your IPFS-based syncing solution? And if so, would you be willing to share an update?

At any rate, I'm looking around if such solutions may exist, so if you or anyone else can point me towards a similar or relevant project I'd be curious to hear about it.

sahib commented 7 years ago

Hello @Wizek,

sorry for the late reply.

Have you made progress with your IPFS-based syncing solution? And if so, would you be willing to share an update?

Yeah, the current implementation is a tool written in Go that's similar to what was proposed above. There are some notable differences though:

Focus on security, but still supposed to be easily usable.
No jabber anymore, current implementation directly uses IPFS to communicate between nodes.
Implements git like behaviour on top of IPFS, but strictly separates metadata (stored in a BoltDB) from the actual data (stored in IPFS).

In its current stages it is able to add files, sync them (if you are very lucky) and serve a FUSE filesystem (and quite a bit more).

The source can be found here:

https://github.com/disorganizer/brig

It's in a very early stage of development, and development has currently slowed down, since we did not get the funding we had hoped for. In the end we only had time to develop it during our master's theses and I, at least, was forced to take a normal day job to pay my bills. So, in short: brig is not dead, it will just move forward very slowly. The two master's theses can be found here (in German though, sorry):

https://disorganizer.github.io/brig-thesis/brig/thesis.pdf (architecture)
https://github.com/disorganizer/brig-thesis/raw/master/security/pdf/thesis-final.pdf (security)

Even if it doesn't get too much attention it might be worth looking at and some bits like the compression and encryption layer might be even usable for IPFS itself.

At any rate, I'm looking around if such solutions may exist, so if you or anyone else can point me towards a similar or relevant project I'd be curious to hear about it

I would be interested to hear about that too, but the most similar one (albeit not IPFS based) is most likely bazil. It's still not very developed though.

Note: I will likely not reply very often, since in addition to my day job I'm also moving.

Regards, Chris

codearoo commented 7 years ago

I am also hunting and hoping for a solution like this. The Brazil project sounds perfect. That functionality (using IPFS I guess) would rock!

sahib commented 7 years ago

Hello @codearoo,

good news then. I have been working on brig again in my spare time for a few months now and am making good progress. Since I had quite a break and could think a bit more about the design, I decided to rewrite some parts of it. I will post a notice here once I have something that can be shown to the world (probably end of this year/start of next one).

The Brazil project sounds perfect. That functionality (using IPFS I guess) would rock!

Actually the goals are similar enough to call brig a bazil based on ipfs. We even share the same fuse bindings (written by the bazil author).

Regards, Chris

sahib commented 6 years ago

Okay, I'm a little late. About 9 months too late, actually. But I just released a first version of brig (v0.2.0) that should be suitable for trying out. It still has rough edges, but it needs user input to progress. Please continue all discussion here. I guess this issue can be forgotten since the original topic has been solved.

ipfs / notes

Using ipfs for decentralized file synchronization #364