ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Add 'nosync' option #1616

Closed davidar closed 8 years ago

davidar commented 9 years ago

From #1324:

> @barnacs: I guess the only viable option is then to reconsider durability guarantees and ease up on sync() calls.
>
> @davidar: In the meantime, would it be possible to add a `--nosync` option to `ipfs add` that just disables sync calls? You could add a warning that using it may result in inconsistencies in the blockstore, so people know the risks.
>
> @whyrusleeping: @davidar that would work for offline adds, but if you're doing it with the daemon online, you would have to start the daemon with such an option. I think the ideal stopgap would be to add a nosync field to the config for the datastore in `.ipfs/config`
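For reference, the proposed datastore knob might look like the following in `.ipfs/config` (the `NoSync` field name matches what was later deployed in this thread; the other fields here are illustrative, not the exact config schema):

```json
{
  "Datastore": {
    "Type": "leveldb",
    "NoSync": true
  }
}
```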

jbenet commented 9 years ago

yeah this option would be nice.

whyrusleeping commented 9 years ago

I can add it pretty easily by patching flatfs in dev0.4.0 (where we actually respect the config for datastores)

davidar commented 9 years ago

Thanks. Performance of ipfs add is a major issue for ipfs/archives

whyrusleeping commented 9 years ago

@davidar i'll set up a branch for you with nosync hardcoded so you can work faster.

davidar commented 9 years ago

@whyrusleeping much appreciated :)

whyrusleeping commented 9 years ago

@davidar try this out

https://github.com/ipfs/go-ipfs/compare/temp-nosync

davidar commented 9 years ago

@whyrusleeping I'm getting the following error when trying to build (on both temp-nosync and dev0.4.0 branches):

```
./daemon.go:207: multiple-value repo.Config() in single-value context
```

rht commented 9 years ago

@davidar for now https://github.com/rht/go-ipfs/tree/dev0.4.0 and https://github.com/rht/go-ipfs/tree/temp-nosync, rebased on current master.

davidar commented 9 years ago

Thanks @rht

davidar commented 9 years ago

@whyrusleeping @rht Hmm, no dice. The temp-nosync branch is still painfully slow trying to add a directory with lots of small files (eta approaching 400h).

rht commented 8 years ago

Dependency: https://github.com/jbenet/go-datastore/pull/30.

(sync) up to 1000 files of 1KB each *(benchmark plot omitted)*

(nosync) up to 1000 files of 1KB each *(benchmark plot omitted)*

(nosync) up to 5000 files of 1KB each; the bottleneck hasn't been characterized yet *(benchmark plot omitted)*

jbenet commented 8 years ago

@rht only git is a fair comparison; the others aren't really.

but anyway, sure. let's add both:

davidar commented 8 years ago

@jbenet git still beats the pants off ipfs though (if I squint, I can almost see the line for git :p)

rht commented 8 years ago

Here is one with darcs and sqlite added.

(sync) up to 500 files of 1KB each *(benchmark plot omitted)*

(nosync) up to 500 files of 1KB each, without sqlite *(benchmark plot omitted)*

Since an offline add is equivalent to creating the ipfs archive format, I think this benchmark is fair. Because the files are random, no deduplication / loose-object packing is involved. git has often been compared with cp / rsync in the past (though with git fetch rather than git add), hence it is included here.
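A minimal sketch of the benchmark setup described above (assumed directory layout and file names, not the exact script used):

```shell
# Generate N random 1KB files, as in the benchmark runs above.
mkdir -p bench
for i in $(seq 1 1000); do
  head -c 1024 /dev/urandom > "bench/file$i"
done

# Then time the add (requires an initialized repo; with the daemon
# offline this exercises the flatfs sync/nosync path directly):
# time ipfs add -r -q bench > /dev/null
```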

iirc the next bottleneck is protobuf Marshal, but I have to check again.

jbenet commented 8 years ago

nice. @rht want to shepherd these changes? we need:

rht commented 8 years ago

This is a major bottleneck to the archiving effort (otherwise the ipfs archive format could have meshed with IA sooner).

Rather, I'll write the change (the global-config one) right away after you merge the sync flag in go-datastore. An `ipfs add --no-sync` flag is hard to support when the daemon is on.

rht commented 8 years ago

(and on top of dev0.4.0, after the dev0.4.0 rebase, since it has a lot of datastore changes)

davidar commented 8 years ago

This is a major bottleneck to the archiving effort

Even with nosync, there's still a major bottleneck somewhere :/

rht commented 8 years ago

Yes but at least faster than sync-sqlite.

Here is a breakdown of why things are slow, for `ipfs add -r -q Godeps`:

```
git: 278ms

sync:
addFile 67.480s

no-sync:
addFile 918ms
  add 286ms
    importer.BuildDagFromReader 284ms
      bal.BalancedLayout 283ms
        db.Add 170ms (helpers.DagBuilderHelper)
          dagservice.Add 159ms
  addNode 432ms
    InsertNodeAtPath 491ms
      root.GetLinkedNode 370ms
        n.GetNodeLink 384ms
      dagservice.Add 136ms
```

The bottleneck of both add and addNode converges to dagservice.Add:

```
dagservice.Add 386ms
  nd.Encoded(false) 174ms
    sort.Stable 17ms
    n.Marshal 35.6ms
    u.Hash 139ms
  n.Blocks.AddBlock 205ms
    s.Blockstore.Put 182ms
      block.Key().DsKey() 14ms
      bs.datastore.Has 63ms
      bs.datastore.Put 118ms
```

rht commented 8 years ago

The slowest part that can be optimized is perhaps `n.GetNodeLink`: the link search (basically getting the hash of the folder) could be cached. Also, does the hash of the containing folder have to be recomputed for every single file insert?

What about an `InsertNodesAtPath` for inserting several nodes at once?
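One way to avoid re-walking and re-hashing the directory on every insert is a zipper: keep a focus on the current node plus the path of ancestors, mutate freely at the focus, and hash bottom-up only once at commit time. A toy Go sketch of the idea (hypothetical `Node`/`Zipper` types for illustration, not the go-ipfs API):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Node is a toy stand-in for a merkledag node: a name, a data payload,
// and named links to children.
type Node struct {
	Name     string
	Data     []byte
	Children []*Node
}

// Zipper focuses on one node and remembers the ancestors back to the
// root, so repeated inserts under the same directory don't re-search
// (or re-hash) the whole tree.
type Zipper struct {
	focus *Node
	path  []*Node // ancestors, root first
}

func NewZipper(root *Node) *Zipper { return &Zipper{focus: root} }

// Down moves the focus to the named child, recording the parent.
func (z *Zipper) Down(name string) bool {
	for _, c := range z.focus.Children {
		if c.Name == name {
			z.path = append(z.path, z.focus)
			z.focus = c
			return true
		}
	}
	return false
}

// Insert adds a child under the focus without touching any hashes.
func (z *Zipper) Insert(n *Node) {
	z.focus.Children = append(z.focus.Children, n)
}

// Commit hashes the tree bottom-up exactly once, mirroring the idea
// that hashes are computed only when nodes are about to hit disk.
func Commit(n *Node) [32]byte {
	h := sha256.New()
	h.Write(n.Data)
	for _, c := range n.Children {
		sum := Commit(c)
		h.Write(sum[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	root := &Node{Name: "root"}
	dir := &Node{Name: "docs"}
	root.Children = append(root.Children, dir)

	z := NewZipper(root)
	z.Down("docs")
	// Many inserts at the same focus: no per-insert hashing or path search.
	for i := 0; i < 3; i++ {
		z.Insert(&Node{Name: fmt.Sprintf("f%d", i), Data: []byte{byte(i)}})
	}

	sum := Commit(root) // hash once, at "flush" time
	fmt.Printf("%d files, root hash %x\n", len(dir.Children), sum[:4])
}
```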

davidar commented 8 years ago

Also, for every single file insert, does the hash of the folder containing it has to be recomputed?

It seems like a zipper could help here; it allows efficient traversal and mutation of persistent data structures (like merkledags). This is what @argonaut-io uses, for example.

rht commented 8 years ago

Worth implementing, I think. It appears that a filesystem based on a zipper already exists (http://okmij.org/ftp/continuations/zipper.html).

A merkle version of a zipper is possible as long as the root hash and the sub-root hashes along the path to the node (or nodes, in the case of concurrent mutation) aren't precomputed. In go-ipfs, the hashes are indeed computed only when the nodes are about to be committed to disk.

(@davidar now you don't have to squint https://github.com/ipfs/go-ipfs/pull/1964#issuecomment-156912258)

rht commented 8 years ago

(@whyrusleeping perhaps zipper could be a name for what you requested in https://github.com/ipfs/go-ipfs/blob/master/unixfs/mod/dagmodifier.go#L36 ?)

whyrusleeping commented 8 years ago

@rht i forgot about that comment, something along the lines of zipper sounds good to me there

jbenet commented 8 years ago

zipper :+1:

davidar commented 8 years ago

Worth implementing, I think. It appears that there exists a filesystem based on zipper (http://okmij.org/ftp/continuations/zipper.html).

Haha, of course Oleg would have written such a thing

ghost commented 8 years ago

@rht I want to deploy nosync to castor.i.ipfs.io, which branch should I use? Is there one based on dev0.4.0?

rht commented 8 years ago

@lgierth https://github.com/ipfs/go-ipfs/tree/dev0.4.0 contains nosync (added by @whyrusleeping three days ago). Sorry, I should have notified you.

ghost commented 8 years ago

lovely :)

jbenet commented 8 years ago

is this issue good to be closed, then?

ghost commented 8 years ago

`ipfs add` with `"NoSync": true` is nice and fast on the dev0.4.0 hosts (castor, pollux, pluto)