ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

GC is slower with badger (with syncwrites enabled) #4298

Open Stebalien opened 7 years ago

Stebalien commented 7 years ago

GC is 4x slower with the badger datastore as it actually has to write data, not just delete files. We may need to better batch/parallelize deletes (probably want a DeleteBlocks method).
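A minimal sketch of what a batched DeleteBlocks could look like. The `Batch` and `BatchingStore` interfaces and the in-memory `memStore` below are hypothetical stand-ins invented for illustration, not the real blockstore or datastore API; the point is only that N deletes cost one commit per batch instead of one synced write per key.

```go
package main

import "fmt"

// Batch is a hypothetical write batch: deletes are queued in memory and
// applied together on Commit.
type Batch interface {
	Delete(key string)
	Commit() error
}

// BatchingStore is a hypothetical store that can hand out batches.
type BatchingStore interface {
	NewBatch() Batch
}

// memStore is an in-memory stand-in for a real datastore. It counts commits
// so the effect of batching is visible.
type memStore struct {
	data    map[string][]byte
	commits int
}

type memBatch struct {
	store   *memStore
	pending []string
}

func (s *memStore) NewBatch() Batch { return &memBatch{store: s} }

func (b *memBatch) Delete(key string) { b.pending = append(b.pending, key) }

// Commit applies all queued deletes and records a single commit.
func (b *memBatch) Commit() error {
	for _, k := range b.pending {
		delete(b.store.data, k)
	}
	b.pending = nil
	b.store.commits++
	return nil
}

// DeleteBlocks removes keys in batches of at most batchSize, committing once
// per batch instead of once per key.
func DeleteBlocks(s BatchingStore, keys []string, batchSize int) error {
	if batchSize < 1 {
		return fmt.Errorf("batchSize must be positive, got %d", batchSize)
	}
	for start := 0; start < len(keys); start += batchSize {
		end := start + batchSize
		if end > len(keys) {
			end = len(keys)
		}
		b := s.NewBatch()
		for _, k := range keys[start:end] {
			b.Delete(k)
		}
		if err := b.Commit(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &memStore{data: map[string][]byte{}}
	var keys []string
	for i := 0; i < 10; i++ {
		k := fmt.Sprintf("block-%d", i)
		s.data[k] = []byte("x")
		keys = append(keys, k)
	}
	if err := DeleteBlocks(s, keys, 4); err != nil {
		panic(err)
	}
	// Remaining keys, then number of commits.
	fmt.Println(len(s.data), s.commits)
}
```

With 10 keys and a batch size of 4, this prints `0 3`: all keys are gone after three commits rather than ten.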

magik6k commented 7 years ago

GC also needs to be called on the badger instance itself; we might want to expose this too.
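A rough sketch of driving Badger's own value log GC after blocks have been deleted. To my understanding Badger exposes a method shaped like `RunValueLogGC(discardRatio float64) error` that returns an error once there is nothing left to rewrite; the `valueLogGC` interface, `errNoRewrite`, and `fakeDB` below are local stand-ins declared only so the example runs without the Badger dependency, and the 0.5 ratio is an arbitrary illustrative value.

```go
package main

import (
	"errors"
	"fmt"
)

// valueLogGC is the one method this sketch needs. It is declared locally
// (an assumption mirroring Badger's method shape), not imported from Badger.
type valueLogGC interface {
	RunValueLogGC(discardRatio float64) error
}

// errNoRewrite is a local stand-in for the error a store returns when a GC
// pass found nothing worth rewriting.
var errNoRewrite = errors.New("value log GC: nothing to rewrite")

// fakeDB pretends to hold a fixed number of value log files worth rewriting.
type fakeDB struct{ rewritable int }

func (f *fakeDB) RunValueLogGC(discardRatio float64) error {
	if f.rewritable == 0 {
		return errNoRewrite
	}
	f.rewritable--
	return nil
}

// collect calls RunValueLogGC until it stops making progress and reports how
// many passes actually rewrote something.
func collect(db valueLogGC, discardRatio float64) (int, error) {
	passes := 0
	for {
		err := db.RunValueLogGC(discardRatio)
		if errors.Is(err, errNoRewrite) {
			return passes, nil // normal termination: nothing left to do
		}
		if err != nil {
			return passes, err // real failure: surface it to the caller
		}
		passes++
	}
}

func main() {
	db := &fakeDB{rewritable: 3}
	passes, err := collect(db, 0.5)
	if err != nil {
		panic(err)
	}
	fmt.Println(passes)
}
```

If something like this were exposed through the datastore, the repo GC could call it once after its delete phase rather than after every block.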

Kubuxu commented 7 years ago

I can do batching as part of https://github.com/ipfs/go-ipfs/pull/4149/ . go-ds-flatfs doesn't have real delete batching (it just queued deletes up and performed them all at the end), which is why batching was never used there.

Stebalien commented 7 years ago

@Kubuxu Sounds like a good idea (although it can be a separate PR if it adds too much code; large PRs are a pain to review and/or rebase).

Kubuxu commented 7 years ago

It shouldn't be, but making it a separate PR is a good idea either way.

schomatis commented 6 years ago

@Stebalien

> GC is 4x slower with the badger datastore as it actually has to write data, not just delete files.

Yes, and even more expensive than the rewrite operations during GC are its searches for every key in the value log file being checked, which decide whether the threshold to trigger the rewrite is exceeded. Do you have a test that would point to that 4x performance impact?

> We may need to better batch/parallelize deletes (probably want a DeleteBlocks method).

I don't understand how the parallelization would help. Is GC called after every block deletion?

schomatis commented 6 years ago

So, trying to reduce my own noise: GC is indeed slower with syncWrites enabled (4x sounds about right). Badger's creator suggested turning it off during GC (not sure if that is possible) and also parallelizing deletes as mentioned here (or alternatively running bs.DeleteBlock(k) concurrently in multiple goroutines).
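For reference, running the deletes concurrently could look roughly like the sketch below. The `Blockstore` interface and `memBlockstore` are hypothetical and only model a `DeleteBlock` call with a string key; the real blockstore takes different key types, and the worker count of 8 is an arbitrary choice.

```go
package main

import (
	"fmt"
	"sync"
)

// Blockstore is a hypothetical, minimal stand-in for the real blockstore;
// only the delete call is modelled.
type Blockstore interface {
	DeleteBlock(key string) error
}

// memBlockstore is an in-memory implementation guarded by a mutex so that it
// is safe to call from several goroutines.
type memBlockstore struct {
	mu     sync.Mutex
	blocks map[string][]byte
}

func (m *memBlockstore) DeleteBlock(key string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.blocks[key]; !ok {
		return fmt.Errorf("block %q not found", key)
	}
	delete(m.blocks, key)
	return nil
}

// deleteConcurrently fans keys out to a fixed number of worker goroutines and
// returns every error the workers hit.
func deleteConcurrently(bs Blockstore, keys []string, workers int) []error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex // protects errs
		errs []error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for k := range jobs {
				if err := bs.DeleteBlock(k); err != nil {
					mu.Lock()
					errs = append(errs, err)
					mu.Unlock()
				}
			}
		}()
	}
	for _, k := range keys {
		jobs <- k
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return errs
}

func main() {
	bs := &memBlockstore{blocks: map[string][]byte{}}
	var keys []string
	for i := 0; i < 100; i++ {
		k := fmt.Sprintf("block-%d", i)
		bs.blocks[k] = []byte("x")
		keys = append(keys, k)
	}
	errs := deleteConcurrently(bs, keys, 8)
	// Remaining blocks, then number of failed deletes.
	fmt.Println(len(bs.blocks), len(errs))
}
```

Whether this helps in practice presumably depends on the store being able to group concurrent synced writes; with a store that serializes every write, as the mutex-guarded stand-in here does, the workers just queue behind each other.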

I can confirm (from simple tests) that GC with syncWrites disabled has pretty much the same performance as flatfs, and that Badger's own GC (triggered when there is more than one value log file, i.e., more than 1GB of data in the repo) has a running time not much longer than flatfs (1.25-1.5x). More tests are needed.