Stebalien opened this issue 7 years ago
@Stebalien Do you mind if I hijack this issue to keep track of all the other issues related to the Badger transition?
@schomatis go right ahead!
Is this still on track to happen sometime soon?
In addition to the issues listed in the description, ~~we're still working through some recovery issues~~ (not a bug) and memory usage is pretty bad (we may be able to tune this a bit ~~but I'm getting some really weird behavior on Linux~~ (can't reproduce anymore)).
Basically, we can't roll this out until:
Thanks for the reference, I should add those.
I doubt this affects most users but I'm linking it anyway. My own instance of IPFS runs with flatfs hosted on an SMB/CIFS share. badger doesn't currently handle this (https://github.com/dgraph-io/badger/issues/699), although it could.
For full context: I do this because my local disks are small, and I can't run IPFS on the remote machine because components of libp2p don't build on Solaris yet. (When trying to port it, I encountered an oddity where the Go standard library claims something is implemented but it isn't.)
@djdv that is why we provide other datastore implementations and simple switches to initialise the repo with different configurations.
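For reference, here is a rough sketch of what that choice looks like at the datastore level; the repo paths and option values are illustrative, not the exact defaults go-ipfs uses, and at init time the same switch is exposed through config profiles such as badgerds.

```go
package main

import (
	"log"

	badgerds "github.com/ipfs/go-ds-badger"
	flatfs "github.com/ipfs/go-ds-flatfs"
)

func main() {
	// flatfs: one block per file, sharded by the next-to-last two characters
	// of the key (the layout go-ipfs uses by default for the blockstore).
	fds, err := flatfs.CreateOrOpen("/path/to/repo/blocks", flatfs.NextToLast(2), true)
	if err != nil {
		log.Fatal(err)
	}
	defer fds.Close()

	// badger: a single LSM-tree database directory instead of many small files.
	opts := badgerds.DefaultOptions
	bds, err := badgerds.NewDatastore("/path/to/repo/badgerds", &opts)
	if err != nil {
		log.Fatal(err)
	}
	defer bds.Close()
}
```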
@magik6k IMO, we should be able to graduate badger from experimental even if we don't go ahead and make it the default. However, we may want to land https://github.com/ipfs/go-ds-badger/issues/51 first.
Hey, what's the status here? I see that https://github.com/ipfs/go-ds-badger/issues/51 is closed; does that mean the badger datastore can be considered pretty mature now?
We should probably update to badger v2 before using it as default.
I just want to note that flatfs has some advantages: when using ZFS as the underlying filesystem (with the ZFS blocksize set to 256 K and raw-leaves for IPFS), you can dedup the block storage against a copy of the data outside the block storage that you might need to keep for a different service, like HTTP. This isn't possible with data stored inside a database.
It would be nice if support for flatfs isn't dropped in the future. :)
At the moment, we plan on keeping flatfs. It has a tendency to "just work everywhere". The main downside is that it's impossible to optimize.
Well, that's neat! :)
I think optimizations depend on the filesystem. You could, for example, add a fast SSD, like the Intel Optane ones, as the ZFS cache or as a 'small files' vdev.
This should give a major boost in read performance.
The clear advantage of something like ZFS is that it can roll back to a clean state after a power outage, even with write sync off. Writes can be accepted in bulk and committed to storage in an orderly manner while the device is not serving read requests - much like a RAID controller's write cache, just backed by basically all of the free main memory.
Is there a middle ground where flatfs remains the default for low-power configurations, and badger becomes the default otherwise?
Maybe it would be useful to specify target levels for space overhead, memory use, and compaction behavior that would be acceptable before making the switch.
Yes, but I don't actually think we have to. Once we fix the final memory issue (shouldn't be too hard), low power nodes should be just fine.
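For anyone who wants to experiment in the meantime, badger v1 exposes a few knobs that trade memory for throughput. A rough sketch of setting them through go-ds-badger; the values below are illustrative, not tested recommendations:

```go
package main

import (
	"log"

	badgeropts "github.com/dgraph-io/badger/options"
	badgerds "github.com/ipfs/go-ds-badger"
)

func main() {
	// Start from the defaults and trim the knobs that dominate memory use.
	opts := badgerds.DefaultOptions
	opts.TableLoadingMode = badgeropts.FileIO    // don't keep every SSTable in RAM
	opts.ValueLogLoadingMode = badgeropts.FileIO // likewise for the value log
	opts.NumMemtables = 2                        // badger v1 default is 5
	opts.ValueLogFileSize = 64 << 20             // smaller value-log segments (64 MiB)

	ds, err := badgerds.NewDatastore("/path/to/repo/badgerds", &opts)
	if err != nil {
		log.Fatal(err)
	}
	defer ds.Close()
}
```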
In this context I would like to point out the following issue, a serious problem with garbage not being collected: https://github.com/ipfs/go-ds-badger/issues/54#issuecomment-819866098
Yep, I've added that to the list. Unfortunately, IIRC, badger v2 had its own issues so we're on to badger v3 now.
May I ask about the progress on this one?
GC doesn't actually work in v1: https://github.com/ipfs/go-ds-badger/issues/54#issuecomment-819866098
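For context, value-log GC in badger v1 boils down to repeatedly calling RunValueLogGC until it reports there is nothing left to rewrite; the linked issue is about this failing to reclaim the space it should. A rough sketch, assuming the badger v1 API and an illustrative repo path:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger"
)

func main() {
	// Open the store with the badger v1 API (path illustrative).
	opts := badger.DefaultOptions
	opts.Dir = "/path/to/repo/badgerds"
	opts.ValueDir = opts.Dir
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Rewrite value-log files until badger reports there is nothing left to
	// reclaim; the linked issue is about this loop not shrinking the store.
	for {
		err := db.RunValueLogGC(0.5) // rewrite files that are >= 50% garbage
		if err == badger.ErrNoRewrite {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
}
```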
v1 seems to no longer be maintained, and this will probably never be addressed there. Wouldn't it be better to try another plan, like moving to v2 or v3?
We will not make v1 the default as it is unmaintained, and v2 has issues as @Stebalien pointed out. DGraph, the company that maintains BadgerDB, is undergoing a leadership shake-up, so we're hesitant to make v3 the default until we are confident that v3 will be maintained in the long run.
@guseggert Thanks for the update. Any news on this, by now? Are you looking into alternative datastores?
> Are you looking into alternative datastores?
flatfs
I want to write an LSM datastore with reflinking, but I won't work on this before #8201 is a thing.
This is the master issue to centralize all other issues/PRs related to the Badger transition.
The priorities to check before the transition are:
DB integrity. That means, besides minimizing data loss in case of errors such as a system crash or running out of disk space, always keeping the database consistent enough for Badger (and IPFS) to be able to start. Some truncation might be needed, for example, but the scenario to avoid is Badger encountering a DB it can't work with (e.g., a failed assertion it doesn't know how to recover from) and refusing to start (which would mean IPFS would not work) without manual intervention, which can't be expected of the average end user. (A sketch of one relevant knob follows this list.)
Performance in worst-case scenarios. We are transitioning from flat file-system storage (one key, one file), which in most cases performs (much) worse than Badger, but there are some scenarios (e.g., GC or some search cases) where a flat architecture may outperform Badger (or any other LSM architecture, for that matter). Those cases should be minimized as much as possible so the end user won't notice the transition.
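As a concrete illustration of the first point, badger v1 has a Truncate option that lets it cut off a torn value-log tail and start anyway, instead of refusing to open. A rough sketch of setting it through go-ds-badger (path illustrative; whether it should be on by default is part of the discussion here):

```go
package main

import (
	"log"

	badgerds "github.com/ipfs/go-ds-badger"
)

func main() {
	opts := badgerds.DefaultOptions
	// Let badger truncate a value log torn by a crash or a full disk instead
	// of failing the open and leaving the daemon unable to start.
	opts.Truncate = true

	ds, err := badgerds.NewDatastore("/path/to/repo/badgerds", &opts)
	if err != nil {
		// A DB badger can't recover from on its own, without manual
		// intervention, is exactly the scenario to avoid.
		log.Fatal(err)
	}
	defer ds.Close()
}
```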
The active issues (mostly the ones tagged with badger) are:
Testing:
Integrity:
Performance:
Usability:
Nice to have: