sahib / brig

File synchronization on top of ipfs with git like interface & web based UI
https://brig.readthedocs.io
GNU Affero General Public License v3.0

New database for brig #92

Closed evgmik closed 3 years ago

evgmik commented 3 years ago

For the brig database we are currently using badger v1.6.2, and there are thoughts to move to a more modern version, 3.2011.1 (at the time of writing). But first we should consider whether badger is the right DB at all.

badger uses separate storage for keys and values, which allows it to manipulate keys very quickly; badger is the fastest according to their benchmark. But this kind of DB benefits the most when keys are short and values are large. This is technically not the case for the brig structure: our keys are hashes of around 50 bytes, and the values are quite short or of comparable size. On one test instance, the raw size (the absolute minimum, without compression) of the keys is 9151 B and of the values 16568 B, so a typical value is only about twice as large as the key. More importantly, the values are small (1-2 kB). So badger, which was designed for big data, brings us no benefit.

In fact we have huge write size amplification due to badger's internal structure. In the above case we would need 9151 + 16568 = 25719 B of raw storage, yet the badger v1.6.2 DB takes 517 kB (all it stores is about 100 commits of the same filename with different content). So we see a write amplification of 20 times. The raw size could be even smaller if we used a more compact value encoding. We also see that a simple `brig touch newfile` takes 3-4 kB of DB storage.

Sure, we all have GBs of HDD space, but there is clearly an inefficiency here.

evgmik commented 3 years ago

(the comment was updated 2021-02-07 with correct numbers)

Storage efficiency of the badger DB for different versions.

I modified the script posted in https://github.com/dgraph-io/badger/issues/686#issue-399905584

The script generates 100,000 key:value pairs, with each key and each value being 10 bytes.

```go
package main

import (
	"fmt"
	"log"

	"github.com/dgraph-io/badger"
)

func main() {
	opts := badger.DefaultOptions("/tmp/badger")
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	wb := db.NewWriteBatch()
	defer wb.Cancel()

	for i := 0; i < 100000; i++ {
		key := fmt.Sprintf("key_%06d", i)
		val := fmt.Sprintf("val_%06d", i)
		// Set queues the write; badger creates transactions as needed.
		if err := wb.Set([]byte(key), []byte(val)); err != nil {
			log.Fatal(err)
		}
	}
	// Flush commits all pending writes and must be checked for errors.
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Bulk upload done")
}
```

With 100,000 pairs of 10-byte keys and 10-byte values, the minimal storage space is 100,000 × 20 B = 2,000,000 bytes = 2 MB.

With badger v1.6.2 the total execution time was 1.3 s, and the DB allocated about a factor of 4 more HDD space than the raw data:

```
$ l /tmp/badger
total 8.3M
4.8M 2021-02-06 21:00 000000.vlog
3.5M 2021-02-06 21:00 000002.sst
 116 2021-02-06 21:00 MANIFEST
```

With the latest badger version (I am not sure which exact version) the total execution time was 0.6 s, and the DB allocated only about a factor of 0.7 of the raw data size:

```
$ l /tmp/badger
total 1.4M
1.4M 2021-02-06 21:32 000001.sst
  20 2021-02-06 21:32 000001.vlog
1.0M 2021-02-06 21:32 DISCARD
  28 2021-02-06 21:32 KEYREGISTRY
  30 2021-02-06 21:32 MANIFEST
```

Note that DISCARD actually occupies only 4 kB (I guess it is empty but preallocated as a sparse file), which is why ls reports 1 MB.

So the badger DB overhead is reasonable with the newer version. In fact it is "negative", i.e. the DB takes less space than the raw data. I am guessing this is due to compression, which has been a feature in badger since v2.0.0. The new version is also about a factor of 2 faster (at least with this simple script).
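For reference, here is a minimal sketch of how compression can be configured when opening a badger v3 database; the path and the ZSTD level are arbitrary examples, not values used by brig:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/badger/v3/options"
)

func main() {
	// badger v2+ compresses SST blocks out of the box (Snappy by
	// default); ZSTD trades some CPU for a better compression ratio.
	opts := badger.DefaultOptions("/tmp/badger").
		WithCompression(options.ZSTD).
		WithZSTDCompressionLevel(3)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```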

sahib commented 3 years ago

> So the badger DB overhead is reasonable with the newer version. In fact it is "negative", i.e. the DB takes less space than the raw data. I am guessing this is due to compression, which has been a feature in badger since v2.0.0. The new version is also about a factor of 2 faster (at least with this simple script).

Those are some lovely results. Do you want to have a go at updating it then? Glad to hear we have a relatively simple pathway out of the inefficient state. Also, I have some badger instances at work that I should update too, thanks for saving me time there. :smirk:

evgmik commented 3 years ago

I will try to upgrade our code to the newer badger; I think the relevant API did not change. But I need help with the go mod infrastructure. When I set

```
require (
    github.com/dgraph-io/badger v3.2011.1
)
```

`go build` fails. So I clearly do not understand something about the modules infrastructure. Any hints?

sahib commented 3 years ago

What's the error?

You usually do not modify go.mod directly, but use `go get github.com/dgraph-io/badger@v3.2011.1`. But I guess since v3 is incompatible, they implemented it as a new package, so you are going to have to do something like `go get github.com/dgraph-io/badger/v3`.

evgmik commented 3 years ago

Hmm, this is how I used to do it: modify go.mod to force a particular version of the package. This is how we upgraded fuse, and `go build` did the heavy lifting. But with badger this does not seem to work. I am playing with my patched version of badger-cli, using the following go.mod:

```
module github.com/bah2830/badger-cli

go 1.14

require (
    github.com/dgraph-io/badger v3.2011.1
    github.com/pelletier/go-toml v1.6.0 // indirect
    github.com/spf13/afero v1.2.2 // indirect
    github.com/spf13/cobra v0.0.5
    github.com/spf13/jwalterweatherman v1.1.0 // indirect
    github.com/spf13/pflag v1.0.5 // indirect
    github.com/spf13/viper v1.6.1
    golang.org/x/sys v0.0.0-20191210023423-ac6580df4449 // indirect
    golang.org/x/text v0.3.2 // indirect
    gopkg.in/yaml.v2 v2.2.7 // indirect
)
```

`go build` returns:

```
go: errors parsing go.mod:
/home/src/my_src/badger-cli/go.mod:6: require github.com/dgraph-io/badger: version "v3.2011.1" invalid: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v3
```

If I replace v3.2011.1 with v1.6.2 or v1.6.0, everything works as expected.

evgmik commented 3 years ago

Looks like I found the way: in go.mod it should be `github.com/dgraph-io/badger/v3 v3.2011.1` (note the double use of v3).

Because of the v3, the import statements in the *.go files must be changed to:

```go
import (
...
    badger "github.com/dgraph-io/badger/v3"
...
)
```
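For completeness, the require block from the earlier go.mod then reads like this (only the badger line changes; the other entries stay as they were):

```
require (
    github.com/dgraph-io/badger/v3 v3.2011.1
    ...
)
```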

Why they treat v2 and above differently is beyond me. Conceptually, it is the same as v0 or v1.

Now I am ready to battle test it.

sahib commented 3 years ago

> Hmm, this is how I used to do it: modify go.mod to force a particular version of the package. This is how we upgraded fuse, and `go build` did the heavy lifting.

Yes, it works; I'm just saying it's not the usual way. `go get` will immediately take action and report errors, which is why I would prefer it.

> Why they treat v2 and above differently is beyond me. Conceptually, it is the same as v0 or v1.

That's a rule of Go versioning: when you introduce API-breaking changes (which v2 and v3 did), you are supposed to increment the major version and make the new version available under a new import path. See also this official blog post.

> Now I am ready to battle test it.

Have fun!

evgmik commented 3 years ago

The PR for the new badger version is ready. A couple of observations: the actual HDD space usage is about the same, and the built-in compression does not do much with our value types (mostly hashes, which are hard to compress).

The raw listing of the DB repo might look scary: there are a few quite large files, but they are sparse and take no space on a modern FS.

Here is a somewhat sad observation (which is true with badger v1.6.2 as well). Creating 100 small files (a few bytes long) generates a DB of 1.5 MB on the HDD. According to badger-cli, the raw space required is only about 100 kB, so the size amplification is 15. This is not too bad. But creating 1000 files leads to a DB of 125 MB, while the raw size is about 1 MB, so we have a write amplification of 150. This is true for both the old and the new DB versions. I am not sure what to blame for this drastic increase in required space.

The test was generated with `for i in $(seq 0 1000); do date +%N | ali stage --stdin /t2$i; done`. Also, creating a file takes an awfully long time, about 0.2 seconds each (1000 files in 226 seconds). My /tmp is on a real spinning HDD. Most likely the slowdown is in the gateway communication, but this is painfully slow.

evgmik commented 3 years ago

A bit more stats on write amplification: adding 2000 small files with the above script resulted in 2 MB of raw space and 500 MB of HDD usage, i.e. a write amplification of 250 times. Something is very fishy here.

evgmik commented 3 years ago

I have good news. After the garbage collector runs (it needs to be run several times), the 130 MB DB is compacted to a 712 kB storage size, which is smaller than the required raw size of 871 kB.

The bad news is that the garbage collection cleanup kicks in only on db.Close(), and it also needs to be done iteratively several times: GC, Close, repeat. So I see no easy way to do it on a running brig. Technically we can lock the DB, rinse it and then reopen it, but this would require some juggling. A sample of the procedure is shown at https://github.com/evgmik/badger-cli/blob/c7c0ad655817ee2916c8f05c1e408645afd8a2e5/cmd/list.go#L43. @sahib, would you be able to add it to our brig GC routines? You know better how to set the mutex correctly. It would be nice to do GC without putting brig offline.
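For illustration, here is a minimal sketch of that GC, Close, repeat procedure, assuming badger v3; the path is a placeholder and the 0.5 discard ratio is just an example:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

// compact keeps calling RunValueLogGC (which rewrites at most one value
// log file per call) until badger reports nothing left to rewrite, then
// closes and reopens the DB so the on-close compaction also runs.
func compact(path string) (*badger.DB, error) {
	db, err := badger.Open(badger.DefaultOptions(path))
	if err != nil {
		return nil, err
	}
	for {
		err := db.RunValueLogGC(0.5)
		if err == badger.ErrNoRewrite {
			break // nothing left to clean up in this pass
		}
		if err != nil {
			db.Close()
			return nil, err
		}
	}
	if err := db.Close(); err != nil {
		return nil, err
	}
	// In practice the whole cycle may need to be repeated a few times.
	return badger.Open(badger.DefaultOptions(path))
}

func main() {
	db, err := compact("/tmp/badger")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```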

The reason for the DB growth seems to be the rewriting of certain keys with a newer version on every DB operation, so the older versions just sit in the DB until garbage collected. I think some of these keys should not be touched at all, so we might improve space handling by being more careful with what goes to the DB. It seems that `tree./filename` is updated way too often, but I need to check this.

sahib commented 3 years ago

Thanks for the digging & update. The numbers look good.

Did you see those methods? I wonder if they can be used to call GC on a live badger database:

* https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.Flatten (that's apparently what's also called on `Close()`)
* https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.RunValueLogGC

Note that `RunValueLogGC` apparently is supposed to be called several times, because only one log file is compacted on each call. If those don't work, I can have a try at forcing garbage collection via close. The pattern would likely be to not access the database handle directly, but to implement a getter that locks every N accesses and then does a GC run. I would be a little worried though that this might hit performance quite a bit, due to deleting caches every now and then.

> The reason for the DB growth seems to be the rewriting of certain keys with a newer version on every DB operation, so the older versions just sit in the DB until garbage collected. I think some of these keys should not be touched at all, so we might improve space handling by being more careful with what goes to the DB. It seems that `tree./filename` is updated way too often, but I need to check this.

That theory seems sound, but I don't consider rewriting the same key often to be misbehavior. Sure, not doing this is a workaround, but the underlying database layer should allow it.

evgmik commented 3 years ago

> Did you see those methods? I wonder if they can be used to call GC on a live badger database:
>
> * https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.Flatten (that's apparently what's also called on `Close()`)

Yes, I played with Flatten a bit; it seems to keep the DB file size unchanged, i.e. it does not lead to GC.

> * https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.RunValueLogGC
>
> Note that `RunValueLogGC` apparently is supposed to be called several times, because only one log file is compacted on each call.

Yes, I noticed that. I do not know what they were smoking to make this decision: given that the real cleanup happens on Close, it does not decrease any important metric, and it makes the whole thing probabilistic. Also, GC is not greedy; it does the minimal required cleanup, or nothing at all. The workaround might be to increase the vlog file size from 10 MB (which I thought would be enough for everything :)), so that all values, obsolete and new, end up in the same file; thus one GC call would hit the one and only file. Though some values sit together with keys in the SST tables, so there is more than one file in the DB anyway.
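A sketch of that workaround, assuming badger v3's options API; the 1 GiB figure is only an example, not a tested value:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// With a single large value log file, obsolete and live values
	// share one file, so one RunValueLogGC call can reclaim them all.
	opts := badger.DefaultOptions("/tmp/badger").
		WithValueLogFileSize(1 << 30) // 1 GiB instead of a small vlog

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```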

> If those don't work, I can have a try at forcing garbage collection via close. The pattern would likely be to not access the database handle directly, but to implement a getter that locks every N accesses and then does a GC run. I would be a little worried though that this might hit performance quite a bit, due to deleting caches every now and then.

The GC, close, repeat cycle is fairly fast: it takes about 8 seconds to trim from 220 MB to 1 MB. We might add some logic which monitors the raw size vs. consumed size ratio and triggers a GC when it hits a certain watermark.
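Something like the following hypothetical check could drive that watermark logic; `rawBytes` would have to be tracked by brig itself, since badger only reports its own on-disk footprint:

```go
package db

import badger "github.com/dgraph-io/badger/v3"

// needsGC compares badger's on-disk footprint (LSM tree + value log)
// against the raw payload size tracked by the caller and reports
// whether the amplification factor crossed the given watermark.
// This is a sketch; brig has no such helper yet.
func needsGC(db *badger.DB, rawBytes int64, watermark float64) bool {
	if rawBytes <= 0 {
		return false
	}
	lsm, vlog := db.Size()
	return float64(lsm+vlog)/float64(rawBytes) > watermark
}
```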

> > The reason for the DB growth seems to be the rewriting of certain keys with a newer version on every DB operation, so the older versions just sit in the DB until garbage collected. I think some of these keys should not be touched at all, so we might improve space handling by being more careful with what goes to the DB. It seems that `tree./filename` is updated way too often, but I need to check this.
>
> That theory seems sound, but I don't consider rewriting the same key often to be misbehavior. Sure, not doing this is a workaround, but the underlying database layer should allow it.

I agree, it is not like we are asking too much of the DB: we have only about 5 keys per file entry. Any reasonable DB should be able to work with that, even if we rewrite keys.

sahib commented 3 years ago

I see. I'll try to implement it as one of the next items. Might take a bit though.

If you want to have another go, here's what needs to be done:

evgmik commented 3 years ago
> * [Here](https://github.com/sahib/brig/blob/39e9d6092c25fcd4e8aff46b0f5b551995f28f28/catfs/db/database_badger.go#L59) we need to hold the `db.mu` lock and then close the database & re-initialize it. While holding the mutex, all other methods will block until the GC is done. From the looks of it, that seems to be all, besides testing.

I did it in 77b7a7e
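Roughly, the approach looks like the sketch below; the type and field names are illustrative, not the actual code from 77b7a7e:

```go
package db

import (
	"sync"

	badger "github.com/dgraph-io/badger/v3"
)

// DB is a stand-in for brig's badger wrapper: a handle plus a mutex
// that every other method also takes before touching the handle.
type DB struct {
	mu   sync.Mutex
	db   *badger.DB
	path string
}

// GC holds the mutex for the whole close & re-init cycle, so all
// other methods block until the garbage collection is done.
func (d *DB) GC() error {
	d.mu.Lock()
	defer d.mu.Unlock()

	// Compact the value logs as far as possible before closing;
	// the loop stops on ErrNoRewrite (or any other error).
	for d.db.RunValueLogGC(0.5) == nil {
	}
	if err := d.db.Close(); err != nil {
		return err
	}
	db, err := badger.Open(badger.DefaultOptions(d.path))
	if err != nil {
		return err
	}
	d.db = db
	return nil
}
```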

> * The GC loop should probably be changed from a fixed interval to a number of writes instead. So when writes > 1000, do a GC, or something like that. Probably needs a bit of fiddling around.

I decided to leave it on a timer, but it monitors whether the DB was modified (flushed).

I once got a message from the client about being unable to read STATUS from the DB; I am guessing the DB was closed at that instant. I have no clue how the db.Get penetrated through the mutex. But I cannot reproduce it reliably, and the error dissolved on the next client call.

Wonderful news that the 150 MB DB was compacted to 700 kB on the fly.

By the way, the patch would work with the old badger too.

evgmik commented 3 years ago

#96 implements the upgrade from badger v1.6.2 to v3.2011.1.