(the comment was updated 2021-02-07 with correct numbers)
Storage efficiency of badger db for different versions.
I modified the script posted at https://github.com/dgraph-io/badger/issues/686#issue-399905584
The script generates 100,000 key:value pairs, with each key and each value being 10 bytes.
```go
package main

import (
	"fmt"
	"log"

	"github.com/dgraph-io/badger"
)

func main() {
	opts := badger.DefaultOptions("/tmp/badger")
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	wb := db.NewWriteBatch()
	defer wb.Cancel()

	for i := 0; i < 100000; i++ {
		key := fmt.Sprintf("key_%06d", i)
		val := fmt.Sprintf("val_%06d", i)
		err := wb.Set([]byte(key), []byte(val)) // Will create txns as needed.
		if err != nil {
			log.Fatal(err)
		}
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Bulk upload done")
}
```
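The size numbers below are just read off the directory listing; if you want the apparent on-disk total programmatically, a plain standard-library helper along these lines should do it (a sketch only; note it reports apparent sizes, so sparse files are over-counted, exactly like `ls`):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// Sums the apparent size of all files under the badger directory.
// Sparse files (e.g. DISCARD) are counted at their apparent size,
// not the blocks actually used on disk.
func main() {
	var total int64
	err := filepath.Walk("/tmp/badger", func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("apparent DB size: %.2f MB\n", float64(total)/1e6)
}
```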
With 100,000 pairs of 10-byte keys and 10-byte values, the minimal storage space is 2 MB = 2,000,000 bytes.
With badger v1.6.2: 1.3 s total execution time, and about a factor of 4 more HDD space allocated than the raw data.
```
$ l /tmp/badger
total 8.3M
4.8M 2021-02-06 21:00 000000.vlog
3.5M 2021-02-06 21:00 000002.sst
 116 2021-02-06 21:00 MANIFEST
```
With the latest badger version (not sure about the exact version): 0.6 s total execution time, and the HDD space allocated is only about 0.7 of the raw data size.
```
$ l /tmp/badger
total 1.4M
1.4M 2021-02-06 21:32 000001.sst
  20 2021-02-06 21:32 000001.vlog
1.0M 2021-02-06 21:32 DISCARD
  28 2021-02-06 21:32 KEYREGISTRY
  30 2021-02-06 21:32 MANIFEST
```
Note that DISCARD actually occupies only 4 kB on disk (I guess it is empty but preallocated as a sparse file), even though `ls` shows 1 MB.
So the badger DB overhead is reasonable with the newer version. In fact it is "negative", i.e. the DB takes less space than the raw data. I am guessing this is due to compression, which is a new feature in badger since v2.0.0. Also, the new version is about a factor of 2 faster (at least with this simple script).
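If the compression guess is right, it should be tunable via the options. As far as I can tell, the v2+/v3 API exposes it roughly like this (a sketch using the v3 import path; double-check the option names against the docs):

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/badger/v3/options"
)

func main() {
	// Recent badger versions compress blocks by default (Snappy, as far
	// as I can tell); switching between None, Snappy and ZSTD makes it
	// easy to compare the on-disk footprint.
	opts := badger.DefaultOptions("/tmp/badger").
		WithCompression(options.ZSTD) // or options.Snappy / options.None

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```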
> So the badger DB overhead is reasonable with the newer version. In fact it is "negative", i.e. the DB takes less space than the raw data. I am guessing this is due to compression, which is a new feature in badger since v2.0.0. Also, the new version is about a factor of 2 faster (at least with this simple script).
Those are some lovely results. Do you want to have a go at updating it then? Glad to hear we have a relatively simple pathway out of the inefficient state. Also, I have some badger instances at work that I should update too, thanks for saving me time there. :smirk:
I will try to upgrade our code to the newer badger; I think the relevant API did not change. But I need help with the go mod infrastructure. When I set
```
require (
    github.com/dgraph-io/badger v3.2011.1
)
```
`go build` fails. So I clearly do not understand something about the modules infrastructure. Any hints?
What's the error?
You usually do not modify `go.mod` directly, but use `go get github.com/dgraph-io/badger@v3.2011.1`. But I guess since v3 is incompatible, they implemented it as a new package, so you are going to have to do something like `go get github.com/dgraph-io/badger/v3`.
Hmm, this is how I used to do it: modify `go.mod` to force a particular version of the package. This is how we upgraded fuse, and `go build` did the heavy lifting. But with badger this does not seem to work. I am playing with my patched version of badger-cli and using the following in `go.mod`:
```
module github.com/bah2830/badger-cli

go 1.14

require (
	github.com/dgraph-io/badger v3.2011.1
	github.com/pelletier/go-toml v1.6.0 // indirect
	github.com/spf13/afero v1.2.2 // indirect
	github.com/spf13/cobra v0.0.5
	github.com/spf13/jwalterweatherman v1.1.0 // indirect
	github.com/spf13/pflag v1.0.5 // indirect
	github.com/spf13/viper v1.6.1
	golang.org/x/sys v0.0.0-20191210023423-ac6580df4449 // indirect
	golang.org/x/text v0.3.2 // indirect
	gopkg.in/yaml.v2 v2.2.7 // indirect
)
```
`go build` returns:

```
go: errors parsing go.mod:
/home/src/my_src/badger-cli/go.mod:6: require github.com/dgraph-io/badger: version "v3.2011.1" invalid: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v3
```
If I replace `v3.2011.1` with `v1.6.2` or `v1.6.0`, everything works as expected.
Looks like I found the way: in `go.mod` it should be `github.com/dgraph-io/badger/v3 v3.2011.1` (note the double use of `v3`). Because of the `/v3`, the import statements in `*.go` have to be changed to:
```
import (
	...
	badger "github.com/dgraph-io/badger/v3"
	...
)
```
Why they treat v2 and above differently is beyond me. Conceptually, it is the same as v0 or v1.
Now I am ready to battle test it.
> Hmm, this is how I used to do it: modify `go.mod` to force a particular version of the package. This is how we upgraded fuse, and `go build` did the heavy lifting.
Yes, it works - just saying it's not the usual way. `go get` will immediately take action and report errors, that's why I would prefer it.
> Why they treat v2 and above differently is beyond me. Conceptually, it is the same as v0 or v1.
That's a rule of Go module versioning. When you introduce API-breaking changes (which v2 and v3 did), you're supposed to increment the major version and make the new version available under a new import path. See also this official blog post.
> Now I am ready to battle test it.
Have fun!
The PR for the new badger version is ready. A couple of observations: the actual HDD space usage is about the same; the built-in compression does not do much for our value types (mostly hashes, which are hard to compress).
The raw listing of the DB repo might look scary: there are a few quite large files, but they are sparse and take no space on a modern FS.
Here is a somewhat sad observation (which is true with badger v1.6.2 as well). Creating 100 files (each a few bytes long) generates a DB of 1.5 MB on the HDD. According to badger-cli, the raw space required is only about 100 kB, so the size amplification is 15. This is not too bad. But creating 1000 files leads to a DB of 125 MB, while the raw size is about 1 MB, i.e. a write amplification of 150. This is true for both old and new DB versions. I am not sure what to blame for this drastic increase of required space.
The test was generated with `for i in $(seq 0 1000); do date +%N | ali stage --stdin /t2$i; done`
Also, creating a file takes an awfully long time: 1000 files in 226 seconds, i.e. roughly 0.2 seconds per file. My /tmp is on a real spinning HDD. Most likely the slowdown is in the gateway communication, but this is painfully slow.
A bit more statistics on write amplification: adding 2000 small files with the above script resulted in 2 MB of raw space and 500 MB of HDD usage, i.e. a write amplification of 250. Something is very fishy here.
I have good news. After the garbage collector (which needs to be run several times), the 130 MB DB is compacted to 712 kB of storage, which is even smaller than the raw size of 871 kB.
The bad news is that the garbage-collection cleanup kicks in only on `db.Close()`, and it also needs to be done iteratively several times: GC, Close, repeat. So I see no easy way to do it on a running brig. Technically we could lock the DB, rinse it and then reopen it, but this would require some juggling. A sample of the procedure is shown at
https://github.com/evgmik/badger-cli/blob/c7c0ad655817ee2916c8f05c1e408645afd8a2e5/cmd/list.go#L43
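Roughly, the cycle looks like the sketch below (not a copy of the linked code; error handling is trimmed, the 0.5 discard ratio is just the value from the badger examples, and the `dbutil` package name is a placeholder):

```go
package dbutil

import (
	badger "github.com/dgraph-io/badger/v3"
)

// gcUntilClean runs badger's value-log GC repeatedly, then closes and
// reopens the DB, and repeats until a full GC round reclaims nothing.
func gcUntilClean(path string) (*badger.DB, error) {
	db, err := badger.Open(badger.DefaultOptions(path))
	if err != nil {
		return nil, err
	}
	for {
		// RunValueLogGC compacts at most one value-log file per call,
		// so keep calling it until it reports nothing left to rewrite.
		cleaned := false
		for db.RunValueLogGC(0.5) == nil {
			cleaned = true
		}
		if !cleaned {
			return db, nil // nothing reclaimed in this round, we are done
		}
		// In my tests the space is only actually released on Close.
		if err := db.Close(); err != nil {
			return nil, err
		}
		if db, err = badger.Open(badger.DefaultOptions(path)); err != nil {
			return nil, err
		}
	}
}
```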
@sahib, would you be able to add it to our brig GC routines? You know better how to set up the mutex correctly. It would be nice to do GC without putting brig offline.
The reason for the DB growth seems to be the rewriting of certain keys with a newer version on every DB operation, so older versions just sit in the DB until garbage collected. I think some of these keys should not be touched at all, so we might improve space handling by being more careful with what goes to the DB. It seems that `tree./filename` is updated way too often, but I need to check this.
Thanks for the digging & update. The numbers look good.
Did you see those methods? I wonder if they can be used to call GC on a live badger database:

* https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.Flatten (that's apparently what's also called on `Close()`)
* https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.RunValueLogGC

Note that RunValueLogGC apparently is supposed to be called several times, because only one log file is compacted on each call. If those don't work I can have a try to force garbage collection over close. The pattern would be likely to not access the database handle directly, but to implement a getter that will lock every N access and then do a GC run. I would be a little worried though that this might hit performance quite a bit, due to deleting caches every now and then.
> The reason for the DB growth seems to be the rewriting of certain keys with a newer version on every DB operation, so older versions just sit in the DB until garbage collected. I think some of these keys should not be touched at all, so we might improve space handling by being more careful with what goes to the DB. It seems that `tree./filename` is updated way too often, but I need to check this.
That theory seems sound, but I don't consider rewriting the same key often as misbehavior. Sure, not doing this is a workaround, but the underlying database layer should allow this.
> Did you see those methods? I wonder if they can be used to call GC on a live badger database:
> * https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.Flatten (that's apparently what's also called on `Close()`)
Yes, I played with `Flatten` a bit; it seems to keep the DB file size unchanged, i.e. it does not lead to GC.
> * https://pkg.go.dev/github.com/dgraph-io/badger/v3#DB.RunValueLogGC
>
> Note that RunValueLogGC apparently is supposed to be called several times, because only one log file is compacted on each call.
Yes, I noticed that. I do not know what they were smoking to make this decision. Given that this happens on `Close`, it does not decrease any important metric, and it makes the whole thing probabilistic. Also, the GC is not greedy: it does the minimal required cleanup, or nothing at all. A workaround might be to increase the vlog file size from 10 MB (which I thought would be enough for everything :), so that all values (obsolete and new) end up in the same file. Then a GC call would hit the one and only file. Though some values sit together with keys in the SST tables, so there is more than one file in the DB.
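The knob I mean is, I believe, `WithValueLogFileSize`. Something like this (a sketch; the 1 GiB number is arbitrary, and `dbutil` is again a placeholder package name):

```go
package dbutil

import (
	badger "github.com/dgraph-io/badger/v3"
)

// openWithBigVlog raises the value-log file size so that, ideally, all
// values land in a single .vlog file and one RunValueLogGC call can
// reclaim the whole thing.
func openWithBigVlog(path string) (*badger.DB, error) {
	opts := badger.DefaultOptions(path).
		WithValueLogFileSize(1 << 30) // 1 GiB instead of the 10 MB we use now
	return badger.Open(opts)
}
```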
> If those don't work I can have a try to force garbage collection over close. The pattern would be likely to not access the database handle directly, but to implement a getter that will lock every N access and then do a GC run. I would be a little worried though that this might hit performance quite a bit, due to deleting caches every now and then.
The GC, close, repeat cycle is fairly fast: it takes about 8 seconds to trim from 220 MB to 1 MB. We might add some logic which monitors the raw size vs. consumed size ratio and enables the GC when it hits a certain watermark.
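For the watermark idea, badger reports its own on-disk footprint via `db.Size()`. A check along these lines could decide when to trigger the GC cycle (a sketch; `rawSize` would have to come from our own bookkeeping, badger does not know it):

```go
package dbutil

import (
	badger "github.com/dgraph-io/badger/v3"
)

// needsGC reports whether the on-disk footprint exceeds the raw payload
// size by more than maxRatio, e.g. needsGC(db, rawSize, 10.0).
func needsGC(db *badger.DB, rawSize int64, maxRatio float64) bool {
	if rawSize <= 0 {
		return false
	}
	lsm, vlog := db.Size() // bytes used by the LSM tree and the value log
	return float64(lsm+vlog)/float64(rawSize) > maxRatio
}
```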
> The reason for the DB growth seems to be the rewriting of certain keys with a newer version on every DB operation, so older versions just sit in the DB until garbage collected. I think some of these keys should not be touched at all, so we might improve space handling by being more careful with what goes to the DB. It seems that `tree./filename` is updated way too often, but I need to check this.
>
> That theory seems sound, but I don't consider rewriting the same key often as misbehavior. Sure, not doing this is a workaround, but the underlying database layer should allow this.
I agree, it is not like we are asking too much from the DB; we have only about 5 keys per file entry. Any reasonable DB should be able to work with that, even if we rewrite keys.
I see. I'll try to implement it as one of the next items. Might take a bit though.
If you want to have another go, here's what needs to be done:
> * [Here](https://github.com/sahib/brig/blob/39e9d6092c25fcd4e8aff46b0f5b551995f28f28/catfs/db/database_badger.go#L59) we need to hold the `db.mu` lock and then close the database & re-initialize it. While holding the mutex all other methods will block until the GC is done. From the looks that seems to be all - besides testing.
I did it in 77b7a7e
> * The GC loop should probably be changed from a fixed interval to a number of writes instead. So when writes > 1000, do a GC, or something like that. Probably needs a bit of fiddling around.
I decided to leave it on a timer, but it monitors whether the DB was modified (flushed).
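In sketch form the loop is something like the following (all names here are made up for illustration; the real thing is in the commit above):

```go
package dbutil

import (
	"log"
	"time"
)

// gcLoop runs periodically and only pays for a GC cycle when the DB was
// actually flushed since the last run. wasModified and runGC are
// placeholders for the corresponding brig hooks.
func gcLoop(interval time.Duration, wasModified func() bool, runGC func() error, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if !wasModified() {
				continue // nothing flushed since the last GC, skip this round
			}
			if err := runGC(); err != nil {
				log.Printf("badger GC failed: %v", err)
			}
		}
	}
}
```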
I once got a message from the client about being unable to read STATUS from the DB; I am guessing the DB was closed at that instant. I have no clue how `db.Get` got past the mutex. But I cannot reproduce it reliably, and the error dissolved on the next client call.
Wonderful news that a 150 MB DB was compacted to 700 kB on the fly.
By the way, the patch would work with the old badger too.
For the brig database we are currently using badger v1.6.2, and there are thoughts to move to a more modern version, 3.2011.1 (at the time of writing). But first we should think about whether badger is the right DB. badger uses separate storage for keys and values, which allows manipulations on keys to be very fast, and badger is the fastest according to their benchmark. But this kind of DB benefits the most when keys are short and values are large. This is technically not the case for the brig structure: our keys are hashes which are around 50 bytes, and the values are quite short or of comparable size. On one test instance the raw (absolute minimum without compression) size of the keys is 9151 B and of the values 16568 B, so a typical value is about twice as large as the key. More importantly, the values are small, 1-2 kB. So badger, which was designed for big data, brings us no benefit. In fact we have a huge write size amplification due to badger's internal structure: in the above case you would need 9151 + 16568 = 25719 bytes of raw storage, yet the badger v1.6.2 DB takes 517 kB (all it stores is about 100 commits of the same filename with different content). So we see a write amplification of 20. The raw size could be even smaller if we used a more compact value encoding. Also we see that a simple `brig touch newfile` takes 3-4 kB of DB storage. Sure, we all have GBs of HDD, but there is clearly an inefficiency.
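The raw key/value byte counts above came from badger-cli, but they can also be computed directly with an iterator, roughly like this (a sketch; values are copied out, so it is not the fastest way, and `dbutil` is a placeholder package name):

```go
package dbutil

import (
	badger "github.com/dgraph-io/badger/v3"
)

// rawSizes walks all keys and returns the summed raw key and value bytes,
// i.e. the "absolute minimum without compression" figure used above.
func rawSizes(db *badger.DB) (keyBytes, valBytes int64, err error) {
	err = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			item := it.Item()
			keyBytes += int64(len(item.Key()))
			val, err := item.ValueCopy(nil)
			if err != nil {
				return err
			}
			valBytes += int64(len(val))
		}
		return nil
	})
	return keyBytes, valBytes, err
}
```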