timshannon / badgerhold

BadgerHold is an embeddable NoSQL store for querying Go types built on Badger
MIT License

Performance considerations and optimizing a structure #19

Closed: davedbase closed this issue 4 years ago

davedbase commented 4 years ago

I've been passively working with BadgerHold and I really like the simplicity of it. I've managed to migrate a MySQL data structure to a simplified version. My structure looks similar to this now:

// Document represents the main document structure
type Document struct {
    ID        uint64 `badgerhold:"key"`
    Type      string `badgerholdIndex:"Type"`
    Title     string
    Content   string
    Excerpt   string
    Reference string
    Status    string
    Created   time.Time
    Updated   time.Time
}
// Property represents meta value for a document
type Property struct {
    Document uint64 `badgerholdIndex:"Document"`
    Key      string `badgerholdIndex:"Key"`
    Value    interface{}
}
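
For context, a rough sketch of how I open the store and load these types; the directory path and field values are just placeholders, and it assumes the Document type above plus the badgerhold, log, and time packages are imported:

// Open the store (path is a placeholder).
options := badgerhold.DefaultOptions
options.Dir = "/tmp/badgerhold-data"
options.ValueDir = "/tmp/badgerhold-data"

store, err := badgerhold.Open(options)
if err != nil {
    log.Fatal(err)
}
defer store.Close()

// Insert a Document, letting badgerhold assign the next sequential key.
doc := Document{
    Type:    "speaker",
    Title:   "Example title",
    Status:  "published",
    Created: time.Now(),
    Updated: time.Now(),
}
if err := store.Insert(badgerhold.NextSequence(), &doc); err != nil {
    log.Fatal(err)
}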

I have about 1,100 Document types and about 15,000 Property types in Badger. There are no subfields and not much complexity in general. The Badger file size is 273 MB. When I run the following query I'm getting what I believe may not be optimal performance:

speakers := []Document{}
store.Find(&speakers, badgerhold.Where(badgerhold.Key).Eq(uint64(1128)))

I can run that query in 31.644858ms. An equivalent run on a similarly sized RDBMS (the exact same dataset) takes 0.0007s, a considerable difference. I'm wondering if I'm not using this in the best way possible; perhaps I'm making a mistake, or there could be another performance blocker. I would hope that BadgerHold would perform at least as fast as an SQLite implementation, but I may be wrong in that expectation.
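
A simplified sketch of how such a timing can be captured around the call:

start := time.Now()

speakers := []Document{}
err := store.Find(&speakers, badgerhold.Where(badgerhold.Key).Eq(uint64(1128)))
if err != nil {
    log.Fatal(err)
}

log.Printf("Find by key took %s", time.Since(start))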

Mind providing some insight? I'd be willing to share my dataset privately if you'd like to take a peek out of curiosity.

timshannon commented 4 years ago

Yeah, I would not expect anywhere near the same performance you'd get from either MySQL or SQLite, two databases that have had decades of efficiencies written into them.

That being said, your query should be much faster, not only because you are looking up on the Key, but because you are looking for a single key.

Badgerhold does absolutely no query optimization. It gets the data in exactly the way you specify in your query. You even have to specify your index. MySQL and SQLite parse out the query and use some very complicated heuristics to find the most efficient way to look up what you are asking for in your query.

The long and the short of it is that your query above would be a lot faster if you just did a Get on your key:

speaker := Document{}
store.Get(uint64(1128), &speaker)

The difference between the two is that your query loops through each key and checks if it matches the criteria, whereas this second query just goes straight to the one key and pulls back the data.

Now, I could change the query engine to recognize that specific query and do a Get in the background. That would be the first step in building a query optimizer, which is something that I feel is well outside the scope of this little project.
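
To put it another way, the caller can make that decision itself today. Something like this purely illustrative helper goes straight to Get when the key is already in hand, and everything else would go through Find:

// fetchDocument is a hypothetical helper: when the caller already has the
// primary key, it goes straight to Get instead of running a Find query.
func fetchDocument(store *badgerhold.Store, key uint64) (Document, error) {
    var doc Document
    err := store.Get(key, &doc)
    return doc, err
}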

The advantage of not having any query optimization is that the performance is very deterministic. Anyone who's written a complicated query in SQL Server will know the frustration of having your query run in 100 ms one day, and 30 minutes the next because the statistics powering the query plan optimizer changed underneath you in the background.

davedbase commented 4 years ago

Thanks. I agree from a query-analysis perspective: in those databases the query plan itself is optimised, whereas this implementation (or at least my own query) relies on looping over each record. I assumed that even with an index the performance would be a bit better.

Thanks for the advice and guidance. I appreciate it.

timshannon commented 4 years ago

I assumed even with an index that the performance would still be a bit better.

Exactly. If you index on the most unique value, then you'll save thousands of reads from disk.
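
For example, something like this (the "speaker" value is only illustrative) points the query at the Type index instead of scanning every record; as mentioned above, the index has to be named explicitly:

var speakers []Document
err := store.Find(&speakers,
    badgerhold.Where("Type").Eq("speaker").Index("Type"))
if err != nil {
    log.Fatal(err)
}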