timshannon / bolthold

BoltHold is an embeddable NoSQL store for Go types built on BoltDB
MIT License

Write performance #105

Closed: cbrake closed this issue 4 years ago

cbrake commented 4 years ago

I've been using bolthold on an embedded Linux system (eMMC storage). I'm noticing that as the DB grows, the write performance falls off linearly.

[Figure: bolthold insert time vs. sample count, with the Type index]

I'm using an increasing timestamp for the key, so I would expect the writes to be sequential rather than random access.

Below is the insert code:


// Sample represents a value in time and should include data that may be
// graphed.
type Sample struct {
    // Type of sample (voltage, current, key, etc)
    Type string `json:"type,omitempty" boltholdIndex:"Type" influx:"type,tag"`

    // ID of the device that provided the sample
    ID string `json:"id,omitempty" influx:"id,tag"`

    // Average OR
    // Instantaneous analog or digital value of the sample.
    // 0 and 1 are used to represent digital values
    Value float64 `json:"value,omitempty" influx:"value"`

    // statistical values that may be calculated
    Min float64 `json:"min,omitempty" influx:"min"`
    Max float64 `json:"max,omitempty" influx:"max"`

    // Time the sample was taken
    Time time.Time `json:"time,omitempty" boltholdKey:"Time" gob:"-" influx:"time"`

    // Duration over which the sample was taken
    Duration time.Duration `json:"duration,omitempty" influx:"duration"`

    // Tags are additional attributes used to describe the sample
    // You might add things like friendly name, etc.
    Tags map[string]string `json:"tags,omitempty" influx:"-"`

    // Attributes are additional numerical values
    Attributes map[string]float64 `json:"attributes,omitempty" influx:"-"`
}

// DataMeta is used to store meta information about data in the database
type DataMeta struct {
    SampleCount int
}

// WriteSample writes a sample to the database
// Samples are flow, pressure, amount, etc.
func (db *IsDb) WriteSample(sample data.Sample) error {
    dataMeta := DataMeta{}
    err := db.store.Get(0, &dataMeta)
    if err != nil {
        // attempt to init metadata
        _, err = db.GetSampleCount()
        if err != nil {
            return err
        }
    }
    err = db.store.Insert(sample.Time, sample)
    if err != nil {
        return err
    }

    dataMeta.SampleCount++
    return db.store.Upsert(0, &dataMeta)
}

Once I get to 100,000 samples or so, performance is really slow (2+ seconds to insert a sample). I think something is not quite right, as I've read about people using multi-TB Bolt databases, but with my use case it seems there is no way this could work.

I tried setting FreelistType to FreelistMapType -- that did not seem to make any difference.
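
For reference, a sketch of setting that option, assuming the bbolt-backed bolthold where bolthold.Options embeds *bbolt.Options (file name is made up):

// Sketch: opening a store with the hashmap freelist. Assumes a bbolt-backed
// bolthold where bolthold.Options embeds *bbolt.Options.
package main

import (
	"log"

	"github.com/timshannon/bolthold"
	bolt "go.etcd.io/bbolt"
)

func main() {
	opts := &bolthold.Options{
		Options: &bolt.Options{FreelistType: bolt.FreelistMapType},
	}
	store, err := bolthold.Open("samples.db", 0666, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer store.Close()
}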

I'd appreciate any thoughts: is this normal, or can it be optimized?

Cliff

timshannon commented 4 years ago

Do you get the same performance drop-off if you don't use an index? If there is no index handling in the insert, then performance should be exactly the same as encoding time plus the normal Bolt insert time.
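
(For anyone following along: testing without the index just means dropping the boltholdIndex tag from the struct. A minimal sketch, other fields unchanged:)

// Sketch: the Sample struct with the boltholdIndex tag removed, so inserts
// skip index maintenance entirely. Remaining fields stay as in the original.
type Sample struct {
	// No boltholdIndex tag: bolthold does no index bookkeeping on insert.
	Type string `json:"type,omitempty" influx:"type,tag"`
	// ... other fields unchanged ...
}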

cbrake commented 4 years ago

Much flatter without the index:

[Figure: bolthold insert time vs. sample count, without the index]

So I guess with time-series data you don't really want to use an index, because the index gets huge.

Another way to do this might be to put each sample type in its own bucket.

Or there may be a more efficient way to implement an index: perhaps a separate bucket for each sample Type, with each index entry stored as a separate record in that bucket; then adding records would be fast? Databases are fun to think about, with lots of tradeoffs to be made.

Thanks for the help!

timshannon commented 4 years ago

You should still be able to use indexes on time-series data, but what I'm guessing is happening is that your index on Type might not be very unique. It's usually a good idea to have fairly unique values in indexes; however, in a regular database it shouldn't impact performance this drastically during inserts.

However, what I do with indexes in bolthold is a pretty naive implementation. I simply store the entire index under one key value, so the less unique the index, the more data gets stored (and thus decoded and re-encoded) on each insert. I'm guessing that's what's happening in your scenario.
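
To make that concrete, here is a toy model of that scheme (illustrative only, not bolthold's actual code): all primary keys sharing one index value live under a single entry, so every insert decodes, appends to, and re-encodes the whole list.

// Toy model of a "whole index value under one key" index (illustrative only,
// not bolthold's actual code). Every primary key sharing an index value is
// gob-encoded into one slice, so each insert decodes and re-encodes the whole
// slice: O(n) work once an index value has n entries.
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// index maps an index value (e.g. Type == "voltage") to the encoded slice of
// every primary key carrying that value.
var index = map[string][]byte{}

func addToIndex(indexValue, primaryKey string) error {
	var keys []string
	if enc, ok := index[indexValue]; ok {
		// Decode the entire existing key list...
		if err := gob.NewDecoder(bytes.NewReader(enc)).Decode(&keys); err != nil {
			return err
		}
	}
	keys = append(keys, primaryKey)

	// ...and re-encode all of it on every single insert.
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(keys); err != nil {
		return err
	}
	index[indexValue] = buf.Bytes()
	return nil
}

func main() {
	// With ~6 distinct Types across 100,000 samples, each list holds tens of
	// thousands of keys, so per-insert cost grows with total sample count.
	for i := 0; i < 3; i++ {
		if err := addToIndex("voltage", fmt.Sprintf("key-%d", i)); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println(len(index["voltage"]), "bytes stored for the \"voltage\" entry")
}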

I can make my index handling more like a "real" database, and split the values across multiple keys, but it'll take quite a bit of reworking.

I'll open an issue for that. I appreciate you bringing this up.
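
For illustration, one common way to split an index across keys (an assumption about the direction of that rework, not bolthold's current code) is one entry per (index value, primary key) pair, keyed by a composite key:

// Sketch of a split index: one bucket entry per (index value, primary key)
// pair. An insert then writes one small key instead of rewriting a whole key
// list, and a lookup prefix-scans all keys starting with the index value.
package main

import "fmt"

// compositeKey joins index value and primary key with a 0x00 separator so
// entries for one index value sort together under a common prefix.
func compositeKey(indexValue, primaryKey string) []byte {
	k := make([]byte, 0, len(indexValue)+1+len(primaryKey))
	k = append(k, indexValue...)
	k = append(k, 0x00)
	k = append(k, primaryKey...)
	return k
}

func main() {
	fmt.Printf("%q\n", compositeKey("voltage", "2020-06-01T12:00:00Z"))
}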

cbrake commented 4 years ago

Yes, I'm using a small number of Types relative to the number of samples (maybe 6 or so), so they are not very unique.

cbrake commented 4 years ago

One more note: without an index, and with 500,000 samples in the DB, the insert time is still ~50 ms/sample. This is great; it means I can use bolthold to record about any amount of time-series data on this device. I'm currently using around 715 bytes/sample and would like to experiment with protobuf to see if that would be faster/more efficient.

nicewook commented 4 years ago

Your discussion helped me a lot. Do you think the number of indexed fields also affects performance? I need to query logs by start/end date across 1,000,000 logs, so I need an index. Can I ask for any suggestions?

timshannon commented 4 years ago

Having many indexes will definitely impact the performance of inserts and updates, because every one of those indexes needs to be maintained on each insert and update.

I wouldn't recommend putting an index on a date/time if you can help it. Go Time values are very precise, so nearly every value is unique, and an index on them buys you little over just using the time as your key.

If you have start date and end date as fields, I would recommend having start as your key value, and always querying with the start date.

cbrake commented 4 years ago

One problem I ran into using the Go Time type as a key: the gob-encoded bytes of a time.Time are not monotonic with the time value, so seeks to a date would not always work. When I converted the timestamps to int64 and inserted the bytes into the key in big-endian order, seeks were fast and reliable. I may be missing something, but since Go's Time is a struct, its encoded form will likely never sort in time order.
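
A minimal sketch of that encoding (the helper name is mine; assumes post-1970 timestamps so the int64 is non-negative and byte order matches time order):

// timeKey encodes a timestamp as 8 big-endian bytes of its Unix nanoseconds,
// so lexicographic key order matches chronological order and range seeks work.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"time"
)

func timeKey(t time.Time) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(t.UnixNano()))
	return buf
}

func main() {
	a := timeKey(time.Now())
	b := timeKey(time.Now().Add(time.Second))
	fmt.Println(bytes.Compare(a, b)) // -1: the earlier time sorts first
}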

nicewook commented 4 years ago

@timshannon Thank you for your advice.

  1. Use the minimum number of indexes I can afford!
  2. Do I need to use both the tag key and an index, or will just the key work?
  3. I tested with badgerhold and it performs much better, but it requires a lot more disk space.
  4. I didn't quite get this part; could you explain a bit more?

"If you have start date and end date as fields, I would recommend having start as your key value, and always querying with the start date."

@cbrake Thanks. I will try using int64 (Unix time): I get the query start/end dates as RFC3339, convert them to int64, and then query bolthold.
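
A rough sketch of that plan (LogEntry and findRange are hypothetical names; assumes entries were inserted with an int64 Unix-time key, e.g. store.Insert(e.Time, e)):

// Sketch: parse RFC3339 bounds, convert to int64 Unix time, and query the
// key range with bolthold's Key sentinel.
package main

import (
	"time"

	"github.com/timshannon/bolthold"
)

type LogEntry struct {
	Time    int64 // Unix seconds, also used as the bolthold key on Insert
	Message string
}

// findRange returns entries whose key falls in [start, end).
func findRange(store *bolthold.Store, startRFC, endRFC string) ([]LogEntry, error) {
	start, err := time.Parse(time.RFC3339, startRFC)
	if err != nil {
		return nil, err
	}
	end, err := time.Parse(time.RFC3339, endRFC)
	if err != nil {
		return nil, err
	}

	var result []LogEntry
	err = store.Find(&result,
		bolthold.Where(bolthold.Key).Ge(start.Unix()).And(bolthold.Key).Lt(end.Unix()))
	return result, err
}

func main() {
	store, err := bolthold.Open("logs.db", 0666, nil)
	if err != nil {
		panic(err)
	}
	defer store.Close()

	entries, err := findRange(store, "2020-06-01T00:00:00Z", "2020-06-02T00:00:00Z")
	if err != nil {
		panic(err)
	}
	_ = entries
}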