logv / sybil

columnar storage + NoSQL OLAP engine | https://logv.org

panic when digesting #105

Closed gouthamve closed 4 years ago

gouthamve commented 4 years ago

Hey everyone, Merry Christmas! I'm trying to use sybil in a write-heavy scenario (as an index for https://github.com/jaegertracing/jaeger/) and got quite far. I'm flushing 256 records at a time and doing about 1000 records/sec. I'm skipping digestion on ingest and running the digest in a background routine every 2 seconds.
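
For reference, the background digest routine is roughly this (a minimal sketch, assuming sybil's `digest` subcommand with `-table` and `-dir` flags; the loop itself is illustrative, not the exact code):

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	// Illustrative sketch: ingestion skips digestion, and this loop shells
	// out to `sybil digest` every 2 seconds. The table name and directory
	// match the paths in the log below.
	for range time.Tick(2 * time.Second) {
		cmd := exec.Command("sybil", "digest", "-dir", "/data/sybil-db", "-table", "jaeger")
		if err := cmd.Run(); err != nil {
			log.Println("digest failed:", err)
		}
	}
}
```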

I'm seeing this panic:

2019/12/24 20:15:34 WRITING PID 7595 TO LOCK /data/sybil-db/jaeger/info.lock
2019/12/24 20:15:34 LOCKING /data/sybil-db/jaeger/info.lock
2019/12/24 20:15:34 OPENING TABLE INFO FROM FILENAME /data/sybil-db/jaeger/info.db
2019/12/24 20:15:34 UNLOCKING /data/sybil-db/jaeger/info.lock
2019/12/24 20:15:34 WRITING PID 7595 TO LOCK /data/sybil-db/jaeger/stomache.lock
2019/12/24 20:15:34 LOCKING /data/sybil-db/jaeger/stomache.lock
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_095872863.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_502198671.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_762224642.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_582140463.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_214818442.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_113069503.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_482631791.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_418823239.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_351887204.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_071346249.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_470188504.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_199584949.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_097573806.db
2019/12/24 20:15:34 LOADED 256 FOR DIGESTION FROM /data/sybil-db/jaeger/stomache817963248/ingest_942070294.db
panic: runtime error: index out of range [240] with length 240

goroutine 1 [running]:
github.com/logv/sybil/src/lib.SavedRecord.toRecord(0xc005588300, 0x4, 0x4, 0xc005576900, 0xf, 0xf, 0x0, 0x0, 0x0, 0xc0000c4000, ...)
    /home/goutham/go/src/github.com/logv/sybil/src/lib/row_store.go:54 +0x995
github.com/logv/sybil/src/lib.(*Table).LoadRecordsFromLog(0xc0000c4000, 0xc00539ed80, 0x3b, 0xc00539ed80, 0x3b, 0x13)
    /home/goutham/go/src/github.com/logv/sybil/src/lib/row_store.go:132 +0x208
github.com/logv/sybil/src/lib.(*Table).LoadRowStoreRecords(0xc0000c4000, 0xc000116166, 0x11, 0xc0031a1d70)
    /home/goutham/go/src/github.com/logv/sybil/src/lib/table_ingest.go:176 +0x441
github.com/logv/sybil/src/lib.(*Table).DigestRecords(0xc0000c4000)
    /home/goutham/go/src/github.com/logv/sybil/src/lib/table_ingest.go:324 +0x489
github.com/logv/sybil/src/cmd.RunDigestCmdLine()
    /home/goutham/go/src/github.com/logv/sybil/src/cmd/cmd_digest.go:27 +0x233
main.main()
    /home/goutham/go/src/github.com/logv/sybil/main.go:95 +0xcc

This is with the latest master. It fails again for the next few digest attempts but later succeeds, and I am curious what might be causing it. Also, digestion is taking up to 10 seconds; is there anything I can optimise to make it faster?

Thanks a ton for this project!

okayzed commented 4 years ago

I see the line that is erroring, but I don't understand why just yet. My initial guess is that a new int column is in the Record but it's not yet seen in the table's KeyTable (how this happens is unknown; I will have to investigate and test my current assumptions).

An easy mitigation is to resize Populated if we see any IDs larger than the current max_int_key. You can try out master to see if this interim fix works.
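
For illustration, the guard described would look something like this (a minimal sketch; the names are made up and not sybil's actual internals):

```go
package sketch

// markPopulated grows the populated slice when a record references a column
// ID past its end, instead of indexing out of range as in the panic above.
// Names here (populated, colID) are illustrative, not sybil's actual code.
func markPopulated(populated []bool, colID int) []bool {
	if colID >= len(populated) {
		grown := make([]bool, colID+1)
		copy(grown, populated)
		populated = grown
	}
	populated[colID] = true
	return populated
}
```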

okayzed commented 4 years ago

My question is: how are new columns being added to the table? Are your samples dynamically getting new columns on each ingestion?

I have ideas for mitigating this, but it is a little weird to me that it's happening. I would like to repro it and fix it properly.

gouthamve commented 4 years ago

> Are your samples dynamically getting new columns on each ingestion?

I am not sure, but I think so: https://github.com/gouthamve/redbull/blob/master/sybil.go#L171-L197

So spans have tags, and different spans can have different sets of tags, and I am creating a column for each tag. This could mean that in 10K rows there is only one row with the column veryUniqueTag. Is that a problem?
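
Roughly, the linked code does this (a paraphrased sketch, not the file verbatim; the types stand in for the Jaeger model):

```go
package sketch

// Tag is a stand-in for a Jaeger span tag (key/value pair).
type Tag struct {
	Key   string
	Value string
}

// spanToRecord flattens each tag into its own top-level field, so every
// previously unseen tag key becomes a brand-new column in the sybil table,
// even if only a single row ever has it (e.g. veryUniqueTag).
func spanToRecord(service string, durationUs int64, tags []Tag) map[string]interface{} {
	record := map[string]interface{}{
		"service":  service,
		"duration": durationUs,
	}
	for _, t := range tags {
		record[t.Key] = t.Value // one dynamic column per tag key
	}
	return record
}
```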

okayzed commented 4 years ago

Depending on what you are doing with those tags, a set column might be appropriate. If you aren't using the UI, you can add the tags as a set column by passing an array of string values to sybil. The set will show up when you run `sybil query -table -info`, and you'll be able to use 'in' and 'nin' with -set-filter on the set fields, like `-set "tags:in:foo"`. The set field will show up in the samples query as an array, but it cannot be grouped by yet in other queries.
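
For example, a record could carry all tags in one array-valued field (a minimal sketch, assuming sybil's JSON ingestion turns a string array into a set column, per the comment above; the field name and values are made up):

```go
package main

import (
	"encoding/json"
	"os"
)

func main() {
	// All tags go into a single "tags" field as an array of strings instead
	// of one column per tag key; the JSON output can be piped into
	// `sybil ingest`.
	record := map[string]interface{}{
		"service": "frontend",
		"tags":    []string{"http.status_code:200", "span.kind:server"},
	}
	json.NewEncoder(os.Stdout).Encode(record)
}
```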

Now that I know the scenario a bit better, I can investigate how to properly fix it.

okayzed commented 4 years ago

The advantages and disadvantages of the two options:

One dynamic column per tag: each tag can be filtered and grouped like any other column, but sparse tags create lots of mostly-empty columns, and new columns showing up on nearly every ingestion is what triggered this panic.

One set column for tags: the schema stays small and 'in'/'nin' filters work across all tags, but set fields cannot be grouped by yet.

I will pursue making the dynamic column generation safer.

okayzed commented 4 years ago

Adding `-save-srb` (from a1d2f05be92ebce9a8e9d00013a88c54173f2c18 in master) should help alleviate this issue. (Crossing fingers.)

okayzed commented 4 years ago

I believe `-save-srb` worked here for tables that have lots of dynamic columns being added. I'm debating whether to always turn it on, but I will add information to the main README about this gotcha.