logv / sybil

columnar storage + NoSQL OLAP engine | https://logv.org

parallelising digestion #106

Closed gouthamve closed 4 years ago

gouthamve commented 4 years ago

Currently I am writing around 1.2K records a second and digestion is sometimes taking 5-10 seconds. I am writing the records in batches of 256, i.e., each ingest call has 256 records. I am running with -skip-compact and running digestion in a background routine every 2 seconds. I want to increase the ingestion rate to 10K records a second and am worried that digestion might not be able to keep up.

Any suggestions on how I can improve ingest rate?
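
For reference, a rough sketch of what my setup looks like (the table name and record shape are placeholders, and it assumes ingest reads newline-delimited JSON on stdin; only -skip-compact and the 2-second background digest come from the description above):

```go
package main

import (
	"encoding/json"
	"os/exec"
	"strings"
	"time"
)

// ingestBatch pipes one batch of records into `sybil ingest` as
// newline-delimited JSON. -skip-compact defers digestion so it can be run
// separately. (Table name, record shape and the stdin format are
// placeholders / assumptions.)
func ingestBatch(table string, records []map[string]interface{}) error {
	var buf strings.Builder
	for _, r := range records {
		line, err := json.Marshal(r)
		if err != nil {
			return err
		}
		buf.Write(line)
		buf.WriteByte('\n')
	}
	cmd := exec.Command("sybil", "ingest", "-table", table, "-skip-compact")
	cmd.Stdin = strings.NewReader(buf.String())
	return cmd.Run()
}

// digestLoop is the background routine: it runs `sybil digest` every 2 seconds.
func digestLoop(table string) {
	for range time.Tick(2 * time.Second) {
		_ = exec.Command("sybil", "digest", "-table", table).Run()
	}
}

func main() {
	go digestLoop("events")
	// ingestBatch("events", batch) is called from the write path for each
	// batch of 256 records.
	select {}
}
```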

okayzed commented 4 years ago

What does a typical record schema look like? The ingestion/digestion rate will be a function of how much data is in those 1.2K samples.

I will need to look into this question (parallelizing digestion) some more; I believe the digestion process uses locks to prevent multiple digestions from happening at once and corrupting the DB files.

Ingestion is usually just dropping a row-form file into the ingest/ directory, so the bottleneck should be the digestion phase. If you want to dig into where the time is going during digestion, you can build sybil with profiling info (make profile) and then examine the profile output using the Go pprof tools.

a digestion usually works like this:

If you are digesting a partial block multiple times (say you digest at 5K, 10K, 15K, etc.), you are going to run into inefficiency due to redundant work. It would be better to digest at 20K, 40K and 60K (3 times instead of 13 times).
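
As a rough sketch (not actual sybil code; the threshold and table name are placeholders), you can get that behavior by triggering digestion off a record counter instead of the fixed 2-second timer:

```go
package main

import (
	"os/exec"
	"sync/atomic"
)

// Digest roughly once per 20K records (matching the 20/40/60K example above)
// instead of every 2 seconds. The threshold should be tied to the table's
// block size; 20000 is just a placeholder.
const digestThreshold = 20000

var pendingRecords int64 // records ingested since the last digest

// afterIngest is called with the size of each ingested batch and kicks off
// a digest only once enough new records have accumulated.
func afterIngest(table string, batchSize int) {
	if atomic.AddInt64(&pendingRecords, int64(batchSize)) < digestThreshold {
		return
	}
	atomic.StoreInt64(&pendingRecords, 0)
	go func() {
		_ = exec.Command("sybil", "digest", "-table", table).Run()
	}()
}

func main() {
	afterIngest("events", 256) // e.g. from the write path after each ingest call
}
```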

One lever that can be adjusted is how big a block is - the smaller the block, the less time it takes to compact. But I'm surprised a digest is taking 5-10 seconds; it likely means there is a lot of data in the block. (If you can give me the output of ls -l on a block and the -debug output from running sybil digest -debug -table <foo>, that will also help.)

okayzed commented 4 years ago

Based on the comment on the other issue, I think moving to set columns will help with digestion speed, but it will really depend on the shape of your data. I recommend trying set columns out and seeing how fast the digestion process is compared to your current scheme.
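
To make that concrete: as far as ingest is concerned, a set column is just an array-valued field in the JSON record (field names here are made up; the exact shape will depend on your data):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Several related string values folded into one set-typed column ("tags")
	// instead of separate string columns. Field names are illustrative only.
	record := map[string]interface{}{
		"time":   1500000000,
		"status": "ok",
		"tags":   []string{"region:us-east", "service:api", "tier:frontend"},
	}
	line, _ := json.Marshal(record)
	fmt.Println(string(line)) // one newline-delimited JSON record for `sybil ingest`
}
```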

I saw redbull - cool repo! (and nice usage of sybil :)

okayzed commented 4 years ago

#113 is about speeding up huge tables. As part of it, the number of lstat calls was reduced, and digestion for large tables should now be much faster.