inhindsight / hindsight

Persist Performance #197

Open ManApart opened 4 years ago

ManApart commented 4 years ago

Given 10 million small messages, Persist processed them at around 2,000 messages per second. That is about 1/5 the speed of Receive. While Persist does not bottleneck the system, it will get backed up over time if it can't keep up with the other parts of the system. This could also reflect the number of Presto workers, etc.
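To put rough numbers on "backed up over time", here is a back-of-the-envelope sketch in Python. It assumes both rates stay constant for the whole run and takes the Receive rate as 5x the Persist rate, which is inferred from the "1/5" figure above rather than measured separately. The function and constant names are illustrative only, not anything in the codebase.

```python
# Rough backlog estimate, assuming constant sustained rates.
# RECEIVE_RATE is inferred from "1/5 the speed of receive", not measured.
PERSIST_RATE = 2_000              # messages per second (observed)
RECEIVE_RATE = PERSIST_RATE * 5   # messages per second (inferred)
TOTAL_MESSAGES = 10_000_000


def backlog_after(seconds, receive_rate, persist_rate):
    """Messages received but not yet persisted after `seconds` of steady load."""
    return max(0, (receive_rate - persist_rate) * seconds)


# Time for each stage to get through the full 10 million messages.
receive_seconds = TOTAL_MESSAGES // RECEIVE_RATE
persist_seconds = TOTAL_MESSAGES // PERSIST_RATE

# Backlog waiting on Persist at the moment Receive finishes.
backlog_at_receive_done = backlog_after(receive_seconds, RECEIVE_RATE, PERSIST_RATE)

print(f"Receive finishes after {receive_seconds} s")
print(f"Persist finishes after {persist_seconds} s")
print(f"Backlog when Receive finishes: {backlog_at_receive_done} messages")
```

Under those assumptions Persist is still working through roughly 80% of the messages when Receive is done, so any steady input above 2,000 messages per second grows the backlog without bound.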

AC

Tech Notes

jeffgrunewald commented 4 years ago

Persist shouldn't be writing via Presto at all anymore. It's a direct write to S3 as JSON, followed by a select * from json_stage into orc_permanent.

ManApart commented 4 years ago

Isn't it presto that's doing that table copy?

jdenen commented 4 years ago

It is a presto query that moves staged data to the permanent table.

jessie-morris commented 4 years ago

And per Brian, this happens for every batch. So while we don't write an insert statement that inserts 1MB of rows, we write 1MB of rows to the staging table and then run an insert into ... select from against it. In practice that means Presto is hit almost as many times, per my understanding.
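A quick sketch of why the per-batch copy keeps the Presto query count high. It assumes exactly one insert-select per batch, as described above. The rows-per-batch values are made up for illustration, since the thread only says a batch is about 1MB of rows, and the function name is hypothetical.

```python
def presto_queries(total_rows, rows_per_batch):
    """Number of insert-select queries if each batch triggers exactly one."""
    if rows_per_batch <= 0:
        raise ValueError("rows_per_batch must be positive")
    # Integer ceiling division: a partial final batch still costs one query.
    return -(-total_rows // rows_per_batch)


TOTAL_ROWS = 10_000_000

# Hypothetical batch sizes; the real rows-per-batch depends on message size.
for rows_per_batch in (1_000, 10_000, 100_000):
    queries = presto_queries(TOTAL_ROWS, rows_per_batch)
    print(f"{rows_per_batch:>7} rows/batch -> {queries:>6} Presto queries")
```

The query count scales with the number of batches rather than with the data volume, so staging more batches before each copy would be one way to cut the load on Presto, if the extra latency is acceptable.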