Persist throughput not scaling with number of persistent actors

tsandall commented 10 years ago

I have a simple test setup as follows:

N persistent actors that receive commands, persist them immediately, and then reply to the sender inside the event handler
M normal actors that send commands every 10 ms
1 proxy actor that relays commands between the normal and persistent actors and periodically prints the reply rate, giving an estimate of the overall persist throughput

I would expect to see overall persist throughput scale linearly with N up to some limit.

With the Kafka journal, the overall persist throughput only goes from ~4,000 persists/sec for N=1 to ~4,900 for N=100. With the Cassandra journal, on the other hand, it scales as expected, up to ~30,000 persists/sec for N=100.

It seems like there's a bottleneck somewhere in akka-persistence or akka-persistence-kafka.

My guess is that there's a single actor for the journal and, since akka-persistence-kafka uses producer.type = sync, there's no parallelism for sends. The overall persist throughput would then be limited to what a single thread can send to Kafka.

Am I doing something stupid? Any ideas?

If my guess is correct, send parallelism could be achieved by deferring the send operation to a dedicated pool of actors under the Kafka journal actor. Just a thought...

krasserm commented 10 years ago

This is actually a known issue with the Kafka journal actor; I'll fix it soon. One just has to make sure that there's no send concurrency for the same persistenceId, otherwise events could be stored in the wrong order. Sends for different persistenceIds can be made concurrently. Hence, a simple pool will not work. Things will hopefully get much better with Kafka 0.9, which should come with a new async producer API. Thanks for your feedback, I really appreciate it.
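The ordering constraint described above can be illustrated with a minimal sketch. It is not the plugin's implementation: it uses plain `java.util.concurrent` instead of Akka actors and Kafka producers, and the names `PartitionedWriter`, `laneFor`, `write`, and `demo` are hypothetical. The assumption is that each persistenceId is hashed to one of a fixed number of single-threaded "lanes", so writes for the same id are serialized while writes for different ids can run in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: every persistenceId is served by exactly one writer thread. */
class PartitionedWriter implements AutoCloseable {
    // One single-threaded executor per lane; each lane processes its tasks in FIFO order.
    private final ExecutorService[] lanes;
    // Stand-in for the journal: sequence numbers per persistenceId, in the order they were stored.
    private final Map<String, List<Long>> log = new ConcurrentHashMap<>();

    PartitionedWriter(int concurrency) {
        lanes = new ExecutorService[concurrency];
        for (int i = 0; i < concurrency; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** The same persistenceId always maps to the same lane, so its writes stay ordered. */
    int laneFor(String persistenceId) {
        return Math.floorMod(persistenceId.hashCode(), lanes.length);
    }

    /** Simulated send: appends the sequence number on the lane owning this persistenceId. */
    CompletableFuture<Void> write(String persistenceId, long sequenceNr) {
        return CompletableFuture.runAsync(
            () -> log.computeIfAbsent(persistenceId, id -> new ArrayList<>()).add(sequenceNr),
            lanes[laneFor(persistenceId)]);
    }

    List<Long> eventsFor(String persistenceId) {
        return log.getOrDefault(persistenceId, List.of());
    }

    @Override
    public void close() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
    }

    /** Writes 1..events for each id and returns true if every per-id log is complete and in order. */
    static boolean demo(int concurrency, int actors, int events) {
        try (PartitionedWriter writer = new PartitionedWriter(concurrency)) {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (long seq = 1; seq <= events; seq++) {
                for (int a = 0; a < actors; a++) {
                    pending.add(writer.write("actor-" + a, seq));
                }
            }
            // Wait until all simulated sends have completed.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
            for (int a = 0; a < actors; a++) {
                List<Long> stored = writer.eventsFor("actor-" + a);
                if (stored.size() != events) {
                    return false;
                }
                for (int i = 0; i < events; i++) {
                    if (stored.get(i) != i + 1) {
                        return false;
                    }
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("per-id order preserved: " + demo(4, 100, 50));
    }
}
```

A plain shared thread pool would not give this guarantee, because two writes for the same persistenceId could be picked up by different threads and complete out of order. Pinning each id to one lane trades some load balance for per-id ordering.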

krasserm commented 10 years ago

@tsandall would you mind sharing your test code?

tsandall commented 10 years ago

The test code is here: https://github.com/tsandall/kafka-journal-exp

krasserm commented 10 years ago

I've started working on a solution on branch wip-concurrent-writes; the related commit is here. I haven't added any specific tests yet. If you give it a try, please let me know how it works. Thanks!

krasserm commented 10 years ago

Using your test code, I can now see the throughput scaling factor increase from 1.2 (almost no scaling) to 4.5 with the default concurrency settings. I only tested with the embedded TestServer and expect better results with a standalone Kafka server. Further performance and scalability improvements are still on my list; for now, I'll cut another release with the latest improvements.

tsandall commented 10 years ago

Updated to 0.3 and verified the scaling improvements.

With the introduction of kafka-journal.write-concurrency, does it make sense to change the default-dispatcher to a PinnedDispatcher?

As a side note, the throughput limit seems to be ~15,000 persists/sec on my system, compared to ~30,000 persists/sec using the akka-persistence-cassandra plugin. The 15k is good enough for me for now; I look forward to performance improvements in the future.

Again, thanks for the quick improvements on the plugin. Looking forward to integrating with it in the near future.

krasserm commented 10 years ago

> With the introduction of kafka-journal.write-concurrency does it make sense to change the default-dispatcher to a PinnedDispatcher?

This wouldn't allow concurrent replays at the moment (although this could be solved otherwise). Without having a reasonable load test, it's hard to say whether a dispatcher with n threads (that can either serve writes or replays) or a pinned dispatcher per writer is better. Anyway, I'll experiment with that as I continue working on performance improvements.

> As a side note, the throughput limit seems to be ~15,000 persist/sec on my system. This is compared to ~30,000 persist/sec using the akka-persistence-cassandra plugin. The 15k is good enough for me for now, I'll look forward to performance improvements in the future.

Without cluster-sharding and with only a single persistent actor (using persistAsync, which allows for internal batching), the throughput goes up to 70k messages per second. From earlier experiments, I know that cluster-sharding contributes significantly to the throughput decrease, but I'm not sure whether this is still true for the latest version. Furthermore, with your test setup, a different batching strategy that batches messages across persistent actors could further improve throughput. This, however, would likely require additions to akka-persistence itself.

krasserm / akka-persistence-kafka

Persist throughput not scaling with number of persistent actors #6