sclasen / akka-kafka


Specify receiver as Props instead of ActorRef? #15

Closed tarmath closed 10 years ago

tarmath commented 10 years ago

Hello!

Maybe there's something I don't understand, as I am fairly new to the world of Akka, but wouldn't it be wiser to give the helper methods a way to instantiate the actor that processes the messages, so that each StreamFSM actor creates its own instead of all of them sending to the same ActorRef? Sharing one ActorRef could become a bottleneck whenever message processing is long-running (intensive or not), for example when submitting the message to a third-party service and waiting for the answer.

The way I see it right now, it feels like the parallelism of having multiple StreamFSMs is wasted, since the messages are all processed by the same ActorRef in the end.

Thanks for your great work!

sclasen commented 10 years ago

Hi @tarmath

If your processing actor(s) are parallelizable, then the ActorRef you use with akka-kafka could certainly be a Router backed by several underlying processor actors.

Akka benchmarks show that a single actor can process tens of millions of messages per second (at least), so you have to be going pretty fast to bottleneck there.

In many cases you will want to have a single actor that manages some state. Imagine the simplest case, where your processing actor simply counts messages received: var count: Int = 0. Using a single actor here makes the problem trivial. (Correctly) aggregating state across a set of processing actors to calculate the total count is less so.
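A minimal sketch of the router idea, assuming Akka 2.3 or later (where `RoundRobinPool` is available). `Processor`, `Processed`, `ReceiverSketch`, and the pool size of 8 are hypothetical names and values chosen for illustration. `Processed` is only a local stand-in for whatever acknowledgement the consumer expects, not akka-kafka's own message type.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Hypothetical stand-in for the acknowledgement the consumer expects back.
case object Processed

// Hypothetical worker: handles one message at a time, then acknowledges it.
class Processor extends Actor {
  def receive: Receive = {
    case msg =>
      // ... slow work for this message would go here ...
      // The pool router keeps the original sender, so the ack goes back
      // to whoever sent the message to the router.
      sender() ! Processed
  }
}

object ReceiverSketch {
  // One ActorRef from the caller's point of view, backed by 8 Processor routees.
  def pooledReceiver(system: ActorSystem): ActorRef =
    system.actorOf(RoundRobinPool(8).props(Props[Processor]), "processors")
}
```

The `ActorRef` returned by `pooledReceiver` is what would be handed over as the single receiver. Messages are spread round-robin across the routees, so ordering across messages is not preserved.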
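The counting case could be sketched roughly as follows. `CountingActor` and `GetCount` are hypothetical names, and any acknowledgement back to the consumer is left out to keep the example focused on the state.

```scala
import akka.actor.Actor

// Hypothetical query message used to read the current total.
case object GetCount

// Hypothetical single stateful receiver. An actor processes one message at a
// time, so the counter needs no locking or cross-actor aggregation.
class CountingActor extends Actor {
  private var count: Int = 0

  def receive: Receive = {
    case GetCount => sender() ! count // reply with the total seen so far
    case _        => count += 1       // every other message is counted
  }
}
```

With several such actors behind a router, each would hold only a partial count, and something would have to collect and sum them to get the total.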

Make sense?

That said, I have considered adding a way to specify a processing actor per StreamFSM, to allow for ordered processing (where each processing actor will see the messages in order from the stream/Kafka partition(s) it is associated with).
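To make the Props-based idea concrete, here is a rough sketch of the general Akka pattern it implies. This is not akka-kafka's API or its StreamFSM implementation. `StreamWorker` is a hypothetical stand-in for a per-stream actor that is given `Props` and creates its own child receiver.

```scala
import akka.actor.{Actor, ActorRef, Props}

// Hypothetical per-stream actor: it is handed Props rather than an ActorRef,
// so every instance creates and owns a private receiver as a child actor.
class StreamWorker(receiverProps: Props) extends Actor {
  private val receiver: ActorRef = context.actorOf(receiverProps, "receiver")

  def receive: Receive = {
    // Forwarding keeps the original sender, and this child only ever sees
    // the messages of this one worker, in the order they arrive here.
    case msg => receiver forward msg
  }
}

object StreamWorker {
  // Factory for the worker's own Props, taking the receiver's Props as input.
  def props(receiverProps: Props): Props = Props(new StreamWorker(receiverProps))
}
```

Creating N such workers from the same `Props` yields N independent receivers, which is the per-stream ordering property described above, at the cost of no longer having one place that holds shared state.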

tarmath commented 10 years ago

Using a router would certainly solve my problem here, though it also sounds a little less "parallel" to me, as there is only one router taking in the messages from multiple StreamFSMs. At any rate, as you mentioned, it would take a lot of messages to overwhelm the router, and I can create more processors as needed, which is really where the slow processing happens (not due to Akka but to the nature of my problem).

As for the issue where you would need to count the messages received, wouldn't that work trivially if you declare streams = 1 since you don't want parallelism in your receiver anyway? I guess I don't see the advantage of using multiple StreamFSM if it all ends up in the same Receiver? Doesn't it break the idea that having multiple Kafka streams is typically needed when the processing is parallelizable?

Finally, if you do end up making it possible to have a processing actor per stream, I would definitely use that feature!

Thanks again for your quick feedback and your suggestion!

sclasen commented 10 years ago

@tarmath one thing to note about number of streams...

If your consumer has fewer streams (in aggregate across the consumer group) than the number of partitions in the topic, then the underlying Kafka consumer won't be receiving messages from all partitions, but will instead switch the set of partitions it is consuming from every 10 minutes (by default).

There is more doc on that behavior at the Kafka website (if you look closely enough ;).