State partitioning

fountainheadpro commented 9 years ago

I just studied the code and I think it's a great jump forward towards stateful data stream processing.

One thing to think about is state partitioning. Based on what I see, all the cluster nodes with a particular role take part in replication: https://github.com/patriknw/akka-data-replication/blob/master/src/main/scala/akka/contrib/datareplication/Replicator.scala#L1066-L1068

So if I have a total of 10GB of state to maintain for my stream processing, I will have to allocate 10GB of memory on each node. It would be great to be able to manage the number of partitions as well as the partition replication factor, Kafka-style. This way I could have 10 partitions of approximately 1GB each. Assuming a replication factor of 3 and replicas spread evenly over, say, 10 nodes, I would have to allocate only about 3GB on each node.
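
To make the arithmetic concrete, here is a minimal sketch in Python. It is purely illustrative: the function name and the 10-node cluster size are assumptions, and nothing here is provided by akka-data-replication. It only compares the average memory per node under full replication and under partitioning with a replication factor.

```python
def per_node_gb(total_gb: float, partitions: int,
                replication_factor: int, nodes: int) -> float:
    """Average state held per node, assuming replicas are spread evenly.

    Hypothetical helper for illustration only.
    """
    if replication_factor > nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    partition_gb = total_gb / partitions
    # Every partition is stored replication_factor times somewhere in the cluster.
    total_replicated_gb = partition_gb * partitions * replication_factor
    return total_replicated_gb / nodes


# Current behaviour: one logical partition, replicated to all 10 nodes (assumed cluster size).
full_replication = per_node_gb(total_gb=10, partitions=1, replication_factor=10, nodes=10)

# Proposed: 10 partitions of about 1GB each, replication factor 3, same 10 nodes.
partitioned = per_node_gb(total_gb=10, partitions=10, replication_factor=3, nodes=10)

print(full_replication, partitioned)
```

Note that the per-node average depends on the node count: the same 30GB of replicated data spread over 5 nodes would come to about 6GB each.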

This would need to be combined with partition-based routing to make sure each request gets routed to a partition leader.
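
A rough sketch of what such routing could look like, again in Python and under stated assumptions: keys are hashed to a partition with CRC32 (a stand-in, not what Kafka or Akka actually use), replicas are placed on consecutive nodes in a ring, and the first replica is treated as the leader. All names are hypothetical.

```python
import zlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replicas_for(partition: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Place replicas on consecutive nodes in a ring; the first one acts as leader."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [nodes[(partition + i) % len(nodes)] for i in range(replication_factor)]


def leader_for(key: str, nodes: List[str], num_partitions: int,
               replication_factor: int) -> str:
    """Return the node a request for this key should be routed to."""
    partition = partition_for(key, num_partitions)
    return replicas_for(partition, nodes, replication_factor)[0]


nodes = [f"node-{i}" for i in range(10)]  # assumed 10-node cluster
target = leader_for("user-42", nodes, num_partitions=10, replication_factor=3)
print(target)
```

With 10 partitions, 10 nodes and a replication factor of 3, this placement gives every node exactly 3 partitions. A real implementation would also have to handle leader failover and rebalancing when nodes join or leave, which this sketch ignores.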

patriknw commented 9 years ago

That is a good point. So far this project has had a limited scope to keep it reasonably simple. Replicating all data everywhere, instead of partitioning it, is one of the limitations. Keeping data in memory without durability is another.

It is suitable for small data rather than big data.

That said, I find partitioning of the data an interesting future enhancement. Thanks for the feedback.

fountainheadpro commented 9 years ago

Thanks for the reply. Could you clarify what you mean by "In memory data without durability"? Are you referring to the need to integrate with akka-persistence to enable event sourcing and snapshotting?

patriknw commented 9 years ago

The data is not stored to disk. That means that if all nodes are stopped and the whole cluster is restarted, it will start up with an empty state.

patriknw / akka-data-replication

State partitioning #59