spring-projects / spring-statemachine

Spring Statemachine is a framework for application developers to use state machine concepts with Spring.

Questions around distributed state machine / zookeeper. #599

Open Milesy opened 6 years ago

Milesy commented 6 years ago

I have been looking at the class ZookeeperStateMachineEnsemble

https://github.com/spring-projects/spring-statemachine/blob/v2.0.2.RELEASE/spring-statemachine-zookeeper/src/main/java/org/springframework/statemachine/zookeeper/ZookeeperStateMachineEnsemble.java

This class appears to support multiple state machines (join / leave), but it uses a single state context as its reference point:

private final AtomicReference stateRef = new AtomicReference();

In that case, I have to assume this class is only suitable for distributing a single state machine (whereas my factory creates many)?

What is "joining" - is it the distributed instances of that single state machine, all identified by a single context?

@jvalkeal

jvalkeal commented 6 years ago

I'd be fairly careful about using it, as I haven't seen statemachine/zookeeper used that much. Joining is the same term as in ZooKeeper: a machine joins an ensemble and then state is synchronized. If you scroll all the way down in the reference docs, you can find a short technical paper I originally wrote describing what happens when machines come and go, together with network splits, etc. I think that paper might give you better answers than what I can write here.

Milesy commented 6 years ago

@jvalkeal I'm actually investigating how I might borrow from its concepts to create a distributed system with Pivotal Cache / Gemfire backing it.

At the moment our application is made up of many state machines, all of which are persisted, but we have no way of being able to go hot-hot without having a way to coordinate states between instances.

Looking through the code, I have found a potential issue for me right from the get go:

org.springframework.statemachine.ensemble.DistributedStateMachine.LocalStateMachineInterceptor#preStateChange

@Override
public void preStateChange(State<S, E> state, Message<E> message, Transition<S, E> transition,
        StateMachine<S, E> stateMachine) {
    if (log.isTraceEnabled()) {
        log.trace("Received preStateChange from " + stateMachine + " for delegate " + delegate);
    }
    // only handle if state change originates from this dist machine
    if (message != null && ObjectUtils.nullSafeEquals(delegate.getUuid(),
            message.getHeaders().get(StateMachineSystemConstants.STATEMACHINE_IDENTIFIER))) {
        ensemble.setState(new DefaultStateMachineContext<S, E>(transition.getTarget().getId(),
                message.getPayload(), message.getHeaders(), stateMachine.getExtendedState()));
    }
}

When setState is called on the ensemble for an existing state machine within preStateChange, the state machine ID is not populated into the StateMachineContext. Without it, how can I do anything with my back-end persistence / caching if I cannot relate the context to a record?
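To illustrate why the ID matters, here is a minimal sketch of a store keyed by machine ID. It does not use the Spring Statemachine API: IdentifiedContext and ContextStore are hypothetical names, and a ConcurrentHashMap stands in for whatever cache region or table would actually back it.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a state machine context that also carries the machine ID.
final class IdentifiedContext {
    final String machineId; // the piece that is missing from the context built in preStateChange
    final String state;
    final String event;

    IdentifiedContext(String machineId, String state, String event) {
        if (machineId == null || machineId.isEmpty()) {
            // Without an ID there is no key to look the record up by.
            throw new IllegalArgumentException("machineId is required to locate the record");
        }
        this.machineId = machineId;
        this.state = state;
        this.event = event;
    }
}

// Hypothetical backing store; the map is an assumption standing in for a cache or database.
final class ContextStore {
    private final Map<String, IdentifiedContext> records = new ConcurrentHashMap<>();

    // Insert or overwrite the record for this machine.
    void save(IdentifiedContext context) {
        records.put(context.machineId, context);
    }

    // Look up the latest context saved for a machine, if any.
    Optional<IdentifiedContext> find(String machineId) {
        return Optional.ofNullable(records.get(machineId));
    }
}
```

With many machines created by one factory, each save overwrites only the record for its own machine ID. A context without an ID could only ever map to a single shared record.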

Milesy commented 6 years ago

Also, given that Gemfire handles all its clustering itself, I don't see why the state machine should handle clustering in that case?

jvalkeal commented 6 years ago

Ah right, that's fair enough.

I'm not sure about the preStateChange issue you asked about; it might have been an oversight. This integration really doesn't handle clustering itself. It just has a thin abstraction to integrate with ZooKeeper, keeping the distributed state there (and trying to dispatch events to other machines). Originally I did have some plans to study whether similar things would work with Gemfire as well, but I never got that far. I remember discussing it with colleagues who know Gemfire much better, and I remember that it is difficult to build the same type of consistent storage on Gemfire as ZooKeeper provides, unless the Gemfire configuration is totally opinionated for this particular use case.

Milesy commented 6 years ago

Thanks @jvalkeal

Looking at DistributedStateMachine, there is also logic within postTransition, but I cannot figure out where it gets invoked; my breakpoints are not being hit.

In AbstractStateMachine I do not see any calls to it, whereas there is one for the following:

getStateMachineInterceptors().preStateChange(state, message, transition, stateMachine);

And as far as I can see, there is no way for me to replace the inner interceptor that is in DistributedStateMachine.

Really I am just trying to prove or disprove this as a viable method to sync state across instances, in comparison to just rolling my own by saving each state down to persistence, and then on a new instance startup, syncing from persistence and then listening to data changes.
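The "roll my own" option described above could be sketched roughly as follows. This is an illustration under stated assumptions, not Spring Statemachine or Gemfire code: SharedStateStore and AppInstance are hypothetical names, an in-memory map stands in for the persistence layer, and change notification is a plain callback rather than a real cache listener.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Hypothetical shared persistence layer; a real deployment would use a database or cache.
final class SharedStateStore {
    private final Map<String, String> stateByMachineId = new HashMap<>();
    private final List<BiConsumer<String, String>> listeners = new CopyOnWriteArrayList<>();

    // Persist the new state, then notify every subscribed instance.
    synchronized void save(String machineId, String state) {
        stateByMachineId.put(machineId, state);
        for (BiConsumer<String, String> listener : listeners) {
            listener.accept(machineId, state);
        }
    }

    // Register the listener and return a snapshot in one step so no update falls in between.
    synchronized Map<String, String> subscribe(BiConsumer<String, String> listener) {
        listeners.add(listener);
        return new HashMap<>(stateByMachineId);
    }
}

// Hypothetical application instance holding a local view of all machine states.
final class AppInstance {
    private final Map<String, String> localView = new ConcurrentHashMap<>();
    private final SharedStateStore store;

    AppInstance(SharedStateStore store) {
        this.store = store;
        // Startup: listen for later changes, then seed the view from persistence.
        Map<String, String> snapshot = store.subscribe(localView::put);
        // putIfAbsent keeps any value a listener callback delivered after the snapshot was taken.
        snapshot.forEach(localView::putIfAbsent);
    }

    // Save a state change down to persistence; listeners propagate it to every instance.
    void transition(String machineId, String newState) {
        store.save(machineId, newState);
    }

    // Read the locally cached state for a machine, or null if it is unknown.
    String stateOf(String machineId) {
        return localView.get(machineId);
    }
}
```

The sketch deliberately has no conflict resolution (last write wins). That is only tolerable under the stated assumption that two users rarely operate on the same record at once.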

The system is unlikely to have multiple people performing operations on the same records, it is just so we can run across more than one data centre in a hot-hot scenario and for users to see the same state of their workflow on interrogation.

Is there anything else you might suggest I look at?

thanks

jvalkeal commented 5 years ago

DefaultStateMachineExecutor is the one calling postTransition. Other parts were not designed to get modified.

It looks like your use case is better served by a plain database. I think full-blown workflow engines are always backed by a database. Since you are also planning hot-hot setups, you may want to read GitHub's explanation of what went wrong with their databases.