Add re-frame integration that allows one machine to manage several states

mainej commented 2 years ago

This creates another re-frame integration that stores both machines and states in re-frame's app-db. This lets a single machine transition many states and avoids memory leaks when there are hundreds or thousands of states.

I'll summarize here, but for a much more detailed discussion of the problem see here and here. /cc @ingesolvoll

In the original integration, statecharts.integrations.re-frame/integrate adds event handlers that close over both a machine and a path. The path refers to a single state that is stored within the app-db. The event handlers can manage that one state, but if you want many states you run into problems.

In my case I'm using state machines via glimt to track the lifecycle of thousands of separate HTTP requests. Each request has its own state, and there can be many requests in-flight at once. To monitor the progress of each request, I have to create a different machine for each one. I pass those machines to glimt, which calls statecharts.integrations.re-frame/integrate for each one. But, since that creates new event handlers, and the handlers are stored in re-frame's global handler registry, I end up leaking the memory associated with the handlers and their machines and states. HTTP requests are usually short lived, so I end up with lots of memory consumed for machines and states I no longer care about.

The code in this PR takes a different approach. First, you register a machine, which gets saved in the app-db. Then, you initialize a state, saying where in the app-db you want it to be saved. Over time you can initialize many more states, all using the same machine. Later, you can transition a state by saying where it is in the app db, where its machine is, and what transition you want to call on it. I believe this approach is similar to the concept of "services"—it uses the app-db as a stateful container for immutable states and machines.

With these changes, it's easier to manage memory:

No new handlers are registered when you register a machine.
A machine can be used many times, instead of once per state.
There are tools to remove states and machines from the app-db when they are no longer needed.

A few more notes on the implementation:

If you'd like to see how I plan to use this integration, see these proposed changes to glimt. I've been working with @ingesolvoll, the maintainer of glimt to define this approach. Also see this gist which demonstrates how the new glimt code would be used.
This code includes a change to delayed transitions in clj-statecharts.impl, which allows the scheduler to be on either the state (new) or the machine (as before). This lets each state manage how it will be resumed. This should be a non-breaking change. Even if you don't want to maintain the code in this PR as a whole, would you consider merging this part? Then I could maintain this integration in a separate library.
Related to the above point, I marked the scheduler as internal to the state (by naming it :_scheduler). Please correct me if that's wrong.
I love the clj-statecharts docs! They're so clear and detailed. ❤️ If you're interested in merging this PR, I'd be willing to write docs for the new integration. Of course, any guidance about how to write the docs or where to put them would be appreciated.
I noticed a mismatch between the version of malli used in project.clj compared to the one in deps.edn. I think that will cause problems with the next release of clj-statecharts. I'd be happy to help figure out how to make that more maintainable, but that belongs in a separate Issue.

lucywang000 commented 2 years ago

Hey @mainej thanks for all these discussion and code!

I have read all your discussion with @ingesolvoll in the glimt project, and I think the problem statement is fair enough: the current implementation of the re-frame integration makes it difficult to reuse the same machine to manage multiple states stored in different locations of the app-db, even though they share exactly the same machine definition.

In some OOP analogy, it's just like instead of being able to create multiple objects of from the same class definition, the class has to be redefined for each object.

lucywang000 commented 2 years ago

I noticed a mismatch between the version of malli used in project.clj compared to the one in deps.edn. I think that will cause problems with the next release of clj-statecharts. I'd be happy to help figure out how to make that more maintainable, but that belongs in a separate Issue.

I'm not sure what you see, but in both of the files the malli version is 0.6.1

lucywang000 commented 2 years ago

I have think through this for quite some time (I read the whole thing yesterday, but only have time to reply to you guys today), and tbh I think the patch is overly complex. The thing that we want to fix is we have to create multiple machines and multiple re-frame event handlers for the same use case again and again. Because we basically binds one pair of event handlers (init + transition) to one path in app-db.

So what about instead of passing in a hard-coded path in fsm.rf/integrate, we do not provide the path during the call to integrate, and only later pass the path as a param of the init and transition events? This takes basically the same approach as your current code, but is way more easier to apply to the current implementation IMHO.

I have drafted a rough prototype in this commit: https://github.com/lucywang000/clj-statecharts/commit/fc330a9096f46be25fda73120457b216cf00f370. This involves some changes to the init and transition event param signature, so need to think about how to keep the backward compatibility.

What do you think @mainej @ingesolvoll ?

mainej commented 2 years ago

@lucywang000 your analysis is correct, the problem arises from binding a machine to one path. Your draft implementation in fc330a9 looks like it would fix that problem, almost. But what about delayed transitions? The rf/dispatch callback in the scheduler doesn't use the new path argument. It would dispatch the machine's transition-event, but with the wrong number of arguments.

When you register the machine, there's no path in scope. The path is specific to one state, so it can't live on the machine. That's why I changed clj-statecharts.impl/execute-internal-action to allow the state to carry the scheduler, not just the machine. (And why I assoc the scheduler on the state as the state is initialized, because the path is in scope then.) Let me know if you see another way to do it.

Otherwise, looks great! I could definitely make your proposal work.

Thanks for looking at this!

mainej commented 2 years ago

I'm not sure what you see, but in both of the files the malli version is 0.6.1

Ah, you're right @lucywang000, sorry about that. There's a new release of malli 0.6.2, which fixes a bug (which I see you discovered! 🥇) in 0.6.1.

ingesolvoll commented 2 years ago

You both seem to have a pretty good overview of the situation, so I'll just bring in the thing I care the most about.

It's pretty easy for a project or library to roll its own version of the re-frame integration if necessary. I already did that for kee-frame. The only thing that is critical in that sense is that the core library has the right building blocks available.

The one thing I missed when doing the integration with kee-frame was the ability to optionally clean up resources (event handlers, state) when the machine is no longer in use. That bit is particularly important when there's a very high number of machines in rotation.

I'm not sure if thousands of event handlers as opposed to 2 is a performance issue for an app, but at some scale maybe?

mainej commented 2 years ago

@ingesolvoll, that's a good point. If you generate lots of machines you can still run into problems with fc330a9.

As we've discussed before, technically you can unregister event handlers. So, with the original integration and with fc330a9 there is a way to release the memory held by a machine, by unregistering its handlers. But, it can cause errors as callbacks and delayed events dispatch to deleted event handlers.

To summarize:

	sc.i.re-frame	fc330a9	sc.i.re-frame-multi
# of handlers	2 * the number of states	2 * the number of machines	3 primary + 4 utility
Memory used by handlers	size of a machine * number of states	size of a machine * number of machines	n/a
Memory used by app-db	size of states	size of states	size of states + size of machines
Machine memory can be released easily	❌	❌	✔️
State memory can be released easily	❌	✔️	✔️

lucywang000 commented 2 years ago

But what about delayed transitions? The rf/dispatch callback in the scheduler doesn't use the new path argument. It would dispatch the machine's transition-event, but with the wrong number of arguments.

When you register the machine, there's no path in scope. The path is specific to one state, so it can't live on the machine. That's why I changed clj-statecharts.impl/execute-internal-action to allow the state to carry the scheduler, not just the machine. (And why I assoc the scheduler on the state as the state is initialized, because the path is in scope then.) Let me know if you see another way to do it.

That's a good point. We can fix it by allow passing the state when calling schedule, so the dispatch function could make use o it:

 (deftype Scheduler [dispatch ids clock]
   IScheduler
-  (schedule [_ event delay]
-    (let [id (clock/setTimeout clock #(dispatch event) delay)]
+  (schedule [_ event delay state]
+    (let [id (clock/setTimeout clock #(dispatch event state) delay)]
       (swap! ids assoc event id)))

During fsm init, we simply store the path in app-db in the state, e.g. under the _rf-path key. Now the re-frame scheduler could get the path information when scheduling the event:

 (defn make-rf-scheduler [transition-event clock]
-  (fsm.d/make-scheduler #(rf/dispatch [transition-event %]) clock))
+  (let [dispatch (fn [{:keys [_rf-path] :as state} event]
+                   (rf/dispatch [transition-event [path event]]))]
+    (fsm.d/make-schedule dispatch clock)))

I think this way is better because we store simple data rather than a stateful scheduler object in the app-db.

lucywang000 commented 2 years ago

I would not worry about memory bloat when there are "thousands of different machines thus thousands of re-frame handlers and machines" in the same web application. Even if such a huge application exists, it would already consume tons of memory and the extra overhead of clj-statecharts would be relatively negelectable.

mainej commented 2 years ago

Yeah, I think I follow @lucywang000. Seems like a reasonable approach. And I agree—storing a scheduler in the app-db, especially many schedulers, is kind of gross.

I guess I always worry about changing protocols. You never know who's implemented them. But I suppose IScheduler isn't really a public interface.

lucywang000 commented 2 years ago

But I suppose IScheduler isn't really a public interface.

Among all the not-so-many user of this lib, there is unlikely anyone who has used this hidden interface.

lucywang000 commented 2 years ago

btw I have just cut a 0.1.1 release to update the version of malli. Thanks @mainej for pointing this out.

ingesolvoll commented 2 years ago

I did not study the code in these PRs thoroughly enough as I'm having a rather intense week, but from what you are saying:

Storing the scheduler or anything that's not pure data in the client db can be a big problem. I know the org I'm working in right now would not consider a library that does that.

But then again, as long as the re-frame integration is completely external to the machine code itself, that isn't really an issue.

mainej commented 2 years ago

allow passing the state when calling schedule

Thinking about it more... this is they key. If the state were passed when dispatching the scheduled function, then I (or someone else) could maintain the code in this PR as a a separate library. The existing re-frame integration could remain unchanged. You could modify it as described in fc330a9 if you wanted to, but a custom integration wouldn't need that. (Though, putting on my devil's advocate hat, fc330a9 would be a breaking change because all users of the integration would have to change their transition-event dispatches.)

It sounds like you already have that change to the scheduler written. But, if you'd like, I can submit a separate PR. Let me know what you prefer.

mainej commented 2 years ago

I've been playing with this, and it's not working. If two delayed transitions start at around the same time, the first one works, but the second one's transition gets lost. I think I've tracked down the reason why.

In the code below...

(deftype Scheduler [dispatch ids clock]
  IScheduler
  (schedule [_ state event delay]
    (let [id (clock/setTimeout clock #(dispatch state event) delay)]
      (swap! ids assoc event id)))
  (unschedule [_ event]
    (when-let [id (get @ids event)]
      (clock/clearTimeout clock id)
      (swap! ids dissoc event))))

There's a problem with saving the setTimeout id keyed by the event. If two states enter the same delayed transition at the same time, they will both use the same event. So, the setTimeout id of the first one is replaced by the id of the second one. When the first one's transition happens, it ends up calling (fsm.d/unschedule scheduler event). But, this does the wrong thing now: it cancels the second one's timeout by calling (clock/clearTimeout clock id).

I'm not sure how to fix that. Any ideas?

ingesolvoll commented 2 years ago

@mainej I think it would make sense to have 2 PRs. One for the changes needed to the core in order to support any kind of integration. And one for re-frame integration fixes.

mainej commented 2 years ago

@ingesolvoll agreed. I've opened https://github.com/lucywang000/clj-statecharts/pull/8 so we can discuss the core changes separately from the integration changes.

@lucywang000, as mentioned in #8, I've created https://github.com/mainej/clj-statecharts-re-frame. It uses the code in #8 and includes a failing test demonstrating the problem I described above.

Shall we move this conversation to #8?

lucywang000 commented 2 years ago

If two states enter the same delayed transition at the same time, they will both use the same event

Well, we can simply fix that by using (_rf-path, event) as the key into the Scheduler.ids map.

Ofc this would require adding the state to the IScheduler.unschedule method as well.

update: I see you already do similar things in #8, I'll comment there.

mainej commented 2 years ago

Closing this, since we agreed in #8 that any re-frame integrations that use that new functionality can be maintained externally.

lucywang000 / clj-statecharts

Add re-frame integration that allows one machine to manage several states #7