[WIP] Add connection resume, client state tracking, robust reconnect logic, reconnecting hooks, more

JamesHutchison commented 8 months ago

_{By submitting this pull request you agree that all contributions to this project are made under the MIT license.}

Issues

While in a discussion on the discord, it became apparent that ReactPy was unusable for any application where dynamic scaling is required. For example, in cases where Azure/GCP will start/stop workers for dynamic load balancing. As a result this, users will be disconnected randomly and then would have the state of the ReactPy components reset.

There's also other issues I encountered while building this. Please see my comments (will be added soon) which walk through this very large PR. The intent is not to have this PR merged in all at once, rather, it is a guide to subsequent smaller PRs that can properly add tests, etc. I do not have the time to properly land this PR, so I am asking for help.

Solution

The solution to the problem was to have the client store the state that's used for rendering components. The theory is that if the client state matches the server state, then things should work just fine.

In reality, there was some challenges to get to that. The first was that IDs of components and events were random instead of deterministic, so this needed to be changed. Now the IDs are based on their patch path.

The next concern is security. While the state isn't obvious to most users, without proper security a client could manipulate variables that an author would have assumed could only be set by the server. An example might be is_admin or something, or maybe the user ID of user viewing the page. Another issue is that server secrets might leak to the client if someone isn't careful.

For security, all data given a sha-256 hash as a signature. The server will reject a state variable if it doesn't match the hash. Users are given a personal salt, which they must provide to the server upon reconnection, and the server has a secret "pepper" that is added to make things more difficult. I wasn't satisfied with this, so I added OTP codes that will change, by default, every 4 hours. The OTP codes use site-packages directory's contents (specifically, the parent directory of where reactpy is installed) and the create times to generate a random secret that feeds into the SHA-1 function that generates them. The OTP codes (right now, in the past, and in the future) are added to the hash just to make it that much harder.

All state values have a key that is derived from the file and line number. To avoid leaking this information via brute force, the key is a sha-256 that has about half the bits cut off, making it difficult to recreate.

The client behavior has been revamped. It now will always reconnect, has improved reconnect behavior, and also will disconnect on idle activity (default is 4 minutes). When the user moves the mouse or scrolls, it will reconnect. There's also hooks for when it is reconnecting and has successfully reconnected. The default behavior is to gray out the page and display a spinning pipe character. It's not great.

The performance impact appears to be negligible.

Checklist

The checklist will be for each individual feature that should be a separate PR. TBD

Work items remaining (this PR):

[x] Add serialization for datetime types such as timezone and timedelta. This logic is currently in Heavy Resume but makes sense to just give to everyone.
[x] Fix the server outrunning the client by moving the message handlers out of the tsx file and into the javascript file
[x] Comment on the PR on everything

Extracted Issue List

[ ] - A - bug fix for when server would return "layout-update" before the client was ready.
[ ] - B - bug fix for inconsistent script rendering that resulted in double execution of scripts. Scripts also always are returned to the DOM for user inspection rather than executed directly.
[ ] - C - Message updates
[ ] - D - Have client keep track of state variables (the big one)
[ ] - E - add ability to debug messages sent and received on the client
[ ] - F - Add idle disconnect time to client
[ ] - G - Add connection timeout so that if you connect to a bad node the client doesn't stay stuck to it
[ ] - H - Improve message types in client
[ ] - I - switch to orjson for faster serialization
[ ] - J - Bug fix for corrupted hook state - change hook state to a contextvar
[ ] - K - Usage of slots
[ ] - L - add priority queue to component rendering - not functioning as expected currently
[ ] - M - messages should be a class and not a typed dict
[ ] - N - Use of protocols (substantial performance and usability issues)
[ ] - O - don't use hasattr in the manner identified

Archmonger commented 8 months ago

This is definitely way too much within one PR. Can you break this into smaller PRs?

JamesHutchison commented 8 months ago

Assuming this headache of mine doesn't turn into something worse, I'll add comments tomorrow. I don't have the capacity to properly break this out, create / update tests, etc for the changes and I'm asking the community for help to land these features. This PR is intended to be a starting point and is not intended to be merged.

Archmonger commented 8 months ago

Also just to re-iterate what I said over discord, I've only experienced WS disconnections under high load when ReactPy is not using a BACKHAUL_THREAD.

A potential alternative to client-side state storage is simply migrating that implementation to core.

JamesHutchison commented 8 months ago

It's not clear to me how backhaul threads are equivalent to this. Do you have a doc that explains the architecture? My impression from your description was that it helped with stability in django but it wouldn't help if your node count was scaled down and you still had active connections on the terminated node (same with a client's internet disconnecting / reconnecting).

I think having the client manage their own state is a perfect solution. It very much simplifies the infrastructure and reduces costs. There's a slight delay due to the copy of data but since you're reconnecting there's going to be a delay anyways.

JamesHutchison commented 8 months ago

My suggestion on approach:

Keep the javascript client updates and fixes as a single PR, even though it resolves multiple issues. It either needs to be backwards compatible or could merge into the PR that does the server side of Issue D
Some of these are small but big wins. I would do them first. For example, fixing the performance problem caused by the protocols isinstance checks is only a couple lines of code
Some are worth discussing before action. The priority queue sounds good but doesn't actually work as expected. The render times are also really small so the value of a priority queue is questionable. It might be useful in the future though.
Measure using a tool like py-spy or something to get performance bottlenecks rather than making any assumptions. I thought the sha256 calls would show up somewhere but they, and their parent functions, haven't.

JamesHutchison commented 8 months ago

Example usage that supports Pydantic / SQLModel objects:

def default_serializer(obj: Any) -> Any:
    if isinstance(obj, BaseModel):
        return obj.dict()
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

state_recovery_manager = StateRecoveryManager(
    serializable_types=[
        TemporaryUser,
        QAMessage,
    ],
    pepper="pepper",
    default_serializer=default_serializer,
)

configure(app, Main, Options(head=head, cors=cors_options), state_recovery_manager)

JamesHutchison commented 8 months ago

If you want to see this in action, it is live on heavyresume.com

Inactivity timeout is set to 4 minutes. You can see it happen in the console. If you scroll down and let it time out, you'll notice that it doesn't regenerate the view on reconnect. Moving the mouse or scrolling triggers reconnection.

Thanks!

JamesHutchison commented 8 months ago

One thing I'd like to bring attention to - if the layout logic is rewritten, please leave some means to re-render all components of a certain type. Next best thing is some means to trigger a whole page re-render from the client state. The use case would be hot reloading. Right now I can successfully use jurigged to hot reload - however unless the component has a toggle or something to force a re-render, I have to reload the page.

edit: this is no longer true with my PR below. That PR will disconnect and reconnect and force a rerender.

JamesHutchison commented 8 months ago

This PR (on my fork) leverages the logic from this PR to add hot reloading. I will not be merging it into my main branch because I don't want to clutter things up further, and I'm fine just checking out that one branch in the mean time.

https://github.com/JamesHutchison/reactpy/pull/12

JamesHutchison commented 6 months ago

This linked PR fixes a bug where use_effect(my_func, []) was rerunning after the next reconnection. The fix from that PR applies to this one.

https://github.com/JamesHutchison/reactpy/pull/16

JamesHutchison commented 3 weeks ago

After using the heavy stack for many months, here's the known issues that are currently unresolved and may be attributed to my changes:

pokey loading on slow connections - seems to be caused by layout updates requiring confirmation from the client before moving forward.
imperfect connection recovery - connection recovery for larger states always fail to rebuild in time resulting in destructive reconstruction that can result in state loss for text boxes that have unsaved changes
input hell - inputs and components that have matching DOM paths aren't recreated unless the key changes or the DOM structure leading up to them changes. This means, for example, if you change a selected list item, the inputs won't update even though the parameters into the component do
incomplete component updates - in some cases the render order of components results in child components not rendering after the parent finishes. For example, a parent triggers a child render, the parent is then rendered, perhaps due to an event or state change, such that the child needs a rerender, but the child is marked as having already rendered and does not render again.

I'm pretty sure the input hell stems from the change that gives state a deterministic target based on the DOM path. This change was needed to support state rebuild.

The rest may be existing ReactPy issues.

reactive-python / reactpy