realm / realm-core

Core database component for the Realm Mobile Database SDKs
https://realm.io
Apache License 2.0
.lock file should not contain information about sync agents #6957

Open nirinchev opened 1 year ago

nirinchev commented 1 year ago

If a sync process crashes or is otherwise terminated, the SharedInfo stored in the lockfile will keep the sync_agent_present flag raised, meaning other processes won't be able to open the Realm file until the lockfile is deleted. Instead, we should devise a mechanism that ensures that when a process is terminated, there are no leftover flags and new processes can start using the file again.

@fealebenpae suggested storing an empty file in the management directory which is owned by the sync agent process and is released when the process is terminated, allowing for other processes to take over ownership, but other approaches are similarly valid.
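For illustration, the ownership-file idea could be sketched roughly as follows. This is a minimal sketch, not realm-core code: it assumes a POSIX system and uses an advisory `flock` lock, and the function names `try_become_sync_agent` and `release_sync_agent` as well as the lock file path are hypothetical. Because the kernel drops the lock when the owning process exits for any reason, no flag is left behind after a crash.

```python
import fcntl
import os


def try_become_sync_agent(lock_path):
    """Try to claim the sync agent role by locking an (empty) ownership file.

    Returns an open file descriptor on success, or None if another
    open handle (normally another process) already holds the lock.
    Hypothetical helper, for illustration only.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock: fails immediately if already held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_sync_agent(fd):
    """Give up the sync agent role; closing the descriptor drops the lock."""
    os.close(fd)
```

A process would call `try_become_sync_agent` when it opens the Realm and keep the returned descriptor for as long as it acts as the sync agent. On Windows a different primitive would be needed (for example a file opened without sharing), which this sketch does not cover.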

This would solve the underlying issue causing the crashes reported in this ticket: https://github.com/realm/realm-dotnet/issues/3437.

tgoyne commented 1 year ago

The lockfile only stays alive as long as the Realm file is open in at least one process, so this is only a problem in the very specific scenario of opening the Realm in a non-agent process and holding the Realm open while the agent process crashes and then restarts. Multiprocess sync will make this whole problem go away entirely, so it doesn't seem worth trying to fix this edge case.

nirinchev commented 1 year ago

I agree - it's an edge case, but we've seen it reported multiple times, especially on Windows/Unity scenarios where the developer keeps the Realm open in Studio and restarts their application multiple times during the development process.

If multiprocess sync is coming in the near future, that will definitely be a preferable solution, and we can close this in favor of just not throwing. We'd still need to make sure that whatever mechanism we devise for coordination between the sync agents accounts for process crashes, so that we don't end up in a similar situation where a terminated sync agent is still considered the primary one.
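As a rough way to check the crash requirement, the sketch below starts a child process that takes an exclusive lock on a file, kills it without any cleanup, and then checks whether the lock is free again. It assumes a POSIX system with `flock`; the names `lock_is_free` and `simulate_agent_crash` are hypothetical, and the child stands in for a sync agent rather than being one.

```python
import fcntl
import os
import subprocess
import sys

# Stand-in for a sync agent: take the lock, report it, then hang.
CHILD = r"""
import fcntl, os, sys, time
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o644)
fcntl.flock(fd, fcntl.LOCK_EX)
print("locked", flush=True)
time.sleep(60)
"""


def lock_is_free(path):
    """Return True if an exclusive lock on path can be taken right now."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)  # closing also drops the probe lock if we got it


def simulate_agent_crash(path):
    """Return (held_while_alive, free_after_kill) for a killed lock holder."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, path],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()  # wait until the child reports it holds the lock
    held = not lock_is_free(path)
    child.kill()  # abrupt termination, no cleanup code runs in the child
    child.wait()
    child.stdout.close()
    return held, lock_is_free(path)
```

If both returned values are true, the lock was held while the child was alive and became available once it was killed, which is the property a coordination mechanism between sync agents would need.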

tgoyne commented 1 year ago

Dealing with the sync agent being suspended or terminated is the primary hard part of multiprocess sync and is why I'm expecting it to take a few months.

We currently haven't found a robust way to handle cleanup when one process in a session crashes, even outside of sync. If a process crashes while holding the write lock, usually the only thing that gets cleaned up is the write lock itself, which is released. Any versions used by the crashed process will be leaked until the session ends.

sync-by-unito[bot] commented 10 months ago

➤ Jonathan Reams commented:

Not sure where this ticket stands right now. [~thomas.goyne@mongodb.com], will this be done as part of your multi-process sync work? Is there other work outside that project that needs to be done here?