Closed PVince81 closed 3 years ago
let's go with approach 2
as far as I can see, even when choosing "Leave conversation" we don't touch the URL nor send "DELETE .../participant/active" and the signaling server tells the frontend that the room does not exist any more
so doing the same approach for "Delete conversation" would align the behavior with that
as discussed with @nickvergessen we'll also go with approach 4, PR here: https://github.com/nextcloud/spreed/pull/5738
Steps to reproduce
Expected behaviour
Redirect to the main page, not "not-found" page and the room is deleted.
Actual behaviour
Sometimes, the room is not deleted and the console returns a 404 error for the DELETE operation.
Talk app
Talk app version: 11.2.1
Custom Signaling server configured: yes
Browser log
Server configuration
Nextcloud Version: 21.0.2
Server log (data/nextcloud.log)
Nothing relevant.
Analysis
This was reported to me internally, I was not able to reproduce this locally.
I've dug into the code and it looks like there are two requests happening in parallel when deleting a conversation: 1) switching to the default route triggers a DELETE to /room/$token/participant/active leading to https://github.com/nextcloud/spreed/blob/v11.2.1/lib/Controller/RoomController.php#L1659 2) deleteConversation call triggers a DELETE to /room/$token
The code does not wait for 1) to finish and directly triggers 2).
After adding more extensive logging on the target env where the problem occurs, I noticed that the 404 for the deletion actually does find the room, but fails to find the participant here: https://github.com/nextcloud/spreed/blob/v11.2.1/lib/Middleware/InjectionMiddleware.php#L156
It looks like depending whether the session is still there, which correlated with request 1), it might enter the if or else statement, using a different approach for finding the participant. A race condition could cause the session id to not exist any more in the database while it still existed at the time of retrieval from the PHP session.
I did more local tests with logging in both code paths and noticed that the one or the other branch was randomly picked, which confirms that there is some race condition in place. However in my local setup it never resulted in a 404.
From my understanding, it should never happen that the session id can be deleted in one PHP request while the other one is still reading it. Unless maybe the PHP session handling was configured differently on the target system...
Solutions
Approach 1: wait for request 1) to finish
Currently not possible because of the way the routing + onRouteChange events are done, these are not async/await-compatible on the event bus, so we can't wait for the "leaveConversation" to finish.
Approach 2: discard request 1)
I've tried this locally and it results in the signaling server telling the frontend that the session is gone, so it redirects to "not-found" instead of the root page. Not too bad, could be a way.
Approach 3: wait for request 2) and then run request 1)
Doesn't make much sense as it results in the same result like approach 2 depending on the speed at which the request 1) is sent.
Approach 4: add fallback in the backend
Modify https://github.com/nextcloud/spreed/blob/v11.2.1/lib/Middleware/InjectionMiddleware.php#L156 and whenever the participant cannot be found with session id, fall back to getting it without the session id then.
Drawbacks: since this is a common code path used by all controller methods, it could cause side effects.
Also I'm not sure if this is a good idea at all or whether it breaks semantically in some places.