Reconnect, if the current node has lost its connection

mariano / disque-php

PHP library for Disque, an in-memory, distributed job queue

MIT License

132 stars 19 forks source link

Reconnect, if the current node has lost its connection #31

Closed Revisor closed 8 years ago

Revisor commented 8 years ago

When switching nodes, the prioritizer can recommend to stay on the current node. The current node however could have lost its connection in the meantime. In that case we will try and reconnect, or if not possible, connect to the next best possible node.

The check whether the current node has disconnected has been moved so that it is checked every time we execute any command, not just after the GETJOB command.

I tried to write a test for this situation, but ended up with a brittle monster of a test depending on the order of method calls among several mocks. :-/

mvrhov commented 8 years ago

The fix is better than my, but there is still a fundamental problem. Calling connect method should establish connection and by removing my fix, it won't.

Revisor commented 8 years ago

Thank you for having a look at the proposal. Can you please explain under what circumstances calling the connect() method can land us without a working connection (other than if no node is available at all)?

Finding the first working node is the purpose of findAvailableConnection() and this method has only two outcomes - either it returns a connected node, or it throws an exception if no node works.

mvrhov commented 8 years ago

I described one case in my original ticket.

When adding into the queue the Queue::checkConnected is called, the isConnected there returns false, so the Client::connect is called, which then forwards message to the Connection/Manager::connect, that function that selects a random node and calls $this->switchToNode($currentNode);. The problem in switchToNode is that it tries to be smart and do not connect to a node if we are swittching to the same one. The problem is that this function doesn't know the status of the connection of that particular node.

Revisor commented 8 years ago

Thanks for the great description, now I understand the whole problem at last. I'll update my proposal.

Revisor commented 8 years ago

Here's a new version. I have fixed the bug reported by Miha and improved the reconnection overall:

Reconnection will first try to connect to the nodes in cluster as revealed from the last HELLO command.
If this doesn't succeed, it will try to reconnect to nodes as set in the user-supplied credentials

The reason for this order is that the HELLO most likely contains newer information about the cluster than the initial credentials.

If we reconnect to the same node we were connected before, the new connection is not ignored. This was the original reported bug.
Also if we reconnect to the same node, we keep the node stats (total job count).

And I have added two tests for this. The first one tests reconnection based on HELLO, the second one tests reconnection that falls back to the credentials.

I'm looking forward to your code review and criticism.

mariano commented 8 years ago

@Revisor a total +1 on my side and <3 the mock overdose on the test. Let me know if there's anything else you want to add or if we are good for merging

Revisor commented 8 years ago

I have slept on it, read it again and I have nothing else to add. I'm not sure if that much mocking doesn't make the tests brittle. But we need them, so...