sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

clean up connections on failed node after recovery of failed node #3671

Closed. jdanilson closed this issue 5 months ago.

jdanilson commented 2 years ago

When proxysql detects an issue on a Percona Galera node and fails over to another node, is it possible to clean all connections from the failing node when it returns to service? We want to prevent any possibility of writes on more than one Galera node.

renecannao commented 2 years ago

This is exactly what should happen. The error log should show records of all the actions performed, including killing all the connections to the failed node.

"to clean all connections from the failing node when it returns to service"

I am not sure why you ask to clean the connections when it returns to service. The connections are killed as soon as the server is detected as faulty.
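
To verify this, the current state of each backend can be inspected from the admin interface, and the kill actions appear in the error log (a verification sketch; the default admin port and datadir are assumed, not taken from this thread):

    -- connect to the admin interface (default: mysql -h127.0.0.1 -P6032 -uadmin -padmin)
    -- a backend detected as faulty is marked SHUNNED (or moved to the offline hostgroup)
    -- and its backend connections are dropped
    SELECT hostgroup_id, hostname, port, status
      FROM runtime_mysql_servers;
    -- the corresponding actions are also written to the error log,
    -- by default /var/lib/proxysql/proxysql.log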

jdanilson commented 2 years ago

Rene, thanks for your comments.

We are enthusiastic users of Proxysql, with over 150 hosts running the software. For the most part it does exactly what we want. Perhaps due to our implementation method, we face a continuing problem.

We started using proxysql over two years ago to front our Percona Galera cluster nodes. The first two attempts at deployment failed, as we could never get the applications (using Java or WildFly) to process without errors. Our dev teams and Percona were unable to track down the reasons for the errors, and we gave up. After some further trials we developed a deployment that works 99% of the time. Our deployment for our pxc galera clusters consists of a BIG-IP VIP for reads and a separate VIP for writes. The application only connects using the VIPs.

This configuration allowed us to deploy proxysql with zero changes to the application. The only change made after deploying proxysql on each db node was to change the write vip to send traffic to port 3307 rather than port 3306. The pxc server continues to listen on 3306. It works and has worked well. No one we've talked to likes the co-location but we are forced to do this because we cannot get other hosts to dedicate to proxysql nor can we locate proxysql on the application hosts. (These constraints are not technical, they are bureaucratic and due to the nature of how the government works. I cannot change this.)
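
For context, here is a minimal sketch of how the backends are registered in such a co-located setup via the proxysql admin interface; the hostgroup id, hostnames, and weights below are illustrative, not our exact values (the 3307 listener itself comes from mysql-interfaces in proxysql.cnf):

    -- each pxc node is registered as a backend on 3306 (writer hostgroup 0 assumed);
    -- proxysql itself listens on 3307, which is where the write VIP now points
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
    VALUES (0, 'db1', 3306, 1000),
           (0, 'db2', 3306, 100),
           (0, 'db3', 3306, 100);
    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;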

Our problem occurs when our OS teams patch the hosts, which they do every 90 days in the middle of the night. (This interval is about to decrease to 30 days.) Here is the progression of our issue:

pre-patch state:
host 1 + proxysql1 + db1 <-- writes are here; reads too are here
host 2 + proxysql2 + db2 <-- reads are here
host 3 + proxysql3 + db3 <-- reads are here

patching commences: host 1 is shut down for patching. This works well and our failover has been consistent.
host 1 + proxysql1 + db1: down, no traffic
host 2 + proxysql2 + db2 <-- writes are here; reads too are here
host 3 + proxysql3 + db3 <-- reads are here

host 1 restarts. Both proxysql and pxc are enabled for automatic restart. 
host 1 + proxysql1 + db1 <-- identifies itself as primary writer; new connections begin arriving here
host 2 + proxysql2 + db2 <-- the write connections already present remain here
host 3 + proxysql3 + db3 <-- reads are here

We have found the way out of this is to either

  1. kill the remaining connections on db2, or
  2. raise node 2's weight above node 1's and then kill the connections on db1.


Neither solution is optimal; a rough sketch of the second workaround is shown below.
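
For concreteness, the second workaround amounts to roughly the following (hostnames, the writer hostgroup id, and the application user are placeholders, not values from our setup):

    -- on the proxysql admin interface: make db2 preferred over db1 in the writer hostgroup
    UPDATE mysql_servers SET weight = 10000 WHERE hostgroup_id = 0 AND hostname = 'db2';
    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;

    -- then, directly on db1, list the application connections that came back there
    -- and KILL them one by one ('app_user' is a placeholder)
    SELECT CONCAT('KILL ', id, ';')
      FROM information_schema.processlist
     WHERE user = 'app_user';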

Hence our request: if you fail over, stick there! Or alternatively, since you know connections are now going to the primary writer db1, kill the ones that remain on db2.

Perhaps we are doing something wrong and you will show us the error of our ways. Percona has written a new monitoring script which they have suggested we use, and we are investigating it.

JavierJF commented 5 months ago

Hi @jdanilson,

thanks for sharing your scenario here too. As I also mentioned in my comment on issue #3670, the reason for the current behavior, and for why 'stickiness' after a failover isn't a configuration option, is documented as part of the FAQ. The FAQ also now states that if custom monitoring like this is required, users can always resort to the scheduler to implement their own monitoring solution.
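
For example, a custom check script can be registered through the scheduler table roughly like this (the script path, interval, and argument below are hypothetical, not an official script):

    -- run a user-provided check script every 5 seconds
    INSERT INTO scheduler (id, active, interval_ms, filename, arg1)
    VALUES (1, 1, 5000, '/var/lib/proxysql/galera_failover_check.sh', '--writer-hostgroup=0');
    LOAD SCHEDULER TO RUNTIME;
    SAVE SCHEDULER TO DISK;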

It's also mentioned as part of the Galera Cluster monitoring documentation.

Since this is documented, and there are reasons for this option not to exist, I'm proceeding to close the issue.

Thanks again for sharing your scenario, and for being part of the community. Regards, Javier.