sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

mysql-connect_retries_on_failure not effective for failures like connection refused #4597

Open xykhappy opened 3 months ago

xykhappy commented 3 months ago

For a hostgroup with a single backend, if the backend server goes down and is marked as "SHUNNED", ProxySQL returns the error "Max connect timeout reached while reaching hostgroup" only after the "mysql-connect_timeout_server_max" duration elapses. In contrast, a client establishing a direct connection without ProxySQL gets a "connection refused" error (OS error code 111) immediately, which makes more sense.

After debugging the latest binary, it appears that ProxySQL keeps attempting to get a good connection, at intervals controlled by "mysql-connect_retries_delay", until "mysql-connect_timeout_server_max" is reached, per the method MySQL_Session::handler_again___status_CONNECTING_SERVER:

https://github.com/sysown/proxysql/blob/27e71d29729c67cbaf6beafcb3a3b4eeca7ab9b4/lib/MySQL_Session.cpp#L3177

The retry is not even controlled by "mysql-connect_retries_on_failure" (as it should be), because in the scenario described above the process always falls into this condition:

https://github.com/sysown/proxysql/blob/27e71d29729c67cbaf6beafcb3a3b4eeca7ab9b4/lib/MySQL_Session.cpp#L3244

Debugging further, we find that the method MyHGC::get_random_MySrvC returns NULL on every retry for this single "SHUNNED" backend:

https://github.com/sysown/proxysql/blob/27e71d29729c67cbaf6beafcb3a3b4eeca7ab9b4/lib/MyHGC.cpp#L55

According to this method, ProxySQL should attempt to bring the "SHUNNED" backend back online, but the attempt always fails because "mysrvc->time_last_detected_error" is pushed into the future relative to "mysql-monitor_ping_interval":

                    // if Monitor is enabled and mysql-monitor_ping_interval is
                    // set too high, ProxySQL will unshun hosts that are not
                    // available. For this reason time_last_detected_error will
                    // be tuned in the future
                    if (mysql_thread___monitor_enabled) {
                        int a = mysql_thread___shun_recovery_time_sec;
                        int b = mysql_thread___monitor_ping_interval;
                        b = b/1000;
                        if (b > a) {
                            t = t + (b - a);
                        }
                    }
                    mysrvc->time_last_detected_error = t;

I'm not sure whether the logic is intentionally designed this way. However, the retry behavior should be controlled by "mysql-connect_retries_on_failure" rather than the current timeout-driven loop. Additionally, I believe it would be more helpful to return the exact failure error to the client.

renecannao commented 3 months ago

Hi @xykhappy .

A premise of this issue: if you get error 111 when trying to connect, it means no process is listening on that port. 111 is OS error code 111, "Connection refused", which means the TCP three-way handshake could not complete.

The issue you are facing is actually a feature. In fact, one of the main features of ProxySQL is to handle backend failures and failover in a way that is completely transparent to the clients.

if the backend server goes down, the ProxySQL will return the error "Max connect timeout reached while reaching hostgroup" [...] In contrast, if establishing a direct connection without ProxySQL, the client will get a "connection refused" error or error code 111 immediately, which makes more sense

It doesn't really make sense once you understand the difference between the two setups.

If you connect directly to a backend, you get error 111 because no service is listening: the connection was never established.

If the client connects to ProxySQL, the client's connection (to ProxySQL) was successful. Then, when the client tries to execute a query and no backend is available, ProxySQL is not able to execute the query. Should the client see its successfully created connection being dropped? It certainly can't get error 111, because 111 means "Connection refused", and it is not possible to return to that stage. Maybe (I would argue no) the client could receive a "Lost connection" error.

If a client connects to ProxySQL, runs a query successfully, and the backend is then restarted, would you expect ProxySQL to drop the client's connection? The way ProxySQL operates allows you to restart a backend without any disruption to client connections.

In a nutshell, you are comparing a direct connection to a backend with running a query through ProxySQL. They are two completely different entities (connection vs query) in two completely different setups.

Furthermore, please remember that ProxySQL is a reverse proxy, not a forward proxy. The client sends the query to the proxy, and the proxy returns a result to the client: the client is completely unaware of the backends' involvement and status.

xykhappy commented 3 months ago

Hi @renecannao,

Thanks for the quick response! I completely agree with you regarding the handling of backend failures, and I have updated the issue title to prevent confusion.

To make my point clearer: ProxySQL should offer users the same level of control over the retry logic for the "connection refused" error as it does for other failure types. In general, ProxySQL should honor not only "mysql-connect_retries_delay" but also "mysql-connect_retries_on_failure" when processing such errors. In the current implementation, it keeps retrying until the "mysql-connect_timeout_server_max" duration elapses, which causes a slow failure.

In our scenario, during an Aurora failover, we expect the process to fail fast to prevent connection creation requests from accumulating on the client side and subsequently overwhelming the backend Aurora instance. Conversely, we need relatively large "mysql-connect_timeout_server_max" and "mysql-connect_timeout_server" values (around 5 seconds) so that clients can tolerate a certain level of performance degradation in Aurora. This is a contradiction, since both situations depend on the same "mysql-connect_timeout_server_max" setting.

Hope it makes more sense this time.

xykhappy commented 2 months ago

Created the pull request for this.