Closed sjmudd closed 1 year ago
"talk to thousands of MySQL" is not typical use case of most user of this driver. Most users know the server because they configure the sql.DB.
I won't break backward compatibility only for such rare use case.
By the way, we print actual low level error behind errBadConnNoWrite
into stderr.
Can't you see it?
"talking to thousands of MySQL servers" may not be a typical use case, but you provide the MySQL golang driver, so while I do not expect special handling of my specific use case I do believe the current driver's error handling in this case could be improved. Yes, you print the low level error to stderr sometimes. However, given I'm polling a large number of servers concurrently these log lines may overlap so the logging to stderr is not I think an appropriate solution. The caller can not collect the error and report it appropriately within the application. In many cases you do not report the error so I can't be sure of exactly which of the actual return calls was being triggered and therefore which of the exact specific cases might be happening here. Again this application's primary concern is to connect to a server, identify if it's healthy and do that periodically. The response returned back to the driver, especially since you may be doing lazy connections may appear anywhere a SQL command is called and so the caller can not easily determine what's going on.
Which "backward compatibility" changes do you think this makes?
A colleague said:
It seems other drivers were affected by a similar issue: https://github.com/denisenkom/go-mssqldb/issues/586
This lead to golang upstream changes, merged in golang 1.18: https://go-review.googlesource.com/c/go/+/333949
When drivers return driver.ErrBadConn, no meaningful information about what the cause of the problem is returned. Ideally the driver.ErrBadConn would be always caught with the retry loop, but this is not always the case. Drivers today must choose between returning a useful error and use the rety logic.This allows supporting both.
So my concern right now is still the same:
e.g. A master dies (this happens and is expected) and so we need to choose the most up to date replacement, and then relocate the other replicas under the new master. Often we see the relocation request to a server we'd seen as being healthy < 5 seconds previously fails. A new connection is opened and we get a "driver: bad connection" error. That's all I see and I don't know why the connection failed. That's a real concern. Of course if I log into the server it's working fine.
I can certainly try v1.7.0
of your driver together built with the latest version of golang to see if this helps and report here. I was told that it looks like orchestrator was built on golang 1.16 even though that's a bit old, so the update in the golang version the MySQL driver and database/sql
might trigger a change in behaviour. I can verify that if you think it might make a difference, but it does not entirely help with logging some things to stderr become effectively invisible to the caller. Also the current logging is not easily identifiable against a specific database connection. If it were returned to the user, wrapped in some way, then it would not be your problem: the user would have the information and could "track" which database connection this referred to.
However, given I'm polling a large number of servers concurrently these log lines may overlap so the logging to stderr is not I think an appropriate solution.
If stderr is not good for you, you can use SetLogger.
We can not support 100% use case. In general, it's sysadmin job that reading logs and investigate errors. Applications don't need precise information.
We do not know about orchestrator. We can not fix your errors. We do not know what erorr/log is needed to fix your errors.
Please vendor & patch (or fork) and add anything you need in it. After fixing your problem, please create a pull request to add error/log that is useful for 99% users, not only for you.
Which "backward compatibility" changes do you think this makes?
A colleague said:
It seems other drivers were affected by a similar issue: denisenkom/go-mssqldb#586 This lead to golang upstream changes, merged in golang 1.18: https://go-review.googlesource.com/c/go/+/333949 When drivers return driver.ErrBadConn, no meaningful information about what the cause of the problem is returned. Ideally the driver.ErrBadConn would be always caught with the retry loop, but this is not always the case. Drivers today must choose between returning a useful error and use the rety logic.This allows supporting both.
Thank you for teaching this. We support Go 1.13 currently. But don't wait we drop Go 1.13-1.17 support.
Note that most ErrBadConn errors are consumed by database/sql retry loop. They are not returned to caller unless it happened in the last retry.
So checking log is much better way than trying to retrieve all debug information from error. That's how we generally use in developing this driver. Adding much debug logs that is only needed during debugging.
Hi @methane,
I will look to see at least if for now forking the driver to get a better insight into what is actually happen will give us hints on the problems we are seeing. Yet partially I find it somewhat disconcerting that the error: driver: bad connection
returned to the application is extremely generic. There are likely to be 2 types of error:
In both cases from what I can see there may be multiple "flavours" of this. MySQL natively returns an indication of the MySQL error but often part of the error response provides something that comes from the OS:
$ mysql -h 10.1.2.3
ERROR 2003 (HY000): Can't connect to MySQL server on '10.1.2.3:3306' (60)
$ mysql -S /path/to/nowhere
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/path/to/nowhere' (2)
So you get some information from the MySQL perspective but it might be that the MySQL server gives you more details which come from the OS itself. Perhaps that additional information might be useful when the driver.ErrBadConn
is returned by wrapping it and providing the detail?
I see that the code returns driver.ErrBadConn
if it internally generates errBadConnNoWrite
. I'm guessing this may be the contract that you need to fulfil with database/sql
. Is that the case?
Either way the error in packet.go
comes from writePacket
, writeHandshakeResponsePacket
, writeAuthSwitchPacket
, writeCommandPacket
, writeCommandPacketStr
, writeCommandPacketUint32
, writeExecutePacket
at least as far as I could tell and the it might perhaps be useful to know which of these specific cases is actually the cause of the generic errBadConnNoWrite
error you return now. Given I see the current error I can not see this detail which would be useful to me. I would imagine that same information might be useful to other users of your driver if they want to better understand a connection failure when talking to their backend MySQL server.
A further thought here is that if there is an error of this type the upper layer retries the statement and the application may be completely unaware there was a problem. This is partly good as the app sees no problem, but it is also unaware the connection may be flaky.
If an error is reported by database/sql it will be on the retried attempt so the original error will be lost. Is that right? Even worse.
It might be nice to have such metrics/statistics available somewhere so the application can detect or determine such issues. If you have any sort of long running process or daemon which talks to one or more databases I can see this being interesting, but I would expect this to be perhaps exposed via database/sql or through database/sql. Maybe that is not feasible now. Exposing some metrics from the MySQL driver would be fine.
If the statitics provided were only error statics at least the happy path would not be affected performancewise and the overhead of updating error counters should be negligible anyway.
Do you have any thoughts on these points?
If an error is reported by database/sql it will be on the retried attempt so the original error will be lost. Is that right? Even worse.
Yes. database/sql catches most ErrBadConn and doesn't return it to application. Most errors are lost. That's why I strongly recommend to use debug log repeatedly.
Do you have any thoughts on these points?
"print debug" is the best way.
FWIW: I am not sure this really works. I also checked about https://github.com/golang/go/issues/48309 and this is a feature request to make the retry setting optional (or configurable) which I think is what I would want for normal processing in my case of "testing health of servers I manage/monitor". That still requires you to modify your driver to return different, wrapped errors depending on the specific issues currently seen that generate driver.ErrBadConn
. As things stand pouring over logs as mentioned previously is not helpful as the identity of the connection that fails is not provided. Returning the real error to the caller makes it my problem to correlate the error with the connection not yours.
Note the comment from https://github.com/golang/go/issues/48309#issuecomment-1403657874:
It is encouraged for database drivers to wrap BadConn.
Related I also notice that setting timeout=1s
may not work as expected in the DSN as the retry logic if you return a driver.ErrBadConn
with trigger a new connection with the same timeout, thus potentially making the connect timeout take 2 seconds in some scenarios. That's something I'm looking to fix in my code by using the context option and add a timeout at the database/sql layer.
So it may be good to mention this option to people as the configured behaviour might not match what's expected.
We support Go 1.13 currently. But don't wait we drop Go 1.13-1.17 support.
What is this project's philosophy around supporting different versions of go? Go itself no longer supports any of the versions you mentioned, even with security patches, because their commitment is "Each major Go release is supported until there are two newer major releases" (ref).
When could we expect changes that require the 1.18 interface to be incorporated? Our use case for application awareness of driver.ErrBadConn
isn't debugging, but allowing our services to make decisions.
Edit: I see you have this stated in the README already! "We aim to support the 3 latest versions of Go." So, since it's likely go1.20 will be released in 1-2 months that is the earliest it would be reasonable to request any breaking change requiring go1.18.
I'm concerned there is no progress on this. I think we can probably provide a PR which we're trying internally to resolve the issue and it seems to work fine for us. Reporting of the issue is provided in a better way and also the error is wrapped to ensure that the upstream database/sql layer gets the expected error, but they already handle that.
So if you'd like us to provide a PR to inspect and that would make progress I see that as being very valuable. I just need to confirm with colleagues prior to doing that.
Now logger can be configurable per DB. See #1408 You can add information about which server cause the error log.
Issue description
I manage an application called orchestrator which talks to thousands of MySQL servers and is used to detect MySQL failures and if a master or intermediate master fails to re-arrange the replication topology of the cluster to ensure it can continue to be used. This system has a reasonably short connect timeout of 1 second.
The problem I see is that we get quite a lot of errors when connecting to "the database" prior to doing operations such as reconfiguring the server or reconfiguring from which other MySQL server it should replicate. The failed connect errors are the concern. See: https://jira.percona.com/browse/DISTMYSQL-261 for some context of the issue from the application side.
Orchestrator is currently using the v1.6.0 version of the MySQL driver. I'm aware there's a newer version v1.7.0 but as far as I can see there's no change in logic around the topic being discussed.
The error message itself is "extremely vague":
driver: bad connection
does not indicate the actual problem and the logging does not indicate to which host:port this is actually happening and given the fact I'm continually polling a large number of MySQL servers identifying the source of the problem and the exact cause is actually quite important as orchestrator is intended to determine if the MySQL host is healthy or not, so identifying the reason for a connection failure is important.The bad connection errors are not that frequent but do add up. The problem is it's hard with current code to identify the source of the issue OR the specific issue that is being caused in a single log line.
I see that currently the driver logging combines multiple different conditions under the same umbrella term
errBadConnNoWrite
and in some cases it logs the error independently of the error returned to the caller.It would seem better, given recent changes in go error handling, to extend the
errBadConnNoWrite
errors returned to the caller into separate errors for each condition triggered so that applications can still detect this error witherrors.Is(err, errBadConnNoWrite)
but by wrapping the specific error with the error picked up earlier in the code (within the driver) the full error can be returned to the caller and identified more completely.Ideally I'd like the driver to report in the error something like the
mc.cfg.Addr
value of the host being talked to. If that is not considered acceptable then it would be necessary for the caller to be adapted to record this information for all connections so it can be logged with the error received from the driver when it happens.Summarising: the exact cause of
driver: bad connection
errors is not clearly identified in the error returned to the caller. I believe it would be good to return a more detailed error to clarify each of the different cases where it is returned more explicitly and suggest that users useerrors.Is()
to check forerrBadConnNoWrite
if existing functionality is required. If possible include the address of the host where this happens.If there is a better way to identify the specific errors and the host to which they correspond then please share your thoughts.
Error log
See: https://jira.percona.com/browse/DISTMYSQL-261 for some details/examples
Configuration
Driver version (or git SHA):
*Go version:
quay.io/centos/centos:stream8
docker images now show1.19.2
.*Server version:
*Server OS: