vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.68k stars 2.1k forks source link

Clean up RPC Errors #3231

Open bbeaudreault opened 7 years ago

bbeaudreault commented 7 years ago

The below turned into a lot of words and blocks of code. What i'm looking for in this issue is feedback on the potential changes listed near the bottom. Also, the code blocks below are not line-wrapped due to the formatting of markdown but typically I see them line-wrapped into a big block of text. So keep that in mind in assessing the readability.


For the past couple months I've been intermittently chipping away at removing invalid or UNKNOWN error codes returned by vitess, which are less than helpful for users and alerting. At this point that happens very rarely, and when it does I'll fix on a one-off basis.

Another problem to fix, however, is the error messages themselves. At Hubspot we have 150+ engineers running apps that interact with vitess. When an error happens, they get exceptions like this, which are very hard to parse and filled with vitess "junk" (in their mind):

java.sql.SQLSyntaxErrorException: legacy_code: BAD_INPUT_LEGACY
message: "vtgate: http://default-vtgate-us-east-1e-46-3400743101-zk9q3:15101/: target: 
Terms.0.master, used tablet: qa_iad-342791901 (172.18.166.195), vttablet: rpc error: code = InvalidArgument desc = Illegal mix of collations (ascii_bin,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation \'=\' (errno 1267) (sqlstate HY000) during query: select /*+ MAX_EXECUTION_TIME(30000) */ order_id, order_type, request, created_by, created_at from additional_requests where order_type = 2 and order_id = \'13386Submitted\342\200\213\' limit 100001, CallerID: TermsService-web.hs-deals.vitess"
code: INVALID_ARGUMENT

or

java.sql.SQLNonTransientException: Vitess RPC error: legacy_code: PERMISSION_DENIED_LEGACY
message: "vtgate: http://default-vtgate-us-east-1e-43-899121640-frd30:15101/: execInsertUnsharded: target: ScriptLoader.0.master, used tablet: qa_iad-853611901 (172.18.164.141), vttablet: table acl error: \"ScriptLoaderService-userweb.hs-engineering.vitess\" [ScriptLoaderService-userweb.hs-engineering.vitess hs-engineering.vitess] cannot run INSERT_PK on table \"enabled_scripts\", CallerID: ScriptLoaderService-userweb.hs-engineering.vitess"
code: PERMISSION_DENIED

or

java.sql.SQLRecoverableException: legacy_code: NOT_IN_TX_LEGACY
message: "vtgate: http://default-vtgate-us-east-1e-46-3400743101-d4wph:15101/: execInsertUnsharded: target: Contacts.0.master, used tablet: qa_iad-741647200 (172.18.132.166), vttablet: rpc error: code = Aborted desc = Deadlock found when trying to get lock; try restarting transaction (errno 1213) (sqlstate 40001) during query: insert into list_filter_counts(someField1,someField2) values (x,x) on duplicate key update someField1 = values(someField1), someField2 = values(someField2), CallerID: ContactsApiWeb-web.hs-contactsbackend.vitess"
code: ABORTED

These errors leave an engineer wondering -- why is my code hitting a legacy error? what's the actual issue here? what's the difference between legacy code and code?

I'd like to look at making these errors more user friendly. I dug into the JDBC driver, and it just calls TextFormat.printToString on the RPCError (which is a protobuf). This is what determines the format and order of the legacy_code, message, and code fields. I can make things a little more friendly by writing a better formatter which moves these around. I.E.:

java.sql.SQLRecoverableException: 
vtgate: http://default-vtgate-us-east-1e-46-3400743101-d4wph:15101/: execInsertUnsharded: target: Contacts.0.master, used tablet: qa_iad-741647200 (172.18.132.166), vttablet: rpc error: code = Aborted desc = Deadlock found when trying to get lock; try restarting transaction (errno 1213) (sqlstate 40001) during query: insert into list_filter_counts(someField1,someField2) values (x,x) on duplicate key update someField1 = values(someField1), someField2 = values(someField2), CallerID: ContactsApiWeb-web.hs-contactsbackend.vitess
code: ABORTED

Just ditching the confusing legacy code makes things slightly better, but it's still not great. There's still a redundant rpc error: code = Aborted, the useless (to them) vtgate address, a few cryptic things (to an end-user) in execInsertUnsharded and qa_iad-741647200, the lack of newlines, and non-ideal ordering of fields.

Unfortunately the rest of the problem comes from the message field, which is constructed somewhere between vttablet and vtgate and would be a pain to re-format in the JDBC due to its unstructured nature.

Based on the errors i've seen I can imagine a few potential changes to vterrors and RPCError:

All of these changes together could result in a message like this:

java.sql.SQLRecoverableException: Deadlock found when trying to get lock; try restarting transaction (errno 1213) (sqlstate 40001) during query: insert into list_filter_counts(someField1,someField2) values (x,x) on duplicate key update someField1 = values(someField1), someField2 = values(someField2)
code: ABORTED
vtgate: http://default-vtgate-us-east-1e-46-3400743101-d4wph:15101/
vttablet:  Contacts.0.master, used tablet: qa_iad-741647200 (172.18.132.166)
trace: execInsertUnsharded
CallerID: ContactsApiWeb-web.hs-contactsbackend.vitess

We could also add a query field if we wanted rather than using the during query: ... format.

Thoughts on these changes, or any ideas on other ways of making user-friendly error messages?

bbeaudreault commented 7 years ago

It may make sense to specifically mark errors from mysql. Even if the last exception above is clearer, people still would come to us not knowing if it's a mysql error or a vitess error.

GuptaManan100 commented 3 years ago

@harshit-gangal our error handling has improved significantly since this issue was created. Is there something left for us to do before closing it?

harshit-gangal commented 3 years ago

We still do not format the way it is shown above. We can keep this open.