cbrianhill opened this issue 1 month ago
Hi @cbrianhill. At the moment there is little evidence that LOAD MYSQL QUERY RULES TO RUNTIME and the crash are related, and further investigation is required.
Did you add/edit any query rule that may have affected the query in the backtrace? If yes, then there is a possible lead.
Hi @renecannao, I suspected that the crash site might not be directly related to the change in query rules. The change we made to the query rules was to set a value for the error_msg column. We are using an Aurora database, and we updated the rules that route queries to the writer so that they would return the given error_msg. We are also using the ProxySQL cluster functionality to allow the proxies to share updated configuration.
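To be concrete, the change was roughly of this form; the hostgroup id and the message text below are placeholders rather than our exact values, but the shape of the update and the admin command are the same:

UPDATE mysql_query_rules
  SET error_msg = 'Writes are temporarily blocked'   -- placeholder message text
  WHERE destination_hostgroup = 10;                  -- placeholder id for our writer hostgroup
LOAD MYSQL QUERY RULES TO RUNTIME;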
We had deployed version 2.6.3, but believe we ran into https://github.com/sysown/proxysql/issues/4572 and had recently updated our clusters to version 2.6.6.
I'm going to work on a more solid repro scenario today and will update this issue with anything I can find.
Hi @cbrianhill. At this stage, I would say that the query rules update is not relevant at all, and I suspect that ProxySQL Cluster is not relevant either.
The backtrace points to a crash related to the fetching of rows (or, potentially, column definitions) during the execution of a prepared statement.
At this stage my suspicion leans more toward a malformed packet received while retrieving the resultset, due to a broken connection or some other error.
Since you can't share the core dump, I would suggest trying to analyze the retrieved resultset/row (ret=0x7fbb7118dc00, result=0x7fbb7264fe40), and potentially also checking the network buffer in the mysql structure.
@renecannao , thank you for the tips! Really appreciate them. I can safely share the resultset here, which looks interesting to me:
(gdb) p *result
$4 = {row_count = 24, field_count = 6, current_field = 6, fields = 0x7fbb72d50050, data = 0x0, data_cursor = 0x0, field_alloc = {free = 0x0,
used = 0x0, pre_alloc = 0x0, min_malloc = 32, block_size = 8168, block_num = 4, first_block_usage = 0, error_handler = 0x0},
row = 0x7fbb74067480, current_row = 0x7fbb74067480, lengths = 0x7fbb7264fec8, handle = 0x0, eof = 1 '\001', is_ps = 0 '\000'}
I'm not familiar with the code in libmariadb, but it seems like a problem that handle is null here (at least it seems like that will be accessed immediately in mysql_fetch_row_cont()).
The row looks like this:
(gdb) p ret
$1 = (MYSQL_ROW *) 0x7fbb7118dc00
(gdb) p *ret
$2 = (MYSQL_ROW) 0x7fbb71463340
(gdb) p **ret
$3 = 0x7fbb5be07001 "[redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted]\",\"wssUri\":\"wss://[redacted]:443\",\"sigGroupTs\":1728185194668,\"signa"...
The network buffer in the mysql struct looks like this:
(gdb) p mysql->net
$4 = {pvio = 0x7fbb710d0690,
buff = 0x7fbb5be07000 " [redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted]\",\"wssUri\":\"wss://[redacted]\",\"sigGroupTs\":1728185194668,\"sign"...,
buff_end = 0x7fbb5be0a000 "', '[redacted]', '*', 'kitchen-sync-presence', 1727803583643, 1727803583953, '{\\\"workerId\\\":\\\"[redacted]\\\",\\\"wssUri\\\":\\\"wss://[redacted]\\\",\\\"s"...,
write_pos = 0x7fbb5be07000 " [redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted]\",\"wssUri\":\"wss://[redacted]\",\"sigGroupTs\":1728185194668,\"sign"...,
read_pos = 0x7fbb5be07000 " [redacted]\031[redacted]\374>\001{\"workerId\":\"[redacted\\",\"wssUri\":\"wss://[redacted]\",\"sigGroupTs\":1728185194668,\"sign"..., fd = 141, remain_in_buf = 0, length = 0,
buf_length = 0, where_b = 0, max_packet = 12288,
max_packet_size = 1073741824, pkt_nr = 187, compress_pkt_nr = 187,
write_timeout = 0, read_timeout = 30, retry_count = 0, fcntl = 0,
return_status = 0x0, reading_or_writing = 1 '\001',
save_char = 0 '\000', unused_1 = 0 '\000', unused_2 = 0 '\000',
compress = 0 '\000', unused_3 = 0 '\000', unused_4 = 0x0,
last_errno = 0, error = 0 '\000', unused_5 = 0 '\000',
unused_6 = 0 '\000', last_error = '\000' <repeats 511 times>,
sqlstate = "00000", extension = 0x7fbb728d3140}
Hi there @renecannao, I wanted to ping and see whether the latest comment on this issue gives you any ideas or intuition as to what the cause of the crash might be. We appreciate your engagement here and, of course, don't have any expectations regarding further support. I also wanted to mention that we've tested the same operations in our test environment under load and have not been able to reproduce the scenario.
FWIW, the broader context around our efforts is a MySQL upgrade from version 5.7 to version 8 (well, the corresponding Aurora versions). Since we're on an Aurora database, we've tested using the AWS blue/green deployment feature for completing the planned database upgrade, and discovered https://github.com/sysown/proxysql/issues/4223, which essentially boils down to AWS blue/green deployments breaking the Aurora autodiscovery functionality within ProxySQL. We can make another attempt in our production environment, which may work well if the crash was due to something truly extraordinary.
We've observed a crash in ProxySQL. The crash happens immediately after running LOAD MYSQL QUERY RULES TO RUNTIME. We observed this in the logs:
Additional details: the change we applied was an update to the mysql_query_rules table, followed by LOAD MYSQL QUERY RULES TO RUNTIME.
The core dump likely includes sensitive information and may be difficult to share.
However, I can share most of the backtrace from gdb:
We experienced this on our production server, under a small amount of load, but did not experience it while making the same update in our testing environment. I'm planning to continue working on a better repro scenario.