ministryofjustice / nvvs-devops

Documentation for the NVVS DevOps Team
https://ministryofjustice.github.io/nvvs-devops
MIT License
4 stars 0 forks source link

SPIKE DHCP admin failing with mysql connection errors #485

Closed satishgummadellimoj closed 8 months ago

satishgummadellimoj commented 1 year ago

User Story: As a Devops engineer I want to understand how DHCP Admins invokes Kea agent running on DHCP Server.

UPDATE: The issue reoccurred on 11/12/2023 at 9:58. The issue was resolved by only redeploying the API service. Therefore we can narrow down the logs investigation to the API logs.

Context: We have seen a reoccurring issue with the DNS/DHCP portal that generated an error when users tried to click on sites. We could resolve this issue by redeploying the pipeline but we wanted to build an understanding of what was causing the issue. Satish has conducted work to look to understand how DHCP admins invoke kea running agent on DHCP server and has placed the findings below:

root issue https://app.zenhub.com/workspaces/nvvs-devops-622a0b371800e400133bb924/issues/gh/ministryofjustice/nvvs-devops/417

DHCP issue info: https://dsdmoj.atlassian.net/wiki/spaces/MOJO/pages/4550066177/DNS+DHCP+Portal+Working

What to do:

Time box: 2 days for this ticket

ticket 606 raised to cover work required as output of this spike

tomwells98 commented 1 year ago

Satish to update with to do list

darey-io commented 10 months ago

I have analysed the last 12 months metrics of the admin db so that we can identify if there is a pattern of spike in resource usage. But it all look healthy and there is no cause for alarm there.

With the error returned when this issue occurs, It relates to SQL concurrency and Query execution order issues.

Concurrency Issues:

Query Execution Order:

Potential solution

I will be reproducing the error next so that these database configuration changes can be tested.

darey-io commented 10 months ago

Started simulating the failure using the scripts below

Dockerfile `######### Use an official MySQL runtime as a parent image FROM mysql:latest

######## Environment variables for MySQL ENV MYSQL_ROOT_PASSWORD=root_password ENV MYSQL_DATABASE=my_database ENV MYSQL_USER=my_user ENV MYSQL_PASSWORD=my_password

######### Create a directory for scripts WORKDIR /docker-entrypoint-initdb.d

######### Copy SQL script to initialize database schema and dummy data COPY init.sql /docker-entrypoint-initdb.d/

######### Copy shell script to simulate concurrency issue COPY run_script.sh /docker-entrypoint-initdb.d/

COPY max_conn.sh /docker-entrypoint-initdb.d/

######### Grant permissions to the scripts RUN chmod +x /docker-entrypoint-initdb.d/init.sql /docker-entrypoint-initdb.d/run_script.sh

######### Expose MySQL default port EXPOSE 3306 `

Build and run image `#!/bin/bash

######### Stop all running containers docker stop $(docker ps -q)

######### Generate a unique tag using timestamp TAG="mysql-concurrency-test:$(date +%Y%m%d%H%M%S)"

######### Build the Docker image with a unique tag docker build -t "$TAG" .

######### Run the Docker container in detached mode with a specific name CONTAINER_NAME="my-container-$(date +%Y%m%d%H%M%S)" docker run -d --name "$CONTAINER_NAME" "$TAG"

######### Execute bash inside the running container docker exec -it "$CONTAINER_NAME" bash `

SQL `-- Create the database schema CREATE TABLE IF NOT EXISTS lease4 ( id INT AUTO_INCREMENT PRIMARY KEY, address VARCHAR(255), hwaddr VARCHAR(255), client_id VARCHAR(255), valid_lifetime INT, expire DATETIME, subnet_id INT, fqdn_fwd VARCHAR(255), fqdn_rev VARCHAR(255), hostname VARCHAR(255), state VARCHAR(255), user_context VARCHAR(255) );

-- Insert dummy data INSERT INTO lease4 (address, hwaddr, client_id, valid_lifetime, expire, subnet_id, fqdn_fwd, fqdn_rev, hostname, state, user_context) VALUES ('192.168.1.1', '00:11:22:33:44:55', 'client1', 3600, NOW(), 1, 'example.com', 'com.example', 'host1', 'active', 'context1'), ('192.168.1.2', '00:11:22:33:44:66', 'client2', 7200, NOW(), 2, 'example.org', 'org.example', 'host2', 'inactive', 'context2'); `

Simulate SQL error max_conn.sh `#!/bin/bash

######### Start the script in the background for i in {1..5}; do ./run_script.sh & done

######### Wait for all background processes to finish wait

######### After the script completes, query and display the max connections max_connections=$(mysql -u root -proot_password -e "SHOW STATUS LIKE 'Threads_connected';" | awk '{print $2}') echo "Max Connections Used: $max_connections"`

darey-io commented 10 months ago

Currently seeing maximum connections error based on the simulation. But we need to see an error specifically relating to "Commands out of sync; you can't run this command now (error code 2014))"

darey-io commented 10 months ago

In our attempt to reproduce this error because we cannot conclude the spike without being able to reporduce the error, I and Stephen figured out that our script is giving too many connections error instead of the "command out of sync" error.

Upon further digging we came across this post https://stackoverflow.com/questions/614671/commands-out-of-sync-you-cant-run-this-command-now

The summary is, the way the mysqli library which the application is using to interact with mysql is different from the way our shell script is interacting. Hence, the best way to reproduce this is to write an app (Ruby or PHP) that uses the mysqli library and as such we can simulate the error directly from the application.

darey-io commented 9 months ago

A ticket has been created to implement a simple application to use the mysqli library to simulate this failure. https://app.zenhub.com/workspaces/nvvs-devops-622a0b371800e400133bb924/issues/gh/ministryofjustice/nvvs-devops/577

darey-io commented 9 months ago

Below is insights into the SQL error following the discoveries from spike.

Definition of the "Command out of sync" error as documented by MySQL can be found here - https://dev.mysql.com/doc/refman/8.0/en/commands-out-of-sync.html. It occurs when in the client code (In this case KEA Server), is calling functions in the wrong order.

Breaking down the flow leading to the error....

  1. staff-device-dns-dhcp-admin interacts with staff-device-dhcp-server over http and calls the update_dhcp_config function which configures database connection and handles required changes to the DHCP infrastructure. reference --> https://github.com/ministryofjustice/staff-device-dns-dhcp-admin/blob/main/app/controllers/subnets_controller.rb#L39

  2. Part of the DHCP Server configuration has "multi-threading" enabled. The "Commands out of sync" error typically occurs in MySQL when commands are executed out of sequence. This is likely to happen when multithreading is enabled with a high thread pool size, and different threads are executing commands on the same connection, or when result sets from previous commands are not properly handled before new commands are executed. The specific setting that can cause this to happen is the "Thread Pool and Queue Size" parameters as configured here https://github.com/ministryofjustice/staff-device-dhcp-server/blob/main/dhcp-service/config_api.json#L65

  3. The "thread-pool-size" and "packet-queue-size" parameters control the number of threads and the size of the packet queue, respectively. A larger number of threads can increase the likelihood of concurrent database access problems if not managed correctly.

Recommendations:

  1. Ideally we would need to configure the client application (KEA Server) to handle the SQL querry more efficiently. However, this may not be a feasible option for us as the particular querry causing this error as reported here

https://app.zenhub.com/workspaces/nvvs-devops-622a0b371800e400133bb924/issues/gh/ministryofjustice/nvvs-devops/485 [22a912f-0e4f-40fe-877e-b1196541d8a7] Gateways::KeaControlAgent::InternalError (unable to execute for <SELECT address, hwaddr, client_id, valid_lifetime, expire, subnet_id, fqdn_fwd, fqdn_rev, hostname, state, user_context FROM lease4 WHERE subnet_id = ?>, reason: Commands out of sync; you can't run this command now (error code 2014)):

is found in the KEA source code here https://github.com/isc-projects/kea/blob/c48734c20e47355cc4b81d9316a582566eaa608b/src/lib/dhcpsrv/mysql_lease_mgr.cc#L141

  1. Adjusting Multi-Threading/thread pool Settings: As a diagnostic step, when the error occurs again, reducing the "thread-pool-size" and observing the experience if the issue persists. This can help determine if the issue is related to concurrent database access.

  2. Before the error occurs again, we can Implement monitoring and setup a dashboard to track specific behaviours, like when different threads from the KEA Server execute commands on the same connection or when result sets from previous commands are not properly handled. This can be quite complex to do. However, it is feasible with a combination of MySQL's internal logging mechanisms, integration with Prometheus and Grafana and possibly some custom scripting. If we would engage in an activity like this, it would involve steps like...

darey-io commented 9 months ago

This has been documented in Confluence https://dsdmoj.atlassian.net/wiki/spaces/MOJO/pages/4651417673/SPIKE+DHCP+admin+failing+with+mysql+connection+errors