ministryofjustice / nvvs-devops

Documentation for the NVVS DevOps Team
https://ministryofjustice.github.io/nvvs-devops
MIT License

SPIKE - Investigate DNS/DHCP portal issue #414

Closed tomwells98 closed 1 year ago

tomwells98 commented 1 year ago

What: A spike to investigate why users were seeing the below error message on the portal:

image_720.png

juddin927 commented 1 year ago

We observed the following errors/warnings whilst investigating the portal "page not found" issue:

1. Database error (DHCP admin logs)

(UTC+1:00) [22a912f1-0e4f-40fe-877e-b1196541d8a7] Gateways::KeaControlAgent::InternalError (unable to execute for <SELECT address, hwaddr, client_id, valid_lifetime, expire, subnet_id, fqdn_fwd, fqdn_rev, hostname, state, user_context FROM lease4 WHERE subnet_id = ?>, reason: Commands out of sync; you can't run this command now (error code 2014)):

2. staff-device-production-dhcp-primary-service

2023-07-27 06:50:00.869 ERROR [kea-dhcp4.bad-packets/119.140385506216760] DHCP4_PACKET_NAK_0001 [hwtype=1 40:b0:34:9c:40:1a], cid=[01:40:b0:34:9c:40:1a], tid=0xd3e0000: failed to select a subnet for incoming packet, src 10.150.211.1, type DHCPDISCOVER

3. DHCP4_PACKET_SEND_FAIL

ERROR [kea-dhcp4.packets/119.140385506216760] DHCP4_PACKET_SEND_FAIL [hwtype=1 a0:ce:c8:90:87:d5], cid=[01:a0:ce:c8:90:87:d5], tid=0x14f25a27: failed to send DHCPv4 packet: pkt4 send failed: sendmsg() returned with an error: Permission denied

error -"a0:ce:c8:6a:de:eb" -"a0:ce:c8:f8:17:20"

4. HA_LEASE_UPDATE_FAILED

2023-07-27 11:28:12.040 WARN [kea-dhcp4.ha-hooks/119.140385444977464] HA_LEASE_UPDATE_FAILED [hwtype=1 04:0e:3c:77:6a:79], cid=[01:04:0e:3c:77:6a:79], tid=0x19f6964a: lease update to standby (http://10.180.81.4:8000) failed: failed to update the lease with address 10.81.124.127 either because the lease has been deleted or it has changed in the database, in both cases a retry might succeed, error code 1

juddin927 commented 1 year ago

  1. The database query error cleared after rebooting the DHCP server container.
  2. The DHCPDISCOVER error occurs because no scope is configured for the incoming source IP/subnet. This is a known issue, and a call has been raised separately with LAN to address that particular source IP.
  3. The pkt4 error is a known, pre-existing error, and there is already a ticket in the repo for it. It is an old issue. My research suggests it is related to the permissions of the user that runs the Kea agent, or it could be a network policy blocking the traffic.
  4. The "HA_LEASE_UPDATE_FAILED" issue has cleared. It could have been caused by the database issue in item 1, with updates possibly queued for a few days until it fully cleared.

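The subnet-selection failure in item 2 can be illustrated with a small Ruby sketch. This is a simplified model only: the subnet list and the `select_subnet` helper are hypothetical, and Kea's real selection logic takes more into account than a plain prefix match.

```ruby
require "ipaddr"

# Hypothetical subnets for illustration; NOT the real Kea configuration.
CONFIGURED_SUBNETS = ["10.81.124.0/24", "10.180.81.0/24"].map { |cidr| IPAddr.new(cidr) }

# Returns the first configured subnet that contains the address, or nil
# when no scope covers it (the "failed to select a subnet" case).
def select_subnet(address, subnets = CONFIGURED_SUBNETS)
  ip = IPAddr.new(address)
  subnets.find { |subnet| subnet.include?(ip) }
end

# The source address from the log line has no matching scope in this model.
puts select_subnet("10.150.211.1").inspect
# An address inside a configured scope is matched.
puts select_subnet("10.81.124.127").inspect
```

In this model the fix is to add a scope covering the source address, which matches the action raised with LAN above.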
satishgummadellimoj commented 1 year ago

After researching the DB error, these are the findings:

The "Commands out of sync" error raised through Gateways::KeaControlAgent can occur when:

1) multiple queries run on the same connection without their result sets being handled properly

2) a query's result set is not fully fetched before the next query is executed on the same connection

satishgummadellimoj commented 1 year ago

Error code 2014 is a MySQL client error (CR_COMMANDS_OUT_OF_SYNC).

This error occurs when you have executed a query that returned a result set but didn't fetch all the rows from that result set before executing another query on the same database connection. It can happen if you have multiple active queries on the same connection without properly handling the results.
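A toy model in Ruby may make the mechanism clearer. `ToyConnection` and `CommandsOutOfSync` are hypothetical names invented for this sketch. It is not the mysql2 API and does not talk to a database. It only mimics the rule that a connection with an unread result set rejects the next query.

```ruby
# Toy model of a single database connection; NOT the mysql2 API.
class CommandsOutOfSync < StandardError; end

class ToyConnection
  def initialize
    @pending = nil # rows of a result set that has not been read yet
  end

  # Starts a query. Fails if the previous result set is still unread.
  def query(rows)
    raise CommandsOutOfSync, "Commands out of sync; you can't run this command now" if @pending
    @pending = rows.dup
    self
  end

  # Reads all remaining rows, leaving the connection ready for the next query.
  def fetch_all
    rows = @pending || []
    @pending = nil
    rows
  end
end

conn = ToyConnection.new
conn.query(%w[lease-a lease-b])   # result set left unread
begin
  conn.query(%w[lease-c])         # second query on the same connection
rescue CommandsOutOfSync => e
  puts "error: #{e.message}"
end
conn.fetch_all                    # drain the first result set
conn.query(%w[lease-c]).fetch_all # now the next query is accepted
```

The same rule explains why sharing one connection between concurrent callers is risky: one caller's unread result blocks the other caller's query.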

satishgummadellimoj commented 1 year ago

What we can do:

Check for unclosed connections: Ensure that you are closing database connections after you are done using them. Open connections can lead to this error if you try to execute new queries without closing the previous ones.

Update Gems: Ensure that all your gems, including the MySQL gem and any other relevant gems, are up to date.

Consider using connection pooling: Connection pooling can help manage database connections efficiently and avoid issues with unclosed connections.

Enable Database Query Logging: Enable logging of database queries for the affected environment (development or staging) to understand the sequence of queries being executed. This can help you trace the issue and identify where the problem lies.
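As a rough illustration of the connection pooling idea, here is a minimal sketch using only Ruby's standard library. `TinyPool` is a hypothetical class written for this comment, and the strings stand in for real database connections. In a Rails app, Active Record already manages a pool, so this only shows the principle that each caller gets exclusive use of a connection and always returns it.

```ruby
# Minimal fixed-size pool built on Ruby's thread-safe Queue (illustrative only).
class TinyPool
  # The block is called once per slot to build each connection.
  def initialize(size)
    @available = Queue.new
    size.times { |i| @available << yield(i) }
  end

  # Checks a connection out, yields it, and always returns it to the pool,
  # even if the block raises.
  def with
    conn = @available.pop # blocks until a connection is free
    yield conn
  ensure
    @available << conn if conn
  end

  # Number of connections currently free.
  def available
    @available.size
  end
end

pool = TinyPool.new(2) { |i| "conn-#{i}" } # strings stand in for real connections
pool.with { |conn| puts "using #{conn}, #{pool.available} left in pool" }
puts "#{pool.available} available after use"
```

Because a checked-out connection is never shared, one caller cannot issue a query while another caller's result set is still unread on the same connection.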

satishgummadellimoj commented 1 year ago

Currently our dns-dhcp-admin code does not specify a connection pool size, so Ruby on Rails with the mysql2 gem uses the default pool size of 5.

satishgummadellimoj commented 1 year ago

Follow-up tickets: 1) Understand how DHCP Admin invokes the Kea agent running on the DHCP server. 2) Enable database query logging.

satishgummadellimoj commented 1 year ago

https://app.zenhub.com/workspaces/nvvs-devops-622a0b371800e400133bb924/issues/gh/ministryofjustice/nvvs-devops/418

https://app.zenhub.com/workspaces/nvvs-devops-622a0b371800e400133bb924/issues/gh/ministryofjustice/nvvs-devops/417