ravmoj closed this issue 1 year ago.
Hi @ravmoj ,
Can you try a few things, as mentioned below, to help investigate this issue further?
Are there any additional debug logs in the prod auth API service that would show whether the timed-out requests ever actually reach the endpoint?
Can you simulate traffic in non-prod to replicate this error? If you can replicate it, try these 2 things:
Enable cross-namespace traffic routing between the client services and the auth endpoint and observe the traffic behaviour. You can set up a network policy to connect the 2 services inside the cluster: https://user-guide.cloud-platform.service.justice.gov.uk/documentation/networking/network-policy.html
Temporarily enable access to the auth API via the default nginx ingress.
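For reference, the network-policy approach from the linked guide boils down to allowing ingress from the client's namespace into the namespace serving auth. A minimal sketch (namespace names and the namespace label key/value are assumptions; see the linked guide for the labels actually used on Cloud Platform namespaces):

```yaml
# Illustrative NetworkPolicy: allow traffic from a client namespace
# into the namespace running hmpps-auth.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-client-namespace        # assumed name
  namespace: hmpps-auth-dev           # assumed: namespace of the auth service
spec:
  podSelector: {}                     # applies to all pods in this namespace
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              component: client-namespace   # assumed label on the client namespace
```

The client service can then call auth via its in-cluster DNS name (`<service>.<namespace>.svc.cluster.local`) instead of the public ingress.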
Hi @poornima-krishnasamy , thanks for your comments. Updated stats for the last 24 hours in https://dsdmoj.atlassian.net/browse/HAAR-1392 Also, as per your suggestions: https://github.com/ministryofjustice/cloud-platform-environments/pull/12795 https://github.com/ministryofjustice/hmpps-auth/pull/1519
Can you please review the cloud platform changes and nginx ingress changes? Thank you.
Hi @ravmoj , I have added a comment on the PR. Can you please have a look? https://github.com/ministryofjustice/cloud-platform-environments/pull/12795#discussion_r1183938993
Thanks
Hi @poornima-krishnasamy , thanks for your time today. As discussed, I have changed 3 services (hmpps-manage-users-api, hmpps-audit-api and hmpps-external-users-api) to enable internal service interaction with auth.
https://github.com/ministryofjustice/hmpps-manage-users-api/pull/272 https://github.com/ministryofjustice/hmpps-audit-api/pull/171 https://github.com/ministryofjustice/hmpps-external-users-api/pull/122 Have updated statistics for 7 days / 24 hours to JIRA: https://dsdmoj.atlassian.net/browse/HAAR-1392
Thank you
The user is testing the network policy to access the services internally. Moving this to blocked until the user finishes testing.
Discussed with @poornima-krishnasamy ,
After changing to internal calls to hmpps-auth for 4 services (token-verification-api, hmpps-manage-users-api, hmpps-audit-api and hmpps-external-users-api), the response times of the jwks endpoint have improved significantly.
To see whether ModSecurity plays any role in the slow response times, we are disabling it in DEV to test the same.
Other services are still showing slow response times in DEV.
Logged all changes in: https://dsdmoj.atlassian.net/browse/HAAR-1392
Hi @poornima-krishnasamy , after removing ModSecurity from the HMPPS-AUTH service, the last 24 hours' stats show fewer requests in the 7-15 sec category (maybe we need a longer duration to compare).
Last 24 hours, 'GET /auth/.well-known/jwks.json':
With ModSecurity (25/05/2023 13:24 to 26/05/2023 13:24): total requests 5906; 144 took more than half a sec to respond (2.44% of requests).
Without ModSecurity (01/06/2023 13:24 to 02/06/2023 13:24): total requests 4496; 9 took more than half a sec to respond (0.09% of requests).
Stats for a one-week period:
With ModSecurity (17/05/2023 08:50 to 24/05/2023 08:50): total requests 29359; 727 took more than half a sec (2.4762% of requests) to serve 'GET /auth/.well-known/jwks.json'. (17-24/05/2023: Log Analytics query)
Without ModSecurity (31/05/2023 17:30 to 07/06/2023 13:24): total requests 23484; 40 took more than half a sec (0.1703% of requests) to serve 'GET /auth/.well-known/jwks.json'. (31/05 to 07/06/2023: Log Analytics query)
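For reference, the one-week percentages above follow directly from the slow-request counts and totals; a minimal sketch to reproduce them:

```python
# Recompute the slow-request percentages quoted in the one-week stats.
def slow_pct(slow: int, total: int) -> float:
    """Percentage of requests slower than the threshold, to 4 decimal places."""
    return round(slow / total * 100, 4)

# With ModSecurity, 17-24/05/2023: 727 slow out of 29359
print(slow_pct(727, 29359))   # ≈ 2.4762
# Without ModSecurity, 31/05-07/06/2023: 40 slow out of 23484
print(slow_pct(40, 23484))    # ≈ 0.1703
```

The same arithmetic applies to the 24-hour samples (e.g. 144/5906 gives the quoted 2.44% when rounded to 2 decimal places).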
Hi @poornima-krishnasamy , thanks for your time. Please let us know the next steps you may require from us to investigate ModSecurity.
Hi @ravmoj , thanks for the investigation. It is clear that the % of requests taking more than half a sec is much smaller when using the internal service URL than with the default ingress controller.
I understand that modsecurity plays a vital role in processing each request. To rule out a bottleneck in the modsec ingress controller pods, we will increase the number of modsec replicas. I will let you know once this is done, and let's see whether a week's stats show any improvement.
Alternatively, I would suggest continuing to use the internal service URL if you want to reduce the 2.4% of slow responses.
Hi @ravmoj , before I increase the number of replicas, can we take a pre-stats of
Hi @poornima-krishnasamy , thanks for your time today. As discussed, we will investigate point 4 and get back to you. It would be great if you could let us know any further findings with respect to modsec. Thanks.
Hi @ravmoj , after discussing with you and @mattops , we found that when load testing any endpoint (e.g. ping) from one namespace to another endpoint inside the cluster, 2-3% of responses are slow. This happens even when modsec is turned off and SecRuleEngine is Off as well.
We are planning 2 options as next steps:
I will let you know after doing the above.
Hi @ravmoj , while load testing with ab on the test cluster, I can see slow responses in general for 1-2% of traffic, so I am not sure this is specific to the JWKS responses. I was able to get better responses after increasing the pod replicas.
I think we haven't tested this before. Can you please increase the number of replicas of the hmpps-auth pods and observe the responses?
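Increasing the hmpps-auth replica count is typically a one-line values change if the service is deployed with the standard HMPPS helm chart; an illustrative fragment (chart name, key and value are assumptions, check the service's own values file):

```yaml
# values.yaml fragment (illustrative only)
generic-service:
  replicaCount: 4   # assumed value; bump from the current count and re-measure
```

After the rollout, the same jwks-endpoint stats as before can be collected for a like-for-like comparison.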
Thanks, Poornima.
Also, I have tested changing the modsecurity configuration by turning off the settings below, but the % of slow responses is the same: SecRuleEngine Off and SecAuditEngine Off.
So it looks like there is no performance change from the modsecurity configuration.
Hi all, I have spun up a new modsec controller in the live cluster: modsec-02. Can you please point any of your non-prod hmpps-auth ingresses to the new modsec ingressClass, modsec-02?
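Switching an ingress to the new controller is usually just a matter of setting its ingress class; an illustrative fragment (resource name and which mechanism applies depend on how the ingress is defined):

```yaml
# Illustrative Ingress fragment pointing at the modsec-02 controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hmpps-auth        # assumed resource name
spec:
  ingressClassName: modsec-02   # older setups may use the
                                # kubernetes.io/ingress.class annotation instead
```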
Also, please take some initial numbers on the performance, % of slow responses, etc. Once that is done and recorded, let us know here. The things we discussed modifying in the ingress-controller config:
SecResponseBodyAccess Off
SecAuditEngine Off
Then check the numbers after every change to the config.
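If these directives are to be trialled per-ingress rather than controller-wide, ingress-nginx supports passing ModSecurity directives via annotations; a hedged sketch (whether annotations or controller-level config is appropriate here depends on the Cloud Platform setup):

```yaml
# Illustrative ingress annotations applying the discussed directives.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/enable-modsecurity: "true"
    nginx.ingress.kubernetes.io/modsecurity-snippet: |
      SecResponseBodyAccess Off
      SecAuditEngine Off
```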
Potential leads around sockets exceeding the kernel limit and being dropped:
https://deploy.live/blog/kubernetes-networking-problems-due-to-the-conntrack/
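If the conntrack lead from that post pans out, node-level symptoms are typically checked with `conntrack -S` (rising `insert_failed` counters indicate a full table) and by comparing `net.netfilter.nf_conntrack_count` against `net.netfilter.nf_conntrack_max`. The limit can then be raised via sysctl; an illustrative fragment (the value is an assumption, not a recommendation, and on managed clusters this is set at the node/kubelet level):

```
# /etc/sysctl.d/99-conntrack.conf (illustrative value only)
net.netfilter.nf_conntrack_max = 1048576
```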
After the ModSecurity update on 10/07/2023:
Period: 11/07/2023 11:00 to 19/07/2023 11:00; total requests 154743.
74 took more than half a sec (0.0478% of requests) and 117 took more than 250 milliseconds (0.0756% of requests) to serve 'GET /auth/.well-known/jwks.json'.
Service name
hmpps-auth
Service environment
Impact on the service
Services requesting the jwks file from the auth endpoint (GET /auth/.well-known/jwks.json) failed with ReadTimeoutException. (Slack link) Auth being a central service, this is affecting all dependent services.
Problem description
Added details to the document. (https://docs.google.com/document/d/1XR3zbnRNQ7du7PBfWrTClc3hKzmtVn46MHCPZ8hdPV4/edit)
Contact person
Name: Ravindra Kumar (#hmpps-auth-audit-registers)
Email: ravindra.kumar@digital.justice.gov.uk
Slack handle: https://mojdt.slack.com/archives/D03D551B9T4