ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Auth JWKS Response Times #4518

Closed: ravmoj closed this issue 1 year ago

ravmoj commented 1 year ago

Service name

hmpps-auth

Service environment

Impact on the service


Services requesting the JWKS file from the auth endpoint (GET /auth/.well-known/jwks.json) are failing with ReadTimeoutException. (Slack link) Since Auth is a central service, this is affecting all dependent services.
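For anyone reproducing this, a minimal sketch of timing the endpoint from a client; the hostname is a placeholder for the relevant environment's auth URL, not a value from this issue:

```bash
# Time repeated JWKS fetches to spot slow responses (hostname is a placeholder).
for i in $(seq 1 20); do
  curl -sS -o /dev/null \
    -w 'status=%{http_code} total=%{time_total}s\n' \
    "https://<hmpps-auth-host>/auth/.well-known/jwks.json"
done
```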

Problem description

Added details to the document. (https://docs.google.com/document/d/1XR3zbnRNQ7du7PBfWrTClc3hKzmtVn46MHCPZ8hdPV4/edit)

Contact person

Name: Ravindra Kumar (#hmpps-auth-audit-registers)
Email: ravindra.kumar@digital.justice.gov.uk
Slack: https://mojdt.slack.com/archives/D03D551B9T4

poornima-krishnasamy commented 1 year ago

Hi @ravmoj ,

Can you look into a few things, listed below, to help investigate this issue further?

  1. Are there any additional debug logs in the prod auth API services that would help determine whether the timed-out requests ever actually reach the endpoint?

  2. Can you simulate traffic in non-prod to replicate this error (a load-test sketch follows below)? If you can replicate it, you can try the 2 things
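For example, a minimal load-test sketch with ApacheBench; the hostname and request counts are placeholders, not an agreed test plan:

```bash
# Fire 500 requests, 10 at a time, at the JWKS endpoint in a non-prod environment,
# then review the percentile breakdown that ab prints at the end.
ab -n 500 -c 10 "https://<hmpps-auth-dev-host>/auth/.well-known/jwks.json"
```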

ravmoj commented 1 year ago

Hi @poornima-krishnasamy , thanks for your comments. Updated stats for the last 24 hours in https://dsdmoj.atlassian.net/browse/HAAR-1392. Also, as per your suggestions:
https://github.com/ministryofjustice/cloud-platform-environments/pull/12795
https://github.com/ministryofjustice/hmpps-auth/pull/1519

Can you please review the cloud platform changes and the nginx ingress changes? Thank you

poornima-krishnasamy commented 1 year ago

Hi @ravmoj , I have added a comment on the PR. Can you please have a look? https://github.com/ministryofjustice/cloud-platform-environments/pull/12795#discussion_r1183938993

Thanks

ravmoj commented 1 year ago

Hi @poornima-krishnasamy , thanks for your time today. As discussed, we changed 3 services (hmpps-manage-users-api, hmpps-audit-api and hmpps-external-users-api) to enable internal service interaction with auth.

https://github.com/ministryofjustice/hmpps-manage-users-api/pull/272
https://github.com/ministryofjustice/hmpps-audit-api/pull/171
https://github.com/ministryofjustice/hmpps-external-users-api/pull/122

Statistics for the last 7 days / 24 hours have been updated in JIRA: https://dsdmoj.atlassian.net/browse/HAAR-1392
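For reference, "internal interaction" here means the consumers call hmpps-auth via its in-cluster service DNS name rather than the public ingress. A rough sketch of verifying that path; the service, namespace and pod names are placeholders, not the values used in the PRs above:

```bash
# Run a throwaway curl pod in the consumer's namespace and hit auth over cluster DNS.
kubectl -n <consumer-namespace> run jwks-check --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' \
  "http://hmpps-auth.<hmpps-auth-namespace>.svc.cluster.local/auth/.well-known/jwks.json"
```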

Thank you

poornima-krishnasamy commented 1 year ago

The user is testing the NetworkPolicy for accessing the services internally. Moving this to blocked until the user finishes testing.
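For context, cross-namespace calls inside the cluster need a NetworkPolicy in the receiving namespace. A minimal sketch of the shape of that policy; the names and labels are placeholders, not the actual manifests from the environments PR:

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-consumer-to-auth        # placeholder name
  namespace: <hmpps-auth-namespace>   # namespace that runs hmpps-auth
spec:
  podSelector: {}                     # applies to all pods in the namespace
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: <consumer-namespace>
EOF
```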

ravmoj commented 1 year ago

Discussed with @poornima-krishnasamy ,

After changing 4 services (token-verification-api, hmpps-manage-users-api, hmpps-audit-api and hmpps-external-users-api) to call hmpps-auth internally, response times on the jwks endpoint have improved significantly.

To see whether ModSecurity plays any part in the slow response times, we are disabling it in DEV to test.
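For the record, one common way ModSecurity is toggled per ingress with ingress-nginx is the enable-modsecurity annotation; a hedged sketch only, as the actual DEV change may have been made differently and the names are placeholders:

```bash
kubectl -n <hmpps-auth-dev-namespace> annotate ingress <hmpps-auth-ingress> \
  nginx.ingress.kubernetes.io/enable-modsecurity="false" --overwrite
```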

Other services are still seeing higher response times in DEV.


Logged all changes in: https://dsdmoj.atlassian.net/browse/HAAR-1392

ravmoj commented 1 year ago

Hi @poornima-krishnasamy , after taking ModSecurity off the HMPPS-AUTH service, the stats for the last 24 hours show fewer requests falling into the 7-15 second category (we may need a longer period to compare).


ravmoj commented 1 year ago

144 took more than half a sec
Period: 25/05/2023 13:24 to 26/05/2023 13:24
Total requests: 5906
144 took more than 5 sec to respond
2.44% of requests took more than half a sec to serve 'GET /auth/.well-known/jwks.json'

9 took more than half a sec
Period: 01/06/2023 13:24 to 02/06/2023 13:24
Total requests: 4496
9 took more than 5 sec to respond
0.09% of requests took more than half a sec to serve 'GET /auth/.well-known/jwks.json'

ravmoj commented 1 year ago

Stats for a one-week period:

With ModSecurity:
727 took more than half a sec
Period: 17/05/2023 08:50 to 24/05/2023 08:50
Total requests: 29359
2.4762% of requests took more than half a sec to serve 'GET /auth/.well-known/jwks.json'
17-24/05/2023: Go to Log Analytics and run query

Without ModSecurity:
40 took more than half a sec
Period: 31/05/2023 17:30 to 07/06/2023 13:24
Total requests: 23484
0.1703% of requests took more than half a sec to serve 'GET /auth/.well-known/jwks.json'
31/05 to 07/06/2023: Go to Log Analytics and run query

ravmoj commented 1 year ago

Hi @poornima-krishnasamy , thanks for your time. Please let us know the next steps and anything you need from us to investigate ModSecurity.

poornima-krishnasamy commented 1 year ago

Hi @ravmoj , thanks for the investigation. It is clear that the percentage of requests taking more than half a second is much smaller when using the internal service URL and the default ingress controller.

I understand that ModSecurity plays a vital role in processing each request. To rule out a bottleneck in the modsec ingress-controller pods, we will increase the number of modsec replicas. I will let you know once this is done, and then let's look at a week of stats to see whether that improves anything.

Alternatively, I would suggest continuing to use the internal service URL if you want to reduce the 2.4% of slow responses.

poornima-krishnasamy commented 1 year ago

Hi @ravmoj , before I increase the number of replicas, can we capture some baseline stats:

  1. When is the 2.44% of slow responses happening, i.e. at what times of day? (See the query sketch after this list.)
  2. What type of call is it? I am assuming it is a GET request? Please confirm.
  3. The number of rules enabled on the ingress for processing requests to the auth URL.
  4. Stats for any other GET requests sent to the auth URL and their response times, to compare against jwks.json.
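For point 1, a sketch of the kind of App Insights query that could break the slow responses down by time of day; the app id is a placeholder and it assumes the az application-insights CLI extension is installed:

```bash
# Hourly totals and count of requests slower than half a second (duration is in ms).
az monitor app-insights query --app <app-insights-id> --analytics-query "
requests
| where name == 'GET /auth/.well-known/jwks.json' and timestamp > ago(7d)
| summarize total = count(), over_half_sec = countif(duration > 500) by bin(timestamp, 1h)
| order by timestamp asc
"
```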
ravmoj commented 1 year ago

Hi @poornima-krishnasamy , thanks for your time today. As discussed, we will investigate point 4 and get back to you. It would be great if you could let us know any further findings with respect to modsec. Thanks.

poornima-krishnasamy commented 1 year ago

Hi @ravmoj , after discussing with yourself and @mattops , we found that when load testing any endpoint inside the cluster (e.g. ping from one namespace to another), there are 2-3% slow responses. This happens even when modsec is turned off and SecRuleEngine is set to Off as well.

There are 2 options we are planning as next steps:

I will let you know after doing the above.

poornima-krishnasamy commented 1 year ago

Hi @ravmoj , while doing load testing with ab on the test cluster, I can see slow responses in general for 1-2% of traffic, so I'm not sure this is specific to the JWKS responses. I was able to get better response times after increasing the pod replicas.

I think we haven't tested this before. Can you please increase the number of replicas of the hmpps-auth pods and observe the responses? A sketch of the change is below.
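A hedged sketch of what that looks like; the deployment and namespace names are placeholders, and if the replica count is managed in helm values it should be changed there instead so it isn't reverted on the next deploy:

```bash
kubectl -n <hmpps-auth-dev-namespace> scale deployment <hmpps-auth-deployment> --replicas=4
kubectl -n <hmpps-auth-dev-namespace> get pods -w
```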

Thanks, Poornima.

poornima-krishnasamy commented 1 year ago

Also, I have tested changing the ModSecurity configuration by turning off SecRuleEngine and SecAuditEngine (SecRuleEngine Off and SecAuditEngine Off), but the percentage of slow responses is the same.
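For reference, those directives can also be applied per ingress via the ingress-nginx modsecurity-snippet annotation; a sketch only, since the test above was done in the controller config and the names here are placeholders:

```bash
kubectl -n <namespace> annotate ingress <ingress-name> --overwrite \
  nginx.ingress.kubernetes.io/modsecurity-snippet='SecRuleEngine Off
SecAuditEngine Off'
```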

So it looks like the ModSecurity configuration makes no difference to performance.

poornima-krishnasamy commented 1 year ago

Hi all, I have spun up a new modsec controller in the live cluster, modsec-02. Can you please point any of your non-prod hmpps-auth ingresses to the new modsec ingressClass modsec-02?

Also, please capture some initial numbers on performance, the % of slow responses, etc. Once that is done and recorded, let us know here. The changes we discussed making to the ingress-controller config are SecResponseBodyAccess Off and SecAuditEngine Off; please check the numbers after each config change. A sketch of pointing an ingress at the new class is below.
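A hedged sketch of pointing a non-prod ingress at the new controller class; the names are placeholders, and in practice the change would go through the namespace's source-controlled ingress manifest:

```bash
kubectl -n <hmpps-auth-dev-namespace> patch ingress <hmpps-auth-ingress> \
  --type merge -p '{"spec":{"ingressClassName":"modsec-02"}}'
```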

jaskaransarkaria commented 1 year ago

Potential leads around conntrack entries exceeding the kernel limit and connections being dropped:

https://thanos.live.cloud-platform.service.justice.gov.uk/graph?g0.expr=(node_nf_conntrack_entries%2Fnode_nf_conntrack_entries_limit)%20*%20100&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D

https://thanos.live.cloud-platform.service.justice.gov.uk/graph?g0.expr=node_nf_conntrack_stat_drop&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D

https://thanos.live.cloud-platform.service.justice.gov.uk/graph?g0.expr=node_nf_conntrack_stat_invalid&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D

https://deploy.live/blog/kubernetes-networking-problems-due-to-the-conntrack/
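For a quick node-level view of the same counters the Thanos queries above expose (run on a node or from a privileged debug pod; purely illustrative):

```bash
# Current conntrack entries vs the kernel limit.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Per-CPU conntrack stats (hex), including the drop and invalid counters.
head -2 /proc/net/stat/nf_conntrack
```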

ravmoj commented 1 year ago

After the ModSecurity update on 10/07/2023:
Period: 11/07/2023 11:00 to 19/07/2023 11:00
Total requests: 154743
74 took more than half a sec
117 took more than 250 milliseconds
0.0478% of requests took more than half a second to serve 'GET /auth/.well-known/jwks.json'
0.0756% of requests took more than 250 milliseconds to serve 'GET /auth/.well-known/jwks.json'

Azure App Insights link