guardian / play-secret-rotation

Rotate your Application Secret on an active cluster of Play app servers

`CSRFTokenSigner` freezes its copy of the secret on startup, leads to unpredictable CSRF token rejection #445

Closed rtyley closed 6 months ago

rtyley commented 6 months ago

Investigating https://github.com/guardian/ophan/issues/5970, which was seeing many of these errors:

[CSRF] Check failed because no token found in headers for /tester/device-authorisation/process

...it's become clear that play-secret-rotation, which overrides the crucial RequestFactory component to adopt secret-rotating behaviour with RotatingSecretComponents:

https://github.com/guardian/play-secret-rotation/blob/4c96b9294e30f231713a79a0cb449668c81c7eba/play/play-v27/RotatingSecretComponents.scala#L22-L23

...doesn't also override CSRFTokenSigner.

Consequently, if app servers in a cluster have started at different times, before and after a secret rotation, as in this case (secret rotations annotated at 2024-03-20 08:24:32Z, 14:24:32Z and 20:24:32Z) with the Ophan Dashboard:

image

...then different servers within a Play app server cluster will be using different secrets for signing the CSRF token, and will consequently reject each other's tokens. If a user makes a form-POST-ing request, it's down to chance whether their POST is received by a Play server instance using the same secret the form was created with, and so down to chance whether it fails.
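The failure mode can be illustrated with a minimal sketch. This is not Play's `CSRFTokenSigner` implementation: `FrozenSigner`, `ClusterDemo`, the secret strings and the `signature-token` format are all hypothetical, chosen only to show what happens when two servers each capture a different secret at startup.

```scala
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// Hypothetical stand-in for a signer that captures its secret once, at construction.
final class FrozenSigner(secret: String) {
  private def hmac(message: String): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes(UTF_8), "HmacSHA256"))
    mac.doFinal(message.getBytes(UTF_8)).map("%02x".format(_)).mkString
  }

  // Signed form is "<hex signature>-<token>" (an illustrative format only).
  def sign(token: String): String = s"${hmac(token)}-$token"

  def verify(signed: String): Boolean = signed.split("-", 2) match {
    case Array(signature, token) =>
      // Constant-time comparison of the expected and supplied signatures.
      MessageDigest.isEqual(hmac(token).getBytes(UTF_8), signature.getBytes(UTF_8))
    case _ => false
  }
}

object ClusterDemo {
  // Two servers in the same cluster, started either side of a rotation.
  val serverA = new FrozenSigner("secret-before-rotation")
  val serverB = new FrozenSigner("secret-after-rotation")

  // Server A renders the form, so the token is signed with A's frozen secret.
  val tokenFromA: String = serverA.sign("csrf-token-value")
}
```

Here `ClusterDemo.serverA.verify(ClusterDemo.tokenFromA)` succeeds while `ClusterDemo.serverB.verify(ClusterDemo.tokenFromA)` fails, so the outcome of the POST depends only on which instance the load balancer happens to pick.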

SHession commented 6 months ago

To be clear about the cause of the issue. Was the CSRFTokenSigner failing to validate old keys or rotate them to new ones or both?

rtyley commented 6 months ago

> To be clear about the cause of the issue. Was the CSRFTokenSigner failing to validate old keys or rotate them to new ones or both?

The second one - CSRFTokenSigner was never updating the secret it was using to sign or verify tokens. The secret it used was entirely dependent on when that EC2 instance started up.
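For contrast, here is a rough sketch of a signer that looks up the secret on every call rather than once at startup. It is an assumption-laden illustration, not the fix that was applied in play-secret-rotation: `SecretSnapshot`, `LiveSigner` and `RotationDemo` are hypothetical names, and accepting the previous secret during verification is a design choice assumed here for the example.

```scala
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// Hypothetical snapshot: the secret to sign with, plus older secrets still accepted.
final case class SecretSnapshot(active: String, stillAccepted: List[String])

// Hypothetical signer: asks for the current secrets on every call, never caching them.
final class LiveSigner(currentSecrets: () => SecretSnapshot) {
  private def hmac(secret: String, message: String): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes(UTF_8), "HmacSHA256"))
    mac.doFinal(message.getBytes(UTF_8)).map("%02x".format(_)).mkString
  }

  // Always signs with whichever secret is active right now.
  def sign(token: String): String = s"${hmac(currentSecrets().active, token)}-$token"

  // Accepts a signature made with the active secret or any still-accepted older one.
  def verify(signed: String): Boolean = signed.split("-", 2) match {
    case Array(signature, token) =>
      val snapshot = currentSecrets()
      (snapshot.active :: snapshot.stillAccepted).exists { secret =>
        MessageDigest.isEqual(hmac(secret, token).getBytes(UTF_8), signature.getBytes(UTF_8))
      }
    case _ => false
  }
}

object RotationDemo {
  // Mutable holder standing in for whatever periodically fetches the latest secret.
  @volatile var snapshot: SecretSnapshot = SecretSnapshot("secret-1", Nil)
  val signer = new LiveSigner(() => snapshot)

  val signedBeforeRotation: String = signer.sign("csrf-token-value")
  snapshot = SecretSnapshot("secret-2", List("secret-1")) // a rotation happens
  val signedAfterRotation: String = signer.sign("csrf-token-value")
}
```

In this sketch the same long-running `signer` verifies tokens signed both before and after the rotation, because the secret it uses no longer depends on when the instance started.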