smallstep / certificates

🛡️ A private certificate authority (X.509 & SSH) & ACME server for secure automated certificate management, so you can use TLS everywhere & SSO for SSH.
https://smallstep.com/certificates
Apache License 2.0
6.77k stars 442 forks source link

[Bug]: Step SSH CA getting "commands out of sync. You can't run this command now" error while storing tokens in a MYSQL backend. #1189

Open nikhilk1701 opened 1 year ago

nikhilk1701 commented 1 year ago

Steps to Reproduce

Out current setup:

  1. We have three step-ca servers running off the same mysql backend.
  2. The CA servers are configured to use keycloak as an OIDC provisioner.

Steps to reproduce:

  1. step ssh login <keycloak-user-email-id>
  2. Login into keycloak via the web interface the step client opens.
  3. After logging in, the webpage says token can be found in the command line but the terminal says The certificate authority encountered an Internal Server Error. Please see the certificate authority logs for more info. Re-run with STEPDEBUG=1 for more info.
  4. Logs from the CA server say

time="2022-11-21T05:41:36Z" level=error duration=2.102361ms duration-ns=2102361 error="authority.Authorize: authority.authorizeSSHSign: failed when attempting to store token: error storing used token used_ott/: commands out of sync. You can't run this command now" fields.time="2022-11-21T05:41:36Z" method=POST name=ca ott= path=/ssh/sign protocol=HTTP/2.0 referer= remote-address= request-id=cdtgv43l6md73t0rqkhg size=148 status=500 user-agent="Smallstep CLI/0.22.0 (darwin/amd64)" user-id=

MYSQL documentation says the error commands out of sync. You can't run this command now occurs when in the client code, you are calling client functions in the wrong order - reference

Your Environment

Expected Behavior

That the CA server signs the users key without any error.

Actual Behavior

After the OIDC provisioner authentication step is done, the CA is erroring out when it's trying to communicate with it's mysql backend in the process of signing the authenticated user's public key.

Additional Context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

dopey commented 1 year ago

Hey @nikhilk1701 👋. Thanks for opening the issue!

I'll look into this, but was hoping we could get a little more context on how the errors are coming about.

  1. Does this error happen every time you make this request? Or is it transient?
  2. Does it matter how long the CAs have been running? Is this happening right after they boot up, or do they need to be running for a little while before the errors start happening.
  3. I notice you're running step-ca in HA mode (3 instances connected to one backend). There are certain commands for which that could be an issue, but this isn't one of the ones where I might expect issue. Do you know if the same error occurs if you only run one instance?

Answers to these questions will help narrow down our search. My gut tells me there's an issue in the mysql client code in smallstep/nosql -- but I want to rule out other possibilities before digging in there. Also, if the issue is in the MySQL client that would be surprising since it's one of the database integrations that's been around the longest; would have expected an issue like this to come to light sooner.

nikhilk1701 commented 1 year ago

Hey @dopey 👋. Thanks for picking up the issue. To answer your questions,

  1. This is a transient error, occurring intermittently. I have observed this happing three times in a row at max. We also noticed the frequency of these errors increasing with age of the instances.
  2. While trying to reproduce the error, I have noticed that the issue is popping up as soon as I increased the number of instances from one to three. I have observed this shift earlier too, when we are first setting up HA with step-ca:v0.22.0.
  3. I could not reproduce this error with single instance.

To give you some more context, our mysql backend is a 3-node HA setup exposed via a mysql router.

dopey commented 1 year ago

Heyo, sorry for the radio silence.

Had another chance to discuss with the team and came up with a couple more questions. I'll add those at the bottom.

But I had a different question first: did you purposefully remove your OTT from the error you pasted in the first comment? error="authority.Authorize: authority.authorizeSSHSign: failed when attempting to store token: error storing used token used_ott/: commands out of sync. You can't run this command now". Here's a bit of code that wraps the error: https://github.com/smallstep/certificates/blob/5637e4ac655c2f281af8f6aff414d144301b1ec2/db/db.go#L403-L404. It should be printing the table name followed by the token but in your pasted error the token is empty. That behavior is concerning - unless you purposefully removed the token manually before posting.

A bit of background before more questions. We have integrations with small number of databases and each of these integrations implements the same generalized API. The error in smallstep/certificates is coming from here: https://github.com/smallstep/certificates/blob/5637e4ac655c2f281af8f6aff414d144301b1ec2/db/db.go#L401. And here is the relevant code in our mysql implementation https://github.com/smallstep/nosql/blob/master/mysql/mysql.go#L151-L194. You'll see that a transaction is being created, executed, and then committed. This should be an atomic operation, so it's unclear to me how we could be getting an "operations out of sync" error.

Now to the questions:

  1. Is there any additional stack trace from step-ca. I can see where the error is coming from in step-ca but not in the smallstep/nosql client library. Most of the errors in the client library are wrapped - but this one isn't. There are a couple unwrapped errors I can see in the cmpAndSwap code, but it feels strange for any of those to be triggering an error. A stack trace would help with identifying the exact line the error is coming from.
  2. Are there any errors in the mysql log? This seems like a client error, but doesn't hurt to check.
  3. Can you be more precise about your mysql setup and specifically the router? What kind of router is it and what's the version?

thanks!

dopey commented 1 year ago

Based on the fact the error isn't wrapped - I would guess it's coming from here: https://github.com/smallstep/nosql/blob/a0862c983a021ffb45d3d4a0c189104f1f84089b/mysql/mysql.go#L181-L185.

nikhilk1701 commented 1 year ago

Hey, I've purposefully removed the token from the error message, I was being careful(in a non-sensical way maybe). The actual error message has the token printed and looks like this -

time="2022-10-20T09:37:08Z" level=error duration=84.693869ms duration-ns=84693869 error="authority.Authorize: authority.authorizeSSHSign: failed when attempting to store token: error storing used token used_ott/e54adbf18c1e6bcc9aee680f1e4ba458248b799ec2a98b3a6c2aa285e7a16c87: commands out of sync. You can't run this command now" fields.time="2022-10-20T09:37:08Z" method=POST name=ca ott=eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJrZEpHMkJXRUNhNlJBZHNMM2w3X1NtSFYyRVdPLUZ3Tjg0YlpBZ29mTVY0In0.eyJleHAiOjE2NjYyNTg5MjcsImlhdCI6MTY2NjI1ODYyNywiYXV0aF90aW1lIjoxNjY2MjU4NTYzLCJqdGkiOiJiYzY5MDM5YS0yYTc0LTRkZjQtOGNjNS1lMTNlMjViZjk2NTIiLCJpc3MiOiJodHRwczovL3Nzby5hbHBob25zby50di9yZWFsbXMvQWxwaG9uc28iLCJhdWQiOiJzdGVwLWNhIiwic3ViIjoiODM4MThmYmMtYjc1YS00NzdmLWJiMDktNDg5YmU4YjMxYmRjIiwidHlwIjoiSUQiLCJhenAiOiJzdGVwLWNhIiwibm9uY2UiOiJlNTRhZGJmMThjMWU2YmNjOWFlZTY4MGYxZTRiYTQ1ODI0OGI3OTllYzJhOThiM2E2YzJhYTI4NWU3YTE2Yzg3Iiwic2Vzc2lvbl9zdGF0ZSI6ImE2ZDVhZWU0LTg2MTItNDdmNi1hN2Q3LTRkODU0NGZkODYxOCIsImF0X2hhc2giOiJqSjhmQ2lHTWthTGtOMkZPRkFxcGFBIiwiYWNyIjoiMCIsInNpZCI6ImE2ZDVhZWU0LTg2MTItNDdmNi1hN2Q3LTRkODU0NGZkODYxOCIsImVtYWlsX3ZlcmlmaWVkIjp0cnVlLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJuaWtoaWxrb21taW5lbmlAbGdhZHMudHYiLCJlbWFpbCI6Im5pa2hpbGtvbW1pbmVuaUBsZ2Fkcy50diJ9.WZNo6i-7w8npOW_OeVk2gKIlg4ier59udwom0SXaUxsUz4Weasu_P27jjbxtBkp6NkMNKf19MdWS6Q11kx-JUap11FbFZtym7cQxayPJlhxR61BEPti_evJx5f7UKnKyaaQrS7rwzeedW5JEjrQR0E8BV7aQZPXyhc0BUK6Kf1TRzHR4T3qzyHZBMgrJNFBRop858m3zmEb16lmp7Yv-CHER7MalrYPTKQZezcJ6d5q5h6aehB-i5ngCmZnRj5PbJjOzhxFNcPi9FSoPG82nAFrBHOUU1Rw3ZJW7BZARhsrJg7XGRCaLRbpJmfvOl4yBtXT9ffXZ8DuAmijXxa8_UQ path=/ssh/sign protocol=HTTP/2.0 referer= remote-address=208.85.1.210 request-id=cd8hdh3pi6rc4fev3hd0 size=148 status=500 user-agent="Smallstep CLI/0.22.0 (darwin/amd64)" user-id=

I haven't purposefully removed the token from the database.

To answer your questions:

  1. I couldn't get more logs from step-ca. Setting STEP_DEBUG to 1 didn't help, I was getting the same log messages.
  2. No, mysqld did not log any error messages.
  3. We are using the standard mysql router. Mysql router version: MySQL Community - GPL version 8.0.30
dopey commented 1 year ago

Hey @nikhilk1701 we've been unable to replicate the behavior locally and as such, we're not really able to debug further.

If you find a more consistent way for us to replicate the behavior please let us know.