tpm2-software / tpm2-openssl

OpenSSL Provider for TPM2 integration
BSD 3-Clause "New" or "Revised" License
82 stars 36 forks source link

Error: Esys called in bad sequence #118

Open Danigaralfo opened 1 month ago

Danigaralfo commented 1 month ago

Hello,

We have encountered an issue in how the provider behaves when a process spawns two or more threads that access the TPM. In our case, this process was an Apache server that uses the TPM as a generator/storage for the private key. We have built the provider with debug mode enabled as well as activated the tpm2-tss traces.

Depending on the browser, this issue appears or not: with Firefox does not appear, but with Chrome/Edge does. This is because Firefox only sends one request, while Chrome/Edge send two (spawning three threads in the Apache process, one for key loading, and two for request processing).

You can see the logs in chrome_logs.txt. The flow would be as follows (In a general way. The order is not exactly like this always, as one thread could have started the TPM operations before the other one gets even started, but you get the point):

  1. The two requests are received, so the Apache process spawns the three mentioned threads.
  2. One stand-alone thread does some RSA operations, probably because it's re-generating the private key with the context saved in the private key file.
  3. When the key has being loaded by this stand-alone thread, the two other threads start the random number generation:
  4. When the two threads start the operations, the second one fails (lines 16 to 36 in chrome_logs.txt), returning an ESAPI async sequence error.
  5. The Apache server returns an internal SSL error, and the browser does not load the web page, even when the first spawned thread finish flawlessly.

This behaviour could also be observable in the signs operations. Seeing this, we thought that this could be a synchronization problem between the two threads accessing the TPM. This shouldn't be a problem, since the abrmd is a thing, but the daemon manages the access to the TPM at a process level (it uses the Dbus): it delegates the thread access to the developer.

In order to debug more, we tried to address the issue. We found some problems with the RAND operations order, so we ended up deactivating the rand operations in OpenSSL with the provider. We did as follows:

  1. Deactivated the RAND operations in OpenSSL configuration.
  2. Added some mutex to control the execution of the threads. I can provide the file with the mutexes if required.
  3. Changed the traces of the provider, so we could be able to see which thread was doing the TPM operations and which mutex was being locked. Built the modified provider, load it into OpenSSL and restart the Apache process.
  4. Make the requests with the browser that was failing before.

You can see the logs of this execution in chrome_sign_mutex_logs.txt. Now, the execution would go as follows (I will skip the rand generation since it is not being done in the TPM now):

  1. The two requests are received, so the Apache process spawns the two mentioned threads.
  2. One stand-alone thread does some RSA operations, probably because it's re-generating the private key with the context saved in the private key file.
  3. When the key has being loaded by this stand-alone thread, the first thread that gets into the sign operations claims the mutex twice (since DIGEST_INIT is done twice).
  4. The thread with the mutex start the signs operations. When SIGN DIGEST_SIGN estimate is called, the mutex is freed once.
  5. When the SIGN DIGEST_SIGN is done, the mutex is freed totally. Then, the other thread makes the same process.
  6. With all operations done, the Apache server returns the webpage and the browser displays it.

With this, the problem seems to be "solved", at least for the sign operations. We're looking into it more deeply, in order to make a better description of the problem, but some help may be useful.

Any suggestion is appreciated. Thanks.

chrome_logs.txt chrome_sign_mutex_logs.txt

Danigaralfo commented 1 month ago

To add more information, we're using the following versions:

Danigaralfo commented 1 month ago

We have found the issue. Long story short: the threads are using the same context at the same time. The provider should protect the Esys operations or treat them as atomic operations.

Let's take the random operations as an example. In the TSS function Esys_GetRandom_Async, the esys_ctx->state variable is set to ESYS_STATE_INTERNALERROR by default:

/* Check context, sequence correctness and set state to error for now */
    if (esysContext == NULL) {
        LOG_ERROR("esyscontext is NULL.");
        return TSS2_ESYS_RC_BAD_REFERENCE;
    }
    r = iesys_check_sequence_async(esysContext);
    if (r != TSS2_RC_SUCCESS)
        return r;
    esysContext->state = ESYS_STATE_INTERNALERROR;

If the second thread starts its execution after the first one set this variable to ESYS_STATE_INTERNALERROR, iesys_check_sequence_async(esysContext) will return an error. An example (the state lines were added by me in the code to check the value of esys_ctx->state):

[Tue Jul 16 13:34:56.371358 2024] [ssl:info] [pid 23727:tid 4142355392] [client 192.168.208.186:64050] AH01964: Connection to child 0 established (server 127.0.1.1:443)
RAND GET_CTX_PARAMS [ state ]
RAND NEW
RAND GET_CTX_PARAMS [ max_request ]
RAND GENERATE
[Tue Jul 16 13:34:56.374992 2024] [ssl:info] [pid 23727:tid 4121949120] [client 192.168.208.186:64051] AH01964: Connection to child 1 established (server 127.0.1.1:443)
RAND GET_CTX_PARAMS [ state ]
trace:esys:src/tss2-esys/api/Esys_GetRandom.c:228:Esys_GetRandom_Finish() context=0x1a5c6d8, randomBytes=0xf6e74a3c
[Tue Jul 16 13:34:56.375839 2024] [ssl:debug] [pid 23727:tid 4121949120] ssl_engine_kernel.c(2425): [client 192.168.208.186:64051] AH02645: Server name not provided via TLS extension (using default/first virtual host)
State before state-check in Esys_GetRandom_Finish(): 1 (tid: 4142355392)
RAND NEW
RAND GET_CTX_PARAMS [ max_request ]
RAND GENERATE
trace:esys:src/tss2-esys/api/Esys_GetRandom.c:140:Esys_GetRandom_Async() context=0x1a5c6d8, bytesRequested=0020
State before state-check in Esys_GetRandom_Async(): 3 (tid: 4121949120)
ERROR:esys:src/tss2-esys/esys_iutil.c:1145:iesys_check_sequence_async() Esys called in bad sequence.
ERROR:esys:src/tss2-esys/api/Esys_GetRandom.c:74:Esys_GetRandom() Error in async function ErrorCode (0x00070007)