trezor / trezor-firmware — Trezor Firmware Monorepo
https://trezor.io

Optiga not responding debugging #3739

Closed — onvej-sl closed this issue 3 months ago

onvej-sl commented 4 months ago

It seems to me that I've found the reason why Optiga sometimes does not respond to a request, which, in my experience, happens once every few hundred to few thousand commands.

I don't have an explanation for it, but I experimentally found out that this is related to the random waits that are called when receiving or sending a USB packet (see for example here).

I haven't experienced any such error since I removed the random waits, and I can artificially induce the error by increasing the wait time.

You can try it yourself; see this branch. The branch (1) logs some debugging information and (2) lets you adjust the wait time in the prodtest at runtime. If I set the wait time to 0 (SET WAIT TIME 0), I don't experience any errors. If I set the wait time to 2000, I experience this error. If I set the wait time to 5000, I experience this or this error.

The random waits serve as a protection against fault injection attacks. For that reason, they can be removed from the prodtest without worries, but not from the firmware. Furthermore, the firmware calls the random waits not only when sending and receiving USB packets but also at regular intervals (see this).
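
For context, a minimal sketch of what such a random wait might look like, assuming a busy-wait loop driven by a CSPRNG byte (the loop shape matches the snippet quoted later in this thread; the exact firmware code may differ):

    #include <stdint.h>

    extern uint8_t drbg_random8(void);  // CSPRNG byte source (see timings below)

    // Illustrative sketch only, not the actual firmware implementation.
    // Busy-waits for a small random number of iterations so the timing of
    // the surrounding code is unpredictable to a fault-injection attacker.
    static void wait_random(void) {
      volatile uint32_t delay = drbg_random8();  // random iteration count
      for (volatile uint32_t i = 0; i < delay; i++) {
        __asm__ volatile("nop");
      }
    }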

@andrewkozlik @TychoVrahe @hiviah

TychoVrahe commented 4 months ago

Any specific command I should be sending to reproduce? Trying with SET WAIT TIME 5000 and SEC READ, I see no errors.

andrewkozlik commented 4 months ago

I would certainly remove random waits and random delay interrupts from prodtest ASAP if it helps avoid issues with Optiga. I recall seeing some communication failures during production that were causing devices to be removed from the production line.

That being said, I am a little surprised that this could be due to USB communication going on while communicating with Optiga. I am not sure what it could be. Maybe some sort of keepalive packets?

Random waits and random delay interrupts operate completely differently. Random waits are explicitly used near sensitive code to prevent fault injection, namely USB code and storage code, I think around the counters when unlocking storage, for example. Random delay interrupts happen at regular intervals and I think they should take between 30 and 540 clock ticks. The two are also implemented differently. So I'd recommend first trying to figure out which of the two is causing the issue.
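
To make the distinction concrete, here is a simplified sketch of how the two mechanisms are invoked (illustrative only; rdi_handler is the real handler name used later in this thread, the other function names are stand-ins):

    extern void wait_random(void);   // explicit random wait
    extern void rdi_handler(void);   // random delay interrupt body

    // 1. Random waits: called explicitly next to sensitive code,
    //    e.g. USB packet handling and storage unlock counters.
    void usb_handle_packet(void) {
      wait_random();
      // ... process the packet ...
    }

    // 2. Random delay interrupts: fired at regular intervals from the
    //    systick interrupt, each taking roughly 30-540 clock ticks.
    void SysTick_Handler(void) {
      // ... regular tick bookkeeping ...
      rdi_handler();
    }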

What are the units of the WAIT TIME mentioned above? Is it ms, microseconds or clock ticks? If it's ms, then I wouldn't be surprised that the communication is failing for such values.

BTW, can we disable RDI while communicating with Optiga, at least as a hotfix, in case we are unable to get to the bottom of the issue in reasonable time?

onvej-sl commented 4 months ago

any specific command i should be sending to reproduce?

For example, CERTDEV READ triggers the error.

onvej-sl commented 4 months ago

So I'd recommend first trying to figure out which of the two is causing the issue.

We discussed it in person with Andrew and the conclusion is that random delay interrupts are already disabled in prodtest.

andrewkozlik commented 4 months ago

One unit of the WAIT TIME mentioned above corresponds to just a few clock cycles, i.e. one iteration of this loop:

  for (i = 0; i < random_delays_wait_time; i++) {
    __asm__("nop");
  }

So I think one unit is less than a millisecond.
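
For rough intuition, plugging the values used above into the per-unit timing measured later in this thread (wait ≈ 0.373 + 0.0059 · delay µs):

    wait(2000) ≈ 0.373 + 0.0059 · 2000 ≈ 12.2 µs
    wait(5000) ≈ 0.373 + 0.0059 · 5000 ≈ 29.9 µs

So even WAIT TIME 5000 adds only about 30 µs per call; any impact would come from the calls recurring, not from a single wait being long.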

onvej-sl commented 3 months ago

I was trying to find out if there is a single call of wait_random() that is responsible for all the errors. I conducted several tests, which indicate that this call is most likely the cause of all the errors.

I don't quite understand USB, but according to this wiki page (search for "SOF"), usb_class_sof is probably responsible for handling a synchronization packet every millisecond.
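
If that is right, the suspected interaction looks roughly like this (a sketch under that assumption; the handler body is illustrative, not the actual firmware code):

    // The USB host emits a start-of-frame (SOF) token every 1 ms; the
    // device-side SOF handler calls wait_random(). An Optiga I2C
    // transaction spanning several milliseconds therefore absorbs
    // several random delays.
    static void usb_class_sof(void) {
      wait_random();  // fires once per millisecond
      // ... SOF bookkeeping ...
    }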

andrewkozlik commented 3 months ago

I see the following next steps:

  1. Either remove wait_random() from usb_class_sof() or from all places which deal with USB_IFACE_TYPE_VCP.
    • The motivation is to prevent the VCP debug interface from breaking Optiga communication when running hardware device tests. Removing it from usb_class_sof() seems sufficient, but perhaps we can be more aggressive, since VCP isn't used in production firmwares anyway. At the same time I don't want to remove wait_random() from debug firmwares completely, because it can help us detect different problems early in the testing process.
    • IIUC, the motivation behind wait_random() in USB is so that an attacker can't easily trigger a particular firmware action at an exact moment in time. That shouldn't be an issue for VCP. Do you have any opinion about this @prusnak, given that you were the one who introduced the wait_random() calls in 1ec5d091fd?
  2. Disable wait_random() in prodtest completely to avoid potential future issues. It serves no purpose there (one possible shape of this change is sketched after this list).
  3. Better understand why this is a problem for communication with Optiga, so that we can catch other potential issues which could break communication with it. This may still come back to bite us due to the random delay interrupts in normal firmware.
    • I asked @onvej-sl to look into the question of whether the random number generation and TRNG reseeding calls could be introducing an extra delay that triggers the issue.
    • Perhaps @TychoVrahe could help us understand why what looks to be a relatively small delay breaks Optiga communication.
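
As referenced in item 2, one possible shape of the prodtest change, assuming a hypothetical build flag (the real build system may wire this differently):

    // Compile the countermeasure out of prodtest, where it serves no
    // purpose, while keeping it in debug and production firmware.
    #ifdef TREZOR_PRODTEST                 // hypothetical build flag
    static inline void wait_random(void) { /* no-op in prodtest */ }
    #else
    void wait_random(void);                // real random delay, defined elsewhere
    #endif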

prusnak commented 3 months ago

That shouldn't be an issue for VCP. Do you have any opinion about this @prusnak

I am fine with removing wait_random() from VCP since VCP is not part of the production firmware.

TychoVrahe commented 3 months ago

Looked into it today; unfortunately I still don't have an explanation. But I am able to observe the problem and found alternative ways to trigger it (by stopping the MCU with the debugger right before attempting to communicate with Optiga). Will continue investigating next week.

TychoVrahe commented 3 months ago

I did more testing and this is what i found:

Additionally, Optiga switches itself to sleep mode after 20 ms of inactivity. It should wake up right after it detects its address on the I2C bus, but apparently the first command after the wake-up is lost. This should be mitigated by the repeated optiga_ensure_ready calls, but if there is more than a 20 ms delay after this ensure, it can also lead to a range of errors. I don't think this is currently causing problems, but it is not very robust and we need to be aware of it. The delay to sleep can be adjusted if we need to.
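
A sketch of the fragile window described above (optiga_ensure_ready is the real helper named here; the other names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    typedef int optiga_result;               // simplified stand-in
    extern void optiga_ensure_ready(void);   // wakes the chip if asleep
    extern optiga_result optiga_send_command(const uint8_t *cmd, size_t len);  // illustrative

    optiga_result optiga_transact(const uint8_t *cmd, size_t len) {
      optiga_ensure_ready();  // wake the chip if it went to sleep
      // If more than ~20 ms pass here (e.g. due to random delays), Optiga
      // may fall asleep again and the command below is silently lost.
      return optiga_send_command(cmd, len);
    }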

Solution:

andrewkozlik commented 3 months ago

Great job!

CERTDEV READ command produces I2C messages that take up to 14.6 ms to communicate over the 200 kHz bus

We already bumped into this once over here: https://github.com/trezor/trezor-firmware/pull/3186#issuecomment-1656302925. My estimate was that at 200 kHz it should take 11 ms, so 15 ms seemed like a reasonable limit. What I2C_TIMEOUT do you think would be right? 20 ms, 30 ms? @onvej-sl, do we have any estimate of how long the entire systick interrupt can take in the worst case, i.e. not just the delay itself, but also the possible PRNG call and the even less likely reseeding? Maybe the easiest thing would be to measure it experimentally.

An updated summary of next steps:

Here is a commit for the first three items https://github.com/trezor/trezor-firmware/commit/e4b46e272a3e1fb7ec9ebc65b86a8e403d31c2f9. I will wait for the I2C_TIMEOUT analysis before creating a PR.

TychoVrahe commented 3 months ago

There is likely even more going on.

The 277-byte read should take about 12.5 ms (9 clock cycles per byte, as there is some overhead). This is far enough from the 15 ms limit. Measuring this with a logic analyzer confirms it, and I am no longer able to reproduce the 14.6 ms measurement; I am not sure whether I misread it or something changed.
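
As a sanity check on that figure (9 SCL clocks per byte ≈ 8 data bits + ACK, plus framing overhead):

    277 bytes × 9 clocks/byte = 2493 clocks
    2493 clocks / 200 kHz ≈ 12.5 ms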

With WAIT=5000 the timeout is still triggered, but the different error observed with WAIT=2000 does not seem to be caused by the timeout.

This is the error triggered with WAIT=2000:

    if ((frame_buffer[0] & I2C_STATE_BYTE1_BUSY) == 0) {
      // Optiga has no response ready and is not busy. This shouldn't happen if
      // we are expecting to read a response, but Optiga occasionally fails to
      // give any response to a command.
      vcp_println("Error: OPTIGA_ERR_UNEXPECTED, File: %s, Line: %d", __FILE__,
                  __LINE__);
      return OPTIGA_ERR_UNEXPECTED;
    }

When this check is removed (thus the read is allowed to be retried up to MAX_RETRY_READ times), Optiga actually gives the response on the next try, mitigating the problem.

I still do not understand why it fails to give a response on the first try.

Was there any specific reason to abort the read when this happens, though? Otherwise, I think it would be helpful to allow the retry.

andrewkozlik commented 3 months ago

optiga actually gives the response on next try

Interesting, I probably never tried reading again, because if it has no response and is not busy, then there is nothing to wait for. Clearly something went wrong, I assumed.

Was there any specific reason to abort the reading when this happens though?

I don't recall. Probably just my rationale that this is just an error and nothing can be done about it.

BTW, we should also tweak the MAX_RETRY_READ constant. I think it's extremely large, because I had no clue what to set it to. Given that the longest time it takes to execute a non-RSA command is 135 ms (+ some overhead), we can probably use that as a guide. Nevertheless I am still not sure what it should be.

andrewkozlik commented 3 months ago

optiga actually gives the response on next try

Here is a commit to retry MAX_RETRY_READ_NOT_BUSY = 10 times: 99dfed01ab84da7712ecfc83284e06e4ae4137a5

andrewkozlik commented 3 months ago

Perhaps we should replace MAX_RETRY_READ with MAX_RETRY_READ_MS and monitor hal_ticks_ms() instead of counting retries. Done here: 40ce44ef95b10c555384296752d473ae19451840. I chose 300 ms. It's not terribly long, and is more than twice 135 ms.
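
A sketch of the time-based retry (hal_ticks_ms and MAX_RETRY_READ_MS per the commit; the helper and error codes are illustrative stand-ins):

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_RETRY_READ_MS 300  // > 2 x the 135 ms worst-case command time

    typedef int optiga_result;            // simplified stand-in
    #define OPTIGA_ERR_BUSY (-1)          // illustrative error codes
    #define OPTIGA_ERR_TIMEOUT (-2)
    extern uint32_t hal_ticks_ms(void);   // millisecond tick counter
    extern optiga_result optiga_try_read(uint8_t *buf, size_t len);  // illustrative

    optiga_result optiga_read_response(uint8_t *buf, size_t len) {
      uint32_t deadline = hal_ticks_ms() + MAX_RETRY_READ_MS;
      do {
        optiga_result ret = optiga_try_read(buf, len);
        if (ret != OPTIGA_ERR_BUSY) {
          return ret;  // success or a hard error
        }
      } while (hal_ticks_ms() < deadline);
      return OPTIGA_ERR_TIMEOUT;  // still busy after the time budget
    }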

onvej-sl commented 3 months ago

do we have any estimate of how long the entire systick interrupt can take in the worst case scenario, i.e. not just the delay itself, but also the possible PRNG and the even less likely reseeding?

I tried to measure it. In the worst case, rdi_handler takes around 0.2 ms = 200 µs ≈ 0.373 + 0.0059 · 512 + 2 · 0.4 + 154 + 42 µs.

By the way, drbg_generate is called at least once every 64 ms and drbg_reseed at least once every 64 s. When one of these happens, the duration of the wait or of drbg_random8 itself doesn't play a significant role by comparison.

    function                               time (µs)
    wait                                   0.373 + 0.0059 · delay
    drbg_random8 (without drbg_generate)   0.4
    drbg_generate (without drbg_reseed)    42
    drbg_reseed                            154
    rdi_handler (worst case)               199

andrewkozlik commented 3 months ago

In https://github.com/trezor/trezor-firmware/pull/3766 I implemented most of the improvements we discussed here. There are two or three remaining items:

  1. The mystery around I2C_TIMEOUT. I am not actually sure whether this was causing a problem or not. Communicating the largest I2C message should take 12.5 ms + some extra time due to the SOF synchronization followed by wait_random() and the rdi_handler. Based on the measurements we have, this should still fit within the 15 ms timeout.

CERTDEV READ command produces I2C messages that take up to 14.6 ms to communicate over the 200 kHz bus; with normal random delays in that procedure, we need to be "unlucky" to get delays long enough to cause the additional 400 µs delay

The timeout is now increased to 25 ms and the wait_random() is removed from SOF synchronization and disabled altogether in prodtest. So I think we can consider this resolved.

2.

Additionally, optiga switches itself to sleep mode after 20ms of inactivity. It should wake up right after it detect its address on i2c bus, but apparently the first command after the wake up is lost. This should be mitigated by repeated optiga_ensure_ready calls, but if there is more than 20ms delay after this ensure, it can also lead to range of errors. I don't think this is currently causing problems, but it is not very robust and we need to be aware of this. The delay to sleep can be adjusted if we need to.

3.

We can reconfigure the communication to use DMA and interrupts, so it will run in the background and will not be affected by any such random delays. I prototyped this and it helps, but a clean solution needs more work: designing a sensible API and redoing the I2C drivers for both STM32F4 and STM32U5 (so not something I will be able to push fast).
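
For illustration, the DMA direction could look roughly like this with the stock STM32 HAL (a sketch only; the real rework would define its own driver API, and OPTIGA_I2C_ADDR is a placeholder):

    #include "stm32u5xx_hal.h"  // or stm32f4xx_hal.h on STM32F4
    #include <stdbool.h>

    #define OPTIGA_I2C_ADDR 0x30  // placeholder 7-bit device address
    extern I2C_HandleTypeDef hi2c1;
    static volatile bool i2c_done;

    // Completion callback, invoked from the I2C/DMA interrupt.
    void HAL_I2C_MasterRxCpltCallback(I2C_HandleTypeDef *hi2c) {
      i2c_done = true;
    }

    HAL_StatusTypeDef optiga_read_dma(uint8_t *buf, uint16_t len) {
      i2c_done = false;
      // The transfer runs in the background via DMA; a random delay in a
      // systick handler no longer stretches the bus timing.
      HAL_StatusTypeDef ret =
          HAL_I2C_Master_Receive_DMA(&hi2c1, OPTIGA_I2C_ADDR << 1, buf, len);
      if (ret != HAL_OK) {
        return ret;
      }
      while (!i2c_done) {
        // wait for completion (the CPU is free to do other work here)
      }
      return HAL_OK;
    }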

@TychoVrahe do you want to create separate issues for the two remaining points and close this issue?

TychoVrahe commented 3 months ago

Created https://github.com/trezor/trezor-firmware/issues/3839 and https://github.com/trezor/trezor-firmware/issues/3840 and closing this one.