[rom] Change the base-w representation in SPHINCS+

jadephilipoom commented 7 months ago

Description

Our current SPHINCS+ implementation is from the reference implementaion and compatible with the round 3 NIST competition submission. However, the NIST draft standard for SPHINCS+ incorporates one change that makes it backwards-incompatible with round 3 (some brief discussion on the pqc-forum mailing list here).

This change is implemented in the consistent-basew branch of the reference implementation, but has not been merged into the main branch. The change itself is not hard to implement; it's essentially a one-line edit to an internal routine that flips the endianness of some values. We'd also need to update our test vectors.

To me it's frustratingly unclear if we should implement this or not for EarlGrey PROD. On one hand, it's of course preferable to be compatible with the eventual NIST standard, which presumably is what everyone will standardize on when it comes out. On the other, there's some risk that NIST will decide last-minute on some other backwards-incompatible change before finalizing the standard, and then instead of "compatible with round 3 SPHINCS+" we're stuck in a no-man's-land in between round 3 and the final standard where we're compatible with neither (this is why we didn't implement the change for ES, even though it was announced just barely before ES ROM freeze). Admittedly by now the risk of a new change is probably small; I've not heard of other backwards-incompatible changes being considered, the comments period is closed, and the final standards are expected in 2024. I want to solicit opinions here about which course we should take; perhaps this would be a good security WG discussion topic as well.

CC @moidx @cfrantz @johannheyszl

Edit: This issue is now tracking a few SPHINCS+ updates that I want to get in before ROM freeze, so here's a checklist:

[x] Base-w endianness change (i.e. this change)
[x] Widen the key pair address field (see here -- doesn't affect the parameters we're using but needed for alternate parameters) - will be addressed by https://github.com/lowRISC/opentitan/pull/22894
[x] Compatibility with the new pure/pre-hash domain separators -- we will use the "pure" version, so this is pretty much just prepending some zeroes to the message. NIST allows this to be constructed outside the certification boundary, but since we build the message out of the contents of flash we need to either do it in ROM or play games with the manifest format. We could also allow the signature context to be specified in the manifest, but I think that is probably adding complexity for no reason.
[x] Compatibility with "de-randomized layer" for extra KATs that NIST promised to produce (this doesn't affect functionality, but allows better KAT testing, and having the functions present will make it more direct to test them once the KATs exist)

ballifatih commented 7 months ago

I think the two small changes they are introducing make sense from editorial perspective, yet it is unfortunate that the latter breaks compatibility with the original version from the designers.

Given NIST’s history of making arbitrary changes, I think it is hard to expect no further change. I would favor staying compatible with round 3 SPHINCS+. I am probably biased when it comes to not trusting NIST though.

For the future tape-outs, hopefully we would have the ratified version from NIST to be compatible with.

jadephilipoom commented 7 months ago

Based on discussion from the Security WG, we will plan to make the change to be compatible with the draft standard. Since the draft standard is fixed and is only released once before the final version, the reasoning is that compatibility with the draft standard is about as good as compatibility with the round 3 submission even in the case that NIST does make changes.

johannheyszl commented 6 months ago

@a-pronin @vsukhoml FYI: Summary re SPHINCS+: We will be compliant to the NIST draft standard and additional information from NIST about upcoming changes given at the PQC standardisation workshop today/tomorrow.

jadephilipoom commented 5 months ago

A few more updates to track/record here:

NIST gave their list of modifications for the final SPHINCS+ standard -- nothing there is too concerning.
NIST has announced they'll have a pure and also a pre-hashed version of SLH-DSA. I think we should target the pure version because it's more flexible and cryptographically stronger (see discussion on the thread). Theoretically the "hedged message" with extra domain-separation data can be constructed outside the certification boundary, but since the ROM constructs the message we'll need to include that domain-separation construction in the ROM I think (or be very clever about the manifest format). Luckily NIST has been very specific about what it will look like.
There was a bug applying to the new small parameter sets (not the current NIST standard parameter sets) that was recently fixed on the main repo, and we should apply that fix as well.

vsukhoml commented 5 months ago

Any chances to add support for LMS? The main concern regarding SPHINCS+ is a signature size, which being repeated for ROM_EXT and main firmware noticeably reduce amount of space available, which may be prohibitive for some applications. Another concern that smaller parameter size is not a NIST standard and not clear if it ever will be in foreseeable future.

Also SPHINCS+ verification takes some time, so implementation may take benefits from caching once verified signature for given code and on next verification just change that neither signature nor code didn't change to skip this step.

johannheyszl commented 5 months ago

Hi @vsukhoml, the selection of hash-based signature scheme for the hybrid secure boot option in hard-wired ROM was a longer process where we tried to carefully consider all implications. The statelessness of SPHINCS+ is a major advantage. Maybe we can discuss your specific concerns in our next meeting please. For this ROM, we will unlikely support the smaller SPHINCS+ parameters since the NIST SP is not out yet.

cc @moidx @cfrantz @timothytrippel

jadephilipoom commented 5 months ago

Any chances to add support for LMS?

I agree with Johann, it's far too late in the process for that, and the reasons we had for selecting SPHINCS+ have only gotten more solid since we made that decision.

If I remember correctly, the main concern with LMS was that managing state is expensive and risky infrastructure-wise for firmware signing (mishandling the state even once is instant-death, so backups are tricky). The only hesitations I remember with SPHINCS+ were signature size and that the standard wasn't out yet. Now, there are smaller parameter sets on the horizon with comparable signature size to LMS, and the standardization process is almost over (the final version isn't out but NIST has already committed to what will be in it),.

The main concern regarding SPHINCS+ is a signature size, which being repeated for ROM_EXT and main firmware noticeably reduce amount of space available, which may be prohibitive for some applications. Another concern that smaller parameter size is not a NIST standard and not clear if it ever will be in foreseeable future.

NIST has said on the pqc-forum mailing list that there will be an SP, although they haven't committed to a timeline for that. I understand SP is usually much faster than FIPS, though. Once they decide on the exact parameters, the smaller sets can be used for everything past ROM, since we can update the firmware.

For the current generation of chips I think we should stick with the larger parameters in ROM itself since we don't know what NIST will select, but that means we pay the higher signature size cost only once rather than several times.

Also SPHINCS+ verification takes some time, so implementation may take benefits from caching once verified signature for given code and on next verification just change that neither signature nor code didn't change to skip this step.

It's much faster (~4x) with the smaller parameter sets; faster than ECDSA. As with signature size, we can hopefully use the smaller sets for everything past ROM, even on the current generation of chips, so the overall performance impact should be small.

ballifatih commented 5 months ago

Also SPHINCS+ verification takes some time, so implementation may take benefits from caching once verified signature for given code and on next verification just change that neither signature nor code didn't change to skip this step.

https://github.com/lowRISC/opentitan/issues/20414

vsukhoml commented 5 months ago

If I remember correctly, the main concern with LMS was that managing state is expensive and risky infrastructure-wise

There are many speculations around this, but over many years I think we have good solutions to handle a single counter, and even update it if distributed among different HSM. I saw more issues with HSM going down with keys, than hypothetical reuse of the counter.

It's much faster (~4x) with the smaller parameter sets; faster than ECDSA.

LMS seems to be even faster and have 3-4x smaller signatures than SPHINCS+ with ~4K signature size.

https://github.com/lowRISC/opentitan/issues/20414

I was more thinking on pre-hashing the message (signing same digest as EC signature).

jadephilipoom commented 5 months ago

I dug into my notes from the previous discussion and made some back-of-the-envelope estimates to answer your questions and help explain the decision, since we do occasionally get questions about LMS and it's good to show the work somewhere. But ultimately, this isn't really up for debate for the current generation of chips at this point; we're far too late in the process. Hopefully the below helps explain our reasoning.

There are many speculations around this, but over many years I think we have good solutions to handle a single counter, and even update it if distributed among different HSM. I saw more issues with HSM going down with keys, than hypothetical reuse of the counter.

There are a lot of opinions on this matter; the public comments on NIST SP800-208 are a good collection. The LMS/XMSS RFC from IETF also has a section on this topic. Suffice to say there's enough discussion and concern across those sources that I'm not comfortable simply dismissing it as an easy or solved problem.

The problem is that, even if a key reuse event is very unlikely (the power might need to get cut at exactly the wrong time, for example), it's an "instant death" occurrence -- an attacker who can see the two signatures can forge arbitrary messages. It's still important to take signing-infrastructure steps to mitigate the risk, and to evaluate those steps carefully, and that's the part that I suspect is expensive and/or tricky to implement correctly and get certified.

LMS seems to be even faster and have 3-4x smaller signatures than SPHINCS+ with ~4K signature size.

I'm not sure which parameter sets you compare here; in this context the most likely SPHINCS+ parameter set would be something like shake-128s-q20, which has 3.5kB signatures, and the analogous LMS parameter set (n=32, w=4, h=20) has 2.8kB signatures (see this paper for a good explanation of signature size calculation and parameters). LMS parameter sets are always "small", e.g. max 2^10 or 2^20 signatures per key, so comparing them to the full-size SPHINCS+ parameter sets (max 2^64 signatures per key) is kind of apples to oranges.

(One remaining asymmetry in the comparison is security level, since NIST refused to standardize an LMS parameter set at PQC-competition level L1 -- I'm going to be comparing L1 SPHINCS+ parameters to ~L2 LMS parameters here, because those are the lowest security levels in standards and both are sufficient for our use case, so practically those are the options we are choosing between. At higher security levels LMS might come out more favorably, but a more general comparison of signature schemes is not the point in this discussion.)

You can get down to 1.8kB signatures with LMS by setting w=8, but the tradeoff is then performance. According to this paper (tables 9 and 10 -- note that the "w" there is the XMSS w, which is 2^w for LMS, so w=16 there means w=4 for LMS and w=256 means w=8), going from w=4 to w=8 incurs about an 8-9x (!) performance penalty for verification across all parameter sets they measured. This paper also independently confirms that ratio.

Using the pqm4 project's benchmarks and the ARM Cortex-M4 LMS paper, I was able to make some calculations estimating the performance of a few LMS parameter sets on OpenTitan (source code at the end of the comment). The Cortex-M4 paper doesn't have benchmarks for h=20, but the impact on verification time is pretty negligible for LMS (it's signature size and keygen time that h affects). So let's assume h=20 has about the same performance as h=10. Then our parameter choices are basically (ordered by verification time):

Parameters	Signature size (bytes)	Verification time (ms)
sphincsplus-shake-128s-q20	3264	2.78
LMS-h20-w4	2828	5.65*
sphincsplus-shake-128s	7856	12.98
LMS-h20-w8	1772	49.27*

*estimated, based on h=10

I can see an argument for contexts where multiple 8kB signatures are painful, although I'd still argue it's worthwhile to avoid signing infra complexity/risk. That's kind of a judgment call depending on the use case. But it seems to me there is no contest between LMS and the q20 parameters: SPHINCS+ is significantly faster for verification in this case, the signature sizes are pretty similar, and of course you get the benefits of statelessness as well.

Python script used to generate the estimates, for clarity/auditability (I printed some extra values to check my work and went over this a few times, but more eyes welcome):

# SPHINCS+ shake-128s verification on ARM Cortex-M4, from pqm4 benchmarks
# https://github.com/mupq/pqm4/blob/4584cfcef04e937c92ca6a923c9ea42e01d3122e/benchmarks.md
m4_spx_shake_128s_cycles = 24366771

# SPHINCS+ shake-128s and shake-128s-q20 verification on OpenTitan
# https://eprint.iacr.org/2022/1725.pdf (table 3)
ot_spx_shake_128s_cycles = 1298047 
ot_spx_shake_128s_q20_cycles = 277852 

# LMS verification benchmarks for Cortex-M4 with SHAKE256 with w=4, w=8
# from https://eprint.iacr.org/2020/470.pdf (table 10)
# important: these benchmarks also use the pqm4 SHAKE impl
m4_lms_shake_h10_w4_cycles = 10611902
m4_lms_shake_h10_w8_cycles = 92485431

# Approximate ratio of OT / M4 cycles based on SPHINCS+ benchmarks
# note: most of this is OT's hw SHAKE vs pqm4's sw SHAKE, since hashing is
# 80-90% of runtime
ot_m4_ratio = ot_spx_shake_128s_cycles / m4_spx_shake_128s_cycles
print(f'OT/M4: {ot_m4_ratio:0.5f}')

# Clock rate: 100MHz
cycles_per_ms = 100_000

# Print the SPX+ time estimates to sanity-check the clock rate
ot_spx_shake_128s_ms = ot_spx_shake_128s_cycles / cycles_per_ms
ot_spx_shake_128s_q20_ms = ot_spx_shake_128s_q20_cycles / cycles_per_ms
print(f'sphincsplus-shake-128s on OT: {ot_spx_shake_128s_ms:0.2f}ms')
print(f'sphincsplus-shake-128s-q20 on OT: {ot_spx_shake_128s_q20_ms:0.2f}ms')

# Get ballpark estimates for LMS on OT: ot = m4 * (ot / m4)
ot_lms_shake_h10_w4_cycles = m4_lms_shake_h10_w4_cycles * ot_m4_ratio
ot_lms_shake_h10_w8_cycles = m4_lms_shake_h10_w8_cycles * ot_m4_ratio

# Estimate the cycles for SPX+ as a sanity check on the ratio.
ot_spx_shake_128s_estimated_cycles = m4_spx_shake_128s_cycles * ot_m4_ratio
assert ot_spx_shake_128s_estimated_cycles == ot_spx_shake_128s_cycles

# Print the time estimates in ms
ot_lms_shake_h10_w4_ms = ot_lms_shake_h10_w4_cycles / cycles_per_ms
print(f'LMS-SHAKE-h10-w4 on OT: {ot_lms_shake_h10_w4_ms:0.2f}ms')
ot_lms_shake_h10_w8_ms = ot_lms_shake_h10_w8_cycles / cycles_per_ms
print(f'LMS-SHAKE-h10-w8 on OT: {ot_lms_shake_h10_w8_ms:0.2f}ms')

jadephilipoom commented 5 months ago

It's been pointed out to me that LMS doesn't rely on hash function collision resistance for its security argument, so you could safely truncate the digests here by setting n=24 instead of n=32 without dropping below L1. In that case the signatures are 1744 bytes with w=4, so about half the size compared to SPHINCS+ q20. I don't have benchmarks I can use for n=24 to get a performance comparison, but I'd guess it doesn't affect performance too much because you're computing the same hash state and simply not using as much of it, so it likely has about the same performance as n=32, w=4.

This update doesn't really change the argument above, because I still think the increased signature size is worth saving signing-infrastructure complexity, but for the sake of accuracy I wanted to issue the correction :slightly_smiling_face:

vsukhoml commented 5 months ago

Thanks for detailed comments!

There are a lot of opinions on this matter; the public comments on NIST SP800-208 are a good collection.

Yes, and I fully share concerns. But none of those concerns says it is hard to implement. They are about importance that it have to be implemented. In our specific case counter can always be reconstructed from other sources, like blockchain - git hash history of voting/signing. There are guides on how to do that like Hash-based Signatures: State and Backup Management. There are hardware-backed solutions like monotonic counters. Anyway, it is more in the beliefs space as there are no known incidents in this area as far as I know.

it's an "instant death" occurrence -- an attacker who can see the two signatures can forge arbitrary messages

This is not exactly - it highly increases chances to do that if attacker is going to solve partial preimage problem trying to get hash that will have message digits larger than digits in digests that were signed and same for its checksum. Say digest which is all zeroes (I'm still curious about its preimage) will lead to same consequences, so we have checksums for digest constructed to minimize this. Multiple digests signed with same counter greatly increase likeliness that attacker will be successful. Faulting Winternitz One-Time Signatures to forge LMS, XMSS, or SPHINCS+ signatures

I'm not sure which parameter sets you compare here; in this context the most likely SPHINCS+ parameter set would be

LMS-h20-w8 with SHA256/192 (m=24) will be 1176 ((24+2)24 + 2420 + 64 +8 ) bytes

LMS by setting w=8, but the tradeoff is then performance

It is, and I got ~10ms for verification, which could be improved with better HW - most of time spent in computing recurrent hashes and it can be done better with HW support as loading / reading data in hash engine takes most time. SPHINCS+ can be definitely faster, but lacks on signature and code size.

Summarizing - it is late to implement it now and there are concerns that signing infrastructure will be able to implement non-repeating counter. LMS is likely to be slower, but can have smaller signatures and code size. In the longer run size may be less important with current trends in increasing memory sizes and practically are the concerns just for my team on the current embodiment (Earlgrey) and thus are temporary, and benefits of stateless nature makes it more appealing.

johannheyszl commented 4 months ago

removing Hotlist label since there seems to be no discussion needed

jadephilipoom commented 4 months ago

Endianness point addressed by https://github.com/lowRISC/opentitan/pull/22953

lowRISC / opentitan

[rom] Change the base-w representation in SPHINCS+ #21944

Description