katzenpost / docs

specification and design documents
Creative Commons Attribution Share Alike 4.0 International

make sphinx replay detection more time efficient #52

Open david415 opened 5 years ago

david415 commented 5 years ago

very obviously the current sphinx replay detection is not as fast as it could be. certainly the current design is very simple; that is its only virtue.

burdges commented 5 years ago

At some point, I concluded a cuckoo filter might support inserts into a large but still in memory table faster than bloom filters, due to only requiring two cache misses, while bloom filter inserts always require k.
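The cache-miss argument above can be made concrete with a toy sketch (hypothetical names and parameters, not Katzenpost code): a Bloom filter insert must set k independently-hashed bits, so it can incur up to k cache misses, while a cuckoo filter insert only ever probes two candidate buckets.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Toy illustration only: count the table locations an insert touches.
const (
	bloomBits     = 1 << 20
	bloomK        = 7 // a typical k for a low false-positive rate
	cuckooBuckets = 1 << 18
)

func hashN(tag []byte, n uint32) uint64 {
	h := fnv.New64a()
	h.Write(tag)
	h.Write([]byte{byte(n), byte(n >> 8), byte(n >> 16), byte(n >> 24)})
	return h.Sum64()
}

// bloomInsert sets k bits and returns the number of slots (potential
// cache misses) it had to touch.
func bloomInsert(bits []byte, tag []byte) int {
	for i := uint32(0); i < bloomK; i++ {
		idx := hashN(tag, i) % bloomBits
		bits[idx/8] |= 1 << (idx % 8)
	}
	return bloomK
}

// cuckooInsert stores a short fingerprint in one of two candidate buckets;
// only two buckets are ever probed on insert (ignoring eviction chains,
// which a real cuckoo filter would handle by relocating fingerprints).
func cuckooInsert(buckets [][4]uint16, tag []byte) int {
	fp := uint16(hashN(tag, 0)) | 1 // non-zero fingerprint
	i1 := hashN(tag, 1) % cuckooBuckets
	i2 := (i1 ^ hashN([]byte{byte(fp), byte(fp >> 8)}, 2)) % cuckooBuckets
	for _, i := range []uint64{i1, i2} {
		for s := range buckets[i] {
			if buckets[i][s] == 0 {
				buckets[i][s] = fp
				return 2
			}
		}
	}
	return 2
}

func main() {
	bits := make([]byte, bloomBits/8)
	buckets := make([][4]uint16, cuckooBuckets)
	tag := []byte("example-replay-tag")
	fmt.Println("bloom slots touched:", bloomInsert(bits, tag))
	fmt.Println("cuckoo buckets touched:", cuckooInsert(buckets, tag))
}
```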

david415 commented 5 years ago

fair enough, that's an optimization we can certainly do... however i am thinking of a much more complicated optimization. currently our sphinx replay detection is described in full detail in this document here: https://github.com/katzenpost/docs/blob/master/specs/sphinx_replay_detection.rst

currently we actually fully process the sphinx packet before doing replay detection, whereas we could first match the entire packet via some kind of hash map... but only insert into the replay cache after the DH and MAC validation. in other words, we can have multiple fast paths which are faster than the current fastest path.
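A minimal sketch of that two-stage idea (hypothetical names, not the actual Katzenpost API): hash the raw packet and consult a replay map before the expensive DH and MAC work, but only insert the tag after the packet authenticates, so garbage packets cannot poison the cache.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// replayCache is an illustrative exact-duplicate fast path; a production
// cache would also bound memory and rotate with the mix key epoch.
type replayCache struct {
	seen map[[32]byte]struct{}
}

func newReplayCache() *replayCache {
	return &replayCache{seen: make(map[[32]byte]struct{})}
}

// process returns true if the packet is accepted. authenticate stands in
// for the real DH + MAC validation.
func (c *replayCache) process(pkt []byte, authenticate func([]byte) bool) bool {
	tag := sha256.Sum256(pkt)
	if _, ok := c.seen[tag]; ok {
		return false // fast path: exact duplicate, no DH or MAC needed
	}
	if !authenticate(pkt) {
		return false // unauthenticated packets never enter the cache
	}
	c.seen[tag] = struct{}{} // insert only after DH + MAC validation
	return true
}

func main() {
	c := newReplayCache()
	ok := func([]byte) bool { return true }
	fmt.Println(c.process([]byte("packet-1"), ok)) // first sighting: accepted
	fmt.Println(c.process([]byte("packet-1"), ok)) // replay: rejected cheaply
}
```

Note this fast path only catches bit-for-bit replays; replays of the DH shared secret still need the existing tag check after the key exchange.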

burdges commented 5 years ago

Oh? I'd believed replay protection should check exclusively the key exchange result because replay attacks might be on the key exchange itself, and attackers must obtain the correct key exchange result anyways. I think tricks like HDKD cannot be secure for mixnets anyways, so you cannot gain anything by doing the whole header.

I suppose MAC validation and decryption could happen concurrently, maybe improving performance, but one could imagine instead doing key exchange and replay protection in SGX or whatever, and later doing everything else outside SGX, so maybe I'd structure the code that way.

david415 commented 5 years ago

what does HDKD mean? actually we do gain something by matching the whole header and maybe even the whole packet: we gain a fast path. we can easily hash the packet and then look that hash up in a hash map. that would be way faster than doing a DH and then a MAC.

as stated in that document i linked above, what we currently do now is completely process the entire sphinx packet including the payload and then match the replay tag. slow as fuck.

and yes we would also check the replay of the key exchange. nothing i said is mutually exclusive with matching replays of the DH shared secrets. it's what we do now.

Yawning commented 5 years ago

Sigh.

So, the design rationale behind the current algorithm includes but is not limited to:

david415 commented 5 years ago

@Yawning I like the current design without the early-reject fast-path, however the one obvious way it could be sped up is to do the replay cache lookup/insert after the DH and MAC but before the SPRP.
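The reordering proposed here can be sketched roughly as follows (all function names are placeholders, not the real server code): the replay lookup/insert moves after the DH and MAC checks but before the SPRP payload decryption, so a replayed packet is dropped without paying for the most expensive step.

```go
package main

import "fmt"

// stage models one step of packet processing that can accept or reject.
type stage func(pkt []byte) bool

// unwrap runs the stages in the proposed order:
// DH + MAC -> replay check/insert -> SPRP decryption.
func unwrap(pkt []byte, dhAndMAC, replaySeen stage,
	markSeen func([]byte), sprpDecrypt stage) bool {
	if !dhAndMAC(pkt) {
		return false // bad DH or MAC: drop early
	}
	if replaySeen(pkt) {
		return false // replay: drop before the costly SPRP
	}
	markSeen(pkt) // record the tag only for authenticated packets
	return sprpDecrypt(pkt)
}

func main() {
	seen := map[string]bool{}
	accepted := unwrap([]byte("pkt"),
		func([]byte) bool { return true },              // DH + MAC ok
		func(p []byte) bool { return seen[string(p)] }, // replay check
		func(p []byte) { seen[string(p)] = true },      // record tag
		func([]byte) bool { return true },              // SPRP decrypt
	)
	fmt.Println("accepted:", accepted)
}
```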

alternatively the packet's public key could be used in the replay cache lookup before the DH and MAC... but only inserted into the replay cache after the DH and MAC. it would probably make the code a bit messy. maybe that's why there was resistance from certain people during the design phase.

@burdges um yeah we could use a cuckoo filter... we certainly do not need the deletion feature... but having faster insertions would be good. I don't understand why you mentioned doing MAC validation and decryption concurrently. I've tried previously to explain to you that in a proper software router all of these operations should happen in a thread pool, so there is already plenty of concurrency happening. i'll say it again... if you are writing routers with tokio or whatever async thing then you are doing it wrong. here, read this: http://www.sosp.org/2001/papers/welsh.pdf
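The thread-pool structure described above can be sketched like this (illustrative only, in the spirit of the staged event-driven design in the linked SEDA paper, not the actual Katzenpost server): a fixed pool of workers drains a packet queue, so the crypto stages already run in parallel without any async runtime.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool fans packets out to a fixed number of workers; handle stands in
// for the per-packet pipeline (DH, MAC, replay check, SPRP).
func runPool(workers int, packets [][]byte, handle func([]byte)) {
	queue := make(chan []byte)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pkt := range queue {
				handle(pkt)
			}
		}()
	}
	for _, p := range packets {
		queue <- p
	}
	close(queue)
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	processed := 0
	runPool(4, make([][]byte, 100), func([]byte) {
		mu.Lock()
		processed++
		mu.Unlock()
	})
	fmt.Println("processed:", processed)
}
```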