ni4 closed this issue 7 months ago
Nothing that comes to mind immediately, unfortunately. Which version did you update from?
@reneme Thanks for the reply!
It was the latest 2.x from vcpkg sources. I don't have a Windows setup yet, unfortunately, to dig deeper. A PR with an added isolated test case (along with runner information) is here in case it would be useful: https://github.com/rnpgp/rnp/pull/2148, please see changes to src/tests/cipher.cpp
I managed to build RNP on my machine and can reproduce the crash. Looking into it.
Edit: I have a lead and am digging deeper.
The root cause is indeed a bug in Botan's OCB implementation. Perhaps already in 2.x, actually.
ocb.cpp contains a helper class called L_computer. And this has a method get(size_t i) that lazily creates new m_L values as necessary. The parameter i is the number of least-significant zeros in the binary representation of the current block index (ctz()); i.e. the bigger the input data, the larger i gets eventually.
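To make the relationship between the block index and i concrete, here is a minimal sketch. The helper names trailing_zeros and max_index_needed are hypothetical and written for illustration only; they are not Botan's own ctz() implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Counts the trailing zero bits of a block index.
// Illustrative helper only; returns 0 for n == 0.
inline std::size_t trailing_zeros(std::uint64_t n) {
    std::size_t count = 0;
    while (n != 0 && (n & 1) == 0) {
        n >>= 1;
        ++count;
    }
    return count;
}

// Largest table index that would be requested while processing
// block indices 1 through block_count (inclusive).
inline std::size_t max_index_needed(std::uint64_t block_count) {
    std::size_t max_i = 0;
    for (std::uint64_t b = 1; b <= block_count; ++b) {
        const std::size_t i = trailing_zeros(b);
        if (i > max_i) {
            max_i = i;
        }
    }
    return max_i;
}
```

With these helpers, block index 8192 (2^13) is the first one that asks for index 13, which is why a new table entry is only needed once the input reaches that many blocks.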
Here's the implementation of get(size_t i):
... now, if we hit an i that we haven't hit before, we calculate the new value and .push_back() it into the m_L vector. The new value for i is then returned as a reference into that vector.
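The lazy-growth pattern described here can be sketched roughly as follows. This is not the Botan code: LazyTable, its next() step and the 64-bit element type are assumptions chosen to keep the example small, standing in for L_computer, the real derivation of each value, and the block-sized entries in m_L.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a lazily filled lookup table.
class LazyTable {
public:
    explicit LazyTable(std::uint64_t seed) { m_values.push_back(seed); }

    // Returns a reference to entry i, appending missing entries first.
    // Appending may reallocate m_values, which invalidates every
    // reference handed out by an earlier call.
    const std::uint64_t& get(std::size_t i) {
        while (m_values.size() <= i) {
            m_values.push_back(next(m_values.back()));
        }
        return m_values[i];
    }

    std::size_t size() const { return m_values.size(); }

private:
    // Toy derivation step (assumption): each entry depends on the previous one.
    static std::uint64_t next(std::uint64_t v) { return v * 2 + 1; }

    std::vector<std::uint64_t> m_values;
};
```

The return type is the important detail: because get() returns a reference into the vector rather than a copy, any caller that keeps the result around is exposed to later growth of the table.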
Below is an excerpt of the function that uses get(). Note, in particular, how the references to L0 and L1 are retained and kept outside the loop.
We're calling get() inside the loop (line 79), which is unsafe. In your reproducer, once we hit block_index = 8192 we call get(ctz(8192)), which needs to lazily add a value to m_L. This re-allocates the m_L vector, therefore invalidates the references to L0 and L1 and crashes in the next loop iteration.
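The invalidation can be observed without touching a dangling reference by comparing the address of the vector's storage before and after a push_back(). The sketch below uses hypothetical helper names and a plain vector of integers; the reserve() variant is shown only as one possible mitigation and is not meant to describe the actual patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if appending one element moved the vector's storage,
// i.e. references taken before the push_back() would now dangle.
inline bool push_back_moves_storage(std::vector<std::uint64_t>& v) {
    const auto before = reinterpret_cast<std::uintptr_t>(v.data());
    v.push_back(0);
    const auto after = reinterpret_cast<std::uintptr_t>(v.data());
    return before != after;
}

// A vector filled exactly to capacity has to reallocate on the next append.
inline bool full_vector_reallocates() {
    std::vector<std::uint64_t> v(4, 0);
    while (v.size() < v.capacity()) {
        v.push_back(0);  // fill up to the current capacity
    }
    return push_back_moves_storage(v);
}

// With enough capacity reserved up front, appends leave the storage in
// place, so existing references stay valid.
inline bool reserved_vector_stays_put() {
    std::vector<std::uint64_t> v;
    v.reserve(8);
    v.push_back(1);
    return !push_back_moves_storage(v);
}
```

Other ways to avoid the problem would be to copy the needed values instead of holding references, or to re-fetch them after every call that might grow the table.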
I suspect the reason this crashes on some platforms but not on all is that std::vector<> has some freedom to decide how much memory is pre-allocated when .push_back() is invoked. I.e. not every .push_back() will actually cause a re-allocation of the vector. Also, there are other places in the code that call get(), and those won't trigger this issue when re-allocating. Thus, several circumstances have to align for this to become evident. That's probably why it remained undetected for 6 years (https://github.com/randombit/botan/commit/444eeb5ebcb65de8f063e90a31a4709214dfe78f), since this was released in Botan 2.4.0.
Because of the above, it might be tough to build a reasonable regression test, but I'll try regardless and of course build a fix.
Thanks a lot for reporting this!
@randombit I'm assuming we want to have a backport to 2.x for this?
The patch in #3814 does fix the issue, though I'm not sure it's the optimal thing to do. I propose we take further technical discussion over into the pull request.
@ni4 Thanks again for reporting and the valuable reproducer.
@reneme Wow, thanks for fixing this and for the details! Another lurking magical bug with simple and logical explanation :) At some point started to think that we should abandon random-input tests, but this case reverts the idea.
@reneme Was this backported to 2.19.4? Do not see it in the changelog.
I'm afraid that fell through the cracks. 😨 I'm going to open a PR for it. Unfortunately, only for the next maintenance release, then. Sorry for that and thanks for the reminder.
Np, thanks for confirming!
As it turns out, this bug is much easier to hit in 3.x than it is in 2.x. I had to invent a toy cipher with a different intrinsic block-parallelism to be able to produce a situation where the access violation would hit. Though, the odds are also dependent on the re-allocation behavior of std::vector<>. Hence, it's still worth fixing in 2.x, I would say. Thanks again for following up.
We use Botan as the cryptography backend for the RNP OpenPGP library. Recently, after updating CI runners to Botan 3.1.1, we started to receive failures, only on the Windows platform. After isolating the issue and enabling sanitizers, I got the following call stack. Suspecting something in RNP that may corrupt the memory (however, sanitizers don't report that), I was able to isolate the issue to pure Botan calls with certain input.
Do you have any idea why this happens? Should I provide some more input, like test data file/order of decryption calls causing the failure?