Open danakj opened 2 months ago
I'd like to see this happen so that:
CC @var-const
I could see this being useful, however we'd have to agree on what needs to be marked. For example, we support an ABI configuration under which iterators are bounded and will trap on OOB access. Would we consider an iterator-based API to be "unsafe" under that configuration?
IMO this would need an RFC, since it's a major change within libc++. That RFC should mention
Also some things that will probably come up:
back_insert_iterator
)A few things here:
std::memcpy()
are now hardcoded in the compiler as unsafe: #101583 - and it showcases a few benefits of hardcoding those in the compiler._LIBCPP_ABI_BOUNDED_ITERATORS
is enabled on your platform, which makes these functions safe as long as bounded iterators are used..begin()
/.end()
as unsafe with the attribute._LIBCPP_ABI_BOUNDED_ITERATORS
cannot be enabled. IIUC we can't promise a solution in this case but we are exploring a few ideas (@ldionne can probably elaborate a bit better).std::unique_ptr<T[]>::operator[]
should probably be considered unsafe despite the clever hardening trick implemented by #91798; same with other smart pointers. We'll probably cover those in a moment.Is iterator hardening always sufficient to prevent UB?
void f(std::random_access_iterator auto begin, std::random_access_iterator auto end) {
if (std::distance(begin, end) < 10) { ... }
}
In this example if the iterators come from different containers (point to different allocations), we get UB, and the iterator checks don't help, right?
In this example if the iterators come from different containers (point to different allocations), we get UB, and the iterator checks don't help, right?
They actually help quite a bit because an iterator from a different container will naturally be out-of-bounds for the other container. It's just a matter of remembering to check that. The only way this doesn't work is when it's two overlapping spans of the same buffer. But even then, it's more of a technical UB than a practical UB, because the underlying operation is kind of valid anyway given that it's the same buffer(?)
In any case, looks like there's a TODO about that specifically: https://libcxx.llvm.org/Hardening.html#assertion-categories
valid-input-range – checks that ranges (whether expressed as an iterator pair, an iterator and a sentinel, an iterator and a count, or a std::range) given as input to library functions are valid: - the sentinel is reachable from the begin iterator; - TODO(hardening): both iterators refer to the same container.
Keep in mind that there are still some gaps in bounded iterators. They're fixable gaps, but someone needs to go through and fix them. #78771 comes to mind. But they're also simply bugs in to fix in bounded iterators, not a problem in the approach.
To that end, I think the issue @danakj points out is simply a bug in bounded iterators, not a reason to reject the idea. operator-
can just check that the two iterators are compatible by checking for a match in __begin_
and __end_
. In fact, I'd already flagged that is a prerequisite to fixing #78771. See step 5 in https://github.com/llvm/llvm-project/issues/78771#issuecomment-1911188368
TBH I can probably just cite all the bugs I've opened, as almost of them have been motivated in some form by bounded iterators. 😄 (But I don't have all that much time to hack on stuff, so I file more bugs than I have time to send patches for.) https://github.com/llvm/llvm-project/issues/created_by/davidben
@danakj pointed out a case where we're kinda stuck with unbounded iterators. std::begin(T(&)[N])
is defined to return `T*, not an iterator. So even if we manage to get 100% bounded iterator coverage across an entire project's containers, there would need to be some answer for what happens when you pass pointers into an STL function.
(Or perhaps we get -fbounds-safety
working in C++ so that T*
is also bounded. 😄 )
If raw pointers are passed to these functions, we should arguably warn unconditionally; but even then, the compiler might do a better job than the attribute.
With the standard library, it's possible for the compiler to do this in a nice way, I agree. What about for other library authors?
From what I can see with current APIs, this would require every method that takes an iterator to have an overload taking a pointer, so that the latter can be annotated [[clang::unsafe_buffer_usage]]
.
For example:
template <std::input_iterator I>
void insert(I begin, I end);
Would need to become something like:
template <std::input_iterator I>
void insert(I begin, I end);
[[clang::unsafe_buffer_usage]]
void insert(value_type* begin, value_type* end);
Having to double the number of functions in APIs like this to get the unsafe-buffers warning only when working with a pointer is really unfortunate. And it seems like the compiler could do better here for other libraries beyond the stdlib.
One possibility, an optional boolean on the attribute.
template <std::input_iterator I>
[[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]]
void insert(I begin, I end);
How else could we get warnings for pointers in these cases, in a consistent way between std and elsewhere?
Would need to become something like:
template <std::input_iterator I> void insert(I begin, I end); [[clang::unsafe_buffer_usage]] void insert(value_type* begin, value_type* end);
It'd actually need to become something like:
template <std::input_iterator I>
void insert(I begin, I end) {
// body
}
void insert(value_type* begin, value_type* end) {
#pragma clang unsafe_buffer_usage begin // or address some of those warnings
// body
#pragma clang unsafe_buffer_usage end
}
because the attribute doesn't suppress the warning inside the function. It only notifies the caller that this function should not be used at all. Addressing the warnings inside the function may still make your code strictly safer, even if it doesn't eliminate the unsafety entirely.
Side note, -Wunsafe-buffer-usage
currently has a tricky false negative when it comes to templates in headers on project/library boundaries. When all we see is a piece of uninstantiated template code, we often can't notice any unsafe operations in it. A lot of the time we can't even notice that there are any pointers in it. We have no idea what types may go in. In such situations we can't emit a warning to the author of the library that their template is unsafe. And without that, the author is less likely to add [[clang::unsafe_buffer_usage]]
to their template voluntarily.
On the other hand, if the user of the library includes the library through -isystem
(which is the intended way to include the code you don't control), we aren't allowed to emit warnings inside the template either. Because the compiler is simply not allowed to emit warnings against -isystem
headers - given that users can't fix them anyway. So the users will not be able to encourage the author to add [[clang::unsafe_buffer_usage]]
to their template, given that they don't see any warnings either.
So in order to display any warning at all in this scenario, we'll need to do something very unusual: emit the warning in user code and unwrap the instantiation stack in reverse in order to put a note where the actual unsafe operation is. (It's probably not enough to simply put a note at the unsafe operation. Without the instantiation stack it may become incomprehensible.)
Once we implement that, it may be true that the header doesn't need #pragma clang unsafe_buffer_usage
anymore. The pragma will go to the call site instead (if at all). They may not need the attribute either. At least not until they instantiate the template this way in their own library (so we never had that problem to begin with).
But the point still stands: if the function is unsafe, there are often a few ways to make it slightly safer, even if it continues to be unsafe. In this sense, a separate instantiation is often necessary.
Not always though. So:
One possibility, an optional boolean on the attribute.
template <std::input_iterator I> [[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]] void insert(I begin, I end);
I still really like this. It may be exactly what we need in some cases. We could make the pragma conditional too?
#pragma clang unsafe_buffer_usage begin(std::is_pointer_v<I>) template <std::input_iterator I> [[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]] void insert(I begin, I end) { // body } #pragma clang unsafe_buffer_usage end
I like the boolean on the attribute too. Should it be a type trait or something, so we can distinguish __wrap_iter
(unsafe) from __bounded_iter
(safe)?
I mean really what we want is to say something like "this is safe as long as I::operator*
, I::operator++
, etc., are safe, but I suspect that would be more challenging than a trait.
Closing, since the only actionable task here is to create an RFC, which isn't the task of libc++ contributors (necessarily).
@philnik777 I think this should be reopened. The problem stands and it seems there is plenty of discussion on how we might solve it. It is a thorny problem and will likely require some deeper discussion on how best to solve it, but thorny problems are still problems. Particularly when they impact memory safety.
@davidben These are all valid points, but nothing that could come out of that discussion is actionable for us. This is much too big of a project for this to be part of regular maintenance, which means that there has to be an RFC and a volunteer to actually implement this. For that reason I don't think this is the right forum for such a discussion. A discourse thread would be a much better option, and reaches a far wider audience.
I see. So, to confirm, libc++'s stance is that you all do not consistently track problems in GitHub issues and instead move discussions in between GitHub and Discourse threads depending on how difficult they end up being to solve?
Is this stance documented somewhere? @ldionne, is this consistent with your understanding of the project's process? This definitely would have been useful context for me in figuring out how best to track safety gaps (a key, user-facing, security-critical problem in C++ implementations today) in libc++'s STL implementation.
This also seems like a problematic strategy for making progress on issues, which should ultimately be the goal. When a user-facing problem is discovered, it is often not obvious from the start whether it will end up being simple or require significant changes. That means issues will naturally start in GitHub. To have them be closed and we hope someone else manually transfers it to Discourse means losing track of a lost of past discussion.
If libc++ wishes to use this strategy, do you all have an automatic process for synchronizing the two? Asking people to manually transfer issues between GitHub and Discourse is a good way for them to slip between the cracks. For every GitHub issue someone files, there are many more people who go to file a GitHub issue, search for existing ones, see it's already been filed, and skip filing one. All those people's needs end up slipping through as a result of this strategy.
I would suggest a better strategy might be to keep the GitHub issues open, as they remain issues. This one's next step is simply "come up with a proposal and start a discourse thread". But in that critical time period in between, it's important to keep the ticket open so people can find it, add thoughts to it, and hopefully eventually lead to a solution.
My understanding of our process is that Discourse RFCs are used to "cement" decisions that have wide impact, such as this one. This is kind of documented here, but to be fair it's not suuuuper clear.
That being said, I personally don't see a problem with a discussion taking place in a Github issue if it is making progress and if the relevant people are in the thread. All that I would ask is that at the end of said discussion, a Discourse RFC be written to follow our consensus-seeking process: that RFC could explain what is being proposed (which would be the result of the discussion here), and could also rationalize the proposed design by summarizing the conclusions drawn here.
I am speculating a bit, but I think @philnik777 is coming at this from a slightly different angle, where issues are supposed to track concrete actionable things, otherwise we end up accumulating a ton of "issues" and our bug tracker becomes a mess. I think that is also a valid point, and so we should try not to make issues like this linger for too long. We could also perhaps have a label like discussion
, and we should close any issue or discussion that becomes stale. I suspect that's one of the reasons for @philnik777 closing this -- the last comment was 3 weeks old, so this was kind of losing momentum and at that point the next step might be a RFC.
So, in summary, I'm fine with this discussion continuing here if it's the most productive way of moving forward, but if it loses momentum, that's probably a signal that it's time for a concrete Discourse RFC where we can get consensus on an actual approach and get started.
Going back to technical bits. About
#pragma clang unsafe_buffer_usage begin(std::is_pointer_v<I>)
My understanding is that the compiler should flag the point at which a raw pointer is obtained from a data structure (way before it gets passed to on of our APIs like insert
), which is going to be in the user code, not in libc++.
If we go for something like this:
#pragma clang unsafe_buffer_usage begin(std::is_pointer_v<I>)
template <std::input_iterator I>
[[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]]
void insert(I begin, I end) {
// body
}
#pragma clang unsafe_buffer_usage end
then that means that basically every single iterator-based API in the library needs to be annotated, since all these APIs accept pointers, which are a kind of iterator. That raises a flag in my mind, since there ought to be a mechanism that doesn't require something so mechanical and repetitive.
If we made the assumption that iterator-based APIs are in fact safe (since we can harden iterators), we would only have to flag uses of raw pointer arithmetic in the compiler, no? We'd mark such arithmetic inside libc++ hardened iterators as "trust me that's safe", and the compiler would do all the rest. Where am I going astray here (I'm certain this approach has been considered before)?
libc++ is system headers so unsafe-buffers warning does not fire in the headers. We need the warning to happen on the call to the libc++ function when the caller is passing something unbounded (I think the discussion here is assuming iterators are bounded, so a pointer).
libc++ is system headers so unsafe-buffers warning does not fire in the headers.
Yes, but the user code that creates the pointer in the first place is not in a system header, right?
[...] (I think the discussion here is assuming iterators are bounded, so a pointer).
I think there's a typo here. You either mean "iterators are not bounded, so a pointer", or "iterators are bounded, so not a pointer". Which one did you mean?
I mean that the discussion is assuming all iterators are bounded, so the warning should fire when passing a pointer to a function that will do arithmetic on it within. That's the [[clang::unsafe_buffer_usage(std::is_pointer_v<It>)]]
meaning.
It feels a bit wishful to think that code using libc++ will be unable to create a pointer pair that would go OOB without hitting a warning, thanks to the vast ecosystem of pointer code. The function that would go OOB if given bad pointers is responsible for marking its API as such and generating a warning.
Introduced a new function attribute
__attribute__((unsafe_buffer_usage))
to be worn by functions containing buffer operations that could cause out of bounds memory accesses. It emits warnings at call sites to such functions when the flag-Wunsafe-buffer-usage
is enabled. -- https://reviews.llvm.org/D138940
It feels a bit wishful to think that code using libc++ will be unable to create a pointer pair that would go OOB without hitting a warning, thanks to the vast ecosystem of pointer code.
Definitely a bit wishful, but then again what's the benefit to the user code if the only unsafe usage of raw pointers that we flag are the ones in the standard library, but none of their own code using raw pointers unsafely gets flagged (unless they too adopt the attribute)?
Stepping back, I think there's ambiguity in
functions containing buffer operations that could cause out of bounds memory accesses
Basically every function that accesses a buffer via raw pointers can lead to an OOB access if it gets passed invalid information (e.g. an invalid length, an invalid pointer, etc). Is that really useful to flag though? At that point, shouldn't we be flagging the access to the buffer itself, and then we're back to my suggestion above?
Another (much more targeted) take on this would be that we only want to flag APIs that are fundamentally unsafe, for example, the 3-legged std::equal
algorithm:
template <class InputIterator1, class InputIterator2>
bool equal(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2);
That API should basically not be used since a 4-iterators overload now exists:
template <class InputIterator1, class InputIterator2>
bool equal(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, InputIterator2 last2);
The 3-legged variant is a clear example of an unsafe API, no matter what kind of iterators we have.
I think I'm not convinced (yet) about the value of applying this attribute to all Standard library APIs that accept iterators:
It could be that I'm not seeing the big picture. I'll note, however, that if we were to do this, it could probably tie into a solution for https://github.com/llvm/llvm-project/issues/78771 since every algorithm will likely have to be taught about the concept of "iterator unwrapping" to solve that issue, and this could be a place where we diagnose that we actually unwrapped from something unsafe. Just a thought.
There are many methods in libc++ which can cause out-of-bounds issues when given incorrect inputs, such as any method that takes one or more iterators as its inputs, or that takes a pointer input.
Will libc++ be annotating such methods with
[[clang::unsafe_buffer_usage]]
? Is the project open to adding such annotations on methods that receive iterators (instead of ranges)?Concrete example:
std::ranges::subrange::subrange(iterator, sentinel)
if given invalid inputs will create a subrange that goes out of bounds. This is similar tostd::span(first, size)
, which is currently hard-coded in the compiler as-if it were marked with[[clang::unsafe_buffer_usage]]
. Other examples:std::span::span(first, last)
,std::vector::insert(pos, first, last)
,std::memcpy(dest, src, count)
.Putting such annotations in libc++ will help callers avoid unsafe APIs and transition to safer ones.
We would need need all
[[clang::unsafe_buffer_usage]]
to live behind a config define to allow enabling it separately from rolling libc++ though.Thoughts? Is this something we could do now? At some future time? Explicitly undesirable?
cc: @haoNoQ @ziqingluo-90 @jkorous-apple @ldionne