[[clang::unsafe_buffer_usage]] in libc++

danakj commented 2 months ago

There are many methods in libc++ which can cause out-of-bounds issues when given incorrect inputs, such as any method that takes one or more iterators as its inputs, or that takes a pointer input.

Will libc++ be annotating such methods with [[clang::unsafe_buffer_usage]]? Is the project open to adding such annotations on methods that receive iterators (instead of ranges)?

Concrete example: std::ranges::subrange::subrange(iterator, sentinel) if given invalid inputs will create a subrange that goes out of bounds. This is similar to std::span(first, size), which is currently hard-coded in the compiler as-if it were marked with [[clang::unsafe_buffer_usage]]. Other examples: std::span::span(first, last), std::vector::insert(pos, first, last), std::memcpy(dest, src, count).

Putting such annotations in libc++ will help callers avoid unsafe APIs and transition to safer ones.

We would need need all [[clang::unsafe_buffer_usage]] to live behind a config define to allow enabling it separately from rolling libc++ though.

Thoughts? Is this something we could do now? At some future time? Explicitly undesirable?

cc: @haoNoQ @ziqingluo-90 @jkorous-apple @ldionne

pkasting commented 2 months ago

I'd like to see this happen so that:

We can simplify compiler-side implementation, hopefully making it more robust
There's a clearer and more achievable path to getting the warning in other compilers
Library-side usage of the attributes is battle-tested at scale, so they're known-suitable for usage on authors' own types because pain points are fixed
libc++ sources serve as an example of how and where to use (much harder to understand usage by reading compiler source)
If anyone ever wants to propose something like this to WG21, it's clear in practice what it would imply

ldionne commented 2 months ago

CC @var-const

I could see this being useful, however we'd have to agree on what needs to be marked. For example, we support an ABI configuration under which iterators are bounded and will trap on OOB access. Would we consider an iterator-based API to be "unsafe" under that configuration?

philnik777 commented 2 months ago

IMO this would need an RFC, since it's a major change within libc++. That RFC should mention

what the benefit of doing that is
what the changes inside libc++ would look like
how this interacts with the hardening effort

Also some things that will probably come up:

How do we handle cases that are sometimes perfectly safe, like output iterators (keyword: back_insert_iterator)
What about interfaces where there is no safe alternative - there are probably some

haoNoQ commented 2 months ago

A few things here:

[[clang::unsafe_buffer_usage]] in the standard library is arguably an option (unless @ldionne objects), but it's also likely that with standard functions we can do better by simply hardcoding them in the compiler. This includes things like false positive suppression for known-safe idioms, as well as better diagnostic messages.
C functions such as std::memcpy() are now hardcoded in the compiler as unsafe: #101583 - and it showcases a few benefits of hardcoding those in the compiler.
Iterator and range based functions are as unsafe as their underlying iterator implementations:
- We're currently focusing on the situation where the ABI-defining option _LIBCPP_ABI_BOUNDED_ITERATORS is enabled on your platform, which makes these functions safe as long as bounded iterators are used.
- If raw pointers are passed to these functions, we should arguably warn unconditionally; but even then, the compiler might do a better job than the attribute.
- With custom containers iterators, we should probably give up and emit no warnings and that's actually perfectly fine. The model simply assumes that eventually you'll get to address the warning inside the implementation of your container, which will make you either harden the iterators, or annotate your .begin()/.end() as unsafe with the attribute.
- This leaves us with standard iterators in the situation where _LIBCPP_ABI_BOUNDED_ITERATORS cannot be enabled. IIUC we can't promise a solution in this case but we are exploring a few ideas (@ldionne can probably elaborate a bit better).
Finally, there's a few other standard functions that also need to be considered unsafe. In particular, std::unique_ptr<T[]>::operator[] should probably be considered unsafe despite the clever hardening trick implemented by #91798; same with other smart pointers. We'll probably cover those in a moment.

danakj commented 2 months ago

Is iterator hardening always sufficient to prevent UB?

void f(std::random_access_iterator auto begin, std::random_access_iterator auto end) {
  if (std::distance(begin, end) < 10) { ... }
}

In this example if the iterators come from different containers (point to different allocations), we get UB, and the iterator checks don't help, right?

https://github.com/llvm/llvm-project/blob/b3d2d5039b9b8aa10a86c593387f200b15c02aef/libcxx/include/__iterator/bounded_iter.h#L191-L194

haoNoQ commented 2 months ago

In this example if the iterators come from different containers (point to different allocations), we get UB, and the iterator checks don't help, right?

They actually help quite a bit because an iterator from a different container will naturally be out-of-bounds for the other container. It's just a matter of remembering to check that. The only way this doesn't work is when it's two overlapping spans of the same buffer. But even then, it's more of a technical UB than a practical UB, because the underlying operation is kind of valid anyway given that it's the same buffer(?)

In any case, looks like there's a TODO about that specifically: https://libcxx.llvm.org/Hardening.html#assertion-categories

valid-input-range – checks that ranges (whether expressed as an iterator pair, an iterator and a sentinel, an iterator and a count, or a std::range) given as input to library functions are valid: - the sentinel is reachable from the begin iterator; - TODO(hardening): both iterators refer to the same container.

davidben commented 2 months ago

Keep in mind that there are still some gaps in bounded iterators. They're fixable gaps, but someone needs to go through and fix them. #78771 comes to mind. But they're also simply bugs in to fix in bounded iterators, not a problem in the approach.

To that end, I think the issue @danakj points out is simply a bug in bounded iterators, not a reason to reject the idea. operator- can just check that the two iterators are compatible by checking for a match in __begin_ and __end_. In fact, I'd already flagged that is a prerequisite to fixing #78771. See step 5 in https://github.com/llvm/llvm-project/issues/78771#issuecomment-1911188368

davidben commented 2 months ago

TBH I can probably just cite all the bugs I've opened, as almost of them have been motivated in some form by bounded iterators. 😄 (But I don't have all that much time to hack on stuff, so I file more bugs than I have time to send patches for.) https://github.com/llvm/llvm-project/issues/created_by/davidben

davidben commented 2 months ago

@danakj pointed out a case where we're kinda stuck with unbounded iterators. std::begin(T(&)[N]) is defined to return `T*, not an iterator. So even if we manage to get 100% bounded iterator coverage across an entire project's containers, there would need to be some answer for what happens when you pass pointers into an STL function.

(Or perhaps we get -fbounds-safety working in C++ so that T* is also bounded. 😄 )

danakj commented 1 month ago

If raw pointers are passed to these functions, we should arguably warn unconditionally; but even then, the compiler might do a better job than the attribute.

With the standard library, it's possible for the compiler to do this in a nice way, I agree. What about for other library authors?

From what I can see with current APIs, this would require every method that takes an iterator to have an overload taking a pointer, so that the latter can be annotated [[clang::unsafe_buffer_usage]].

For example:

template <std::input_iterator I>
void insert(I begin, I end);

Would need to become something like:

template <std::input_iterator I>
void insert(I begin, I end);

[[clang::unsafe_buffer_usage]]
void insert(value_type* begin, value_type* end);

Having to double the number of functions in APIs like this to get the unsafe-buffers warning only when working with a pointer is really unfortunate. And it seems like the compiler could do better here for other libraries beyond the stdlib.

One possibility, an optional boolean on the attribute.

template <std::input_iterator I>
[[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]]
void insert(I begin, I end);

How else could we get warnings for pointers in these cases, in a consistent way between std and elsewhere?

haoNoQ commented 1 month ago

Would need to become something like:

template <std::input_iterator I>
void insert(I begin, I end);

[[clang::unsafe_buffer_usage]]
void insert(value_type* begin, value_type* end);

It'd actually need to become something like:

template <std::input_iterator I>
void insert(I begin, I end) {
    // body
}

void insert(value_type* begin, value_type* end) {
#pragma clang unsafe_buffer_usage begin // or address some of those warnings
    // body
#pragma clang unsafe_buffer_usage end
}

because the attribute doesn't suppress the warning inside the function. It only notifies the caller that this function should not be used at all. Addressing the warnings inside the function may still make your code strictly safer, even if it doesn't eliminate the unsafety entirely.

Side note, -Wunsafe-buffer-usage currently has a tricky false negative when it comes to templates in headers on project/library boundaries. When all we see is a piece of uninstantiated template code, we often can't notice any unsafe operations in it. A lot of the time we can't even notice that there are any pointers in it. We have no idea what types may go in. In such situations we can't emit a warning to the author of the library that their template is unsafe. And without that, the author is less likely to add [[clang::unsafe_buffer_usage]] to their template voluntarily.

On the other hand, if the user of the library includes the library through -isystem (which is the intended way to include the code you don't control), we aren't allowed to emit warnings inside the template either. Because the compiler is simply not allowed to emit warnings against -isystem headers - given that users can't fix them anyway. So the users will not be able to encourage the author to add [[clang::unsafe_buffer_usage]] to their template, given that they don't see any warnings either.

So in order to display any warning at all in this scenario, we'll need to do something very unusual: emit the warning in user code and unwrap the instantiation stack in reverse in order to put a note where the actual unsafe operation is. (It's probably not enough to simply put a note at the unsafe operation. Without the instantiation stack it may become incomprehensible.)

Once we implement that, it may be true that the header doesn't need #pragma clang unsafe_buffer_usage anymore. The pragma will go to the call site instead (if at all). They may not need the attribute either. At least not until they instantiate the template this way in their own library (so we never had that problem to begin with).

But the point still stands: if the function is unsafe, there are often a few ways to make it slightly safer, even if it continues to be unsafe. In this sense, a separate instantiation is often necessary.

Not always though. So:

One possibility, an optional boolean on the attribute.
template <std::input_iterator I>
[[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]]
void insert(I begin, I end);

I still really like this. It may be exactly what we need in some cases. We could make the pragma conditional too?

#pragma clang unsafe_buffer_usage begin(std::is_pointer_v<I>)
template <std::input_iterator I>
[[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]]
void insert(I begin, I end) {
    // body
}
#pragma clang unsafe_buffer_usage end

davidben commented 1 month ago

I like the boolean on the attribute too. Should it be a type trait or something, so we can distinguish __wrap_iter (unsafe) from __bounded_iter (safe)?

I mean really what we want is to say something like "this is safe as long as I::operator*, I::operator++, etc., are safe, but I suspect that would be more challenging than a trait.

philnik777 commented 1 month ago

Closing, since the only actionable task here is to create an RFC, which isn't the task of libc++ contributors (necessarily).

davidben commented 1 month ago

@philnik777 I think this should be reopened. The problem stands and it seems there is plenty of discussion on how we might solve it. It is a thorny problem and will likely require some deeper discussion on how best to solve it, but thorny problems are still problems. Particularly when they impact memory safety.

philnik777 commented 1 month ago

@davidben These are all valid points, but nothing that could come out of that discussion is actionable for us. This is much too big of a project for this to be part of regular maintenance, which means that there has to be an RFC and a volunteer to actually implement this. For that reason I don't think this is the right forum for such a discussion. A discourse thread would be a much better option, and reaches a far wider audience.

davidben commented 1 month ago

I see. So, to confirm, libc++'s stance is that you all do not consistently track problems in GitHub issues and instead move discussions in between GitHub and Discourse threads depending on how difficult they end up being to solve?

Is this stance documented somewhere? @ldionne, is this consistent with your understanding of the project's process? This definitely would have been useful context for me in figuring out how best to track safety gaps (a key, user-facing, security-critical problem in C++ implementations today) in libc++'s STL implementation.

This also seems like a problematic strategy for making progress on issues, which should ultimately be the goal. When a user-facing problem is discovered, it is often not obvious from the start whether it will end up being simple or require significant changes. That means issues will naturally start in GitHub. To have them be closed and we hope someone else manually transfers it to Discourse means losing track of a lost of past discussion.

If libc++ wishes to use this strategy, do you all have an automatic process for synchronizing the two? Asking people to manually transfer issues between GitHub and Discourse is a good way for them to slip between the cracks. For every GitHub issue someone files, there are many more people who go to file a GitHub issue, search for existing ones, see it's already been filed, and skip filing one. All those people's needs end up slipping through as a result of this strategy.

I would suggest a better strategy might be to keep the GitHub issues open, as they remain issues. This one's next step is simply "come up with a proposal and start a discourse thread". But in that critical time period in between, it's important to keep the ticket open so people can find it, add thoughts to it, and hopefully eventually lead to a solution.

ldionne commented 1 month ago

My understanding of our process is that Discourse RFCs are used to "cement" decisions that have wide impact, such as this one. This is kind of documented here, but to be fair it's not suuuuper clear.

That being said, I personally don't see a problem with a discussion taking place in a Github issue if it is making progress and if the relevant people are in the thread. All that I would ask is that at the end of said discussion, a Discourse RFC be written to follow our consensus-seeking process: that RFC could explain what is being proposed (which would be the result of the discussion here), and could also rationalize the proposed design by summarizing the conclusions drawn here.

I am speculating a bit, but I think @philnik777 is coming at this from a slightly different angle, where issues are supposed to track concrete actionable things, otherwise we end up accumulating a ton of "issues" and our bug tracker becomes a mess. I think that is also a valid point, and so we should try not to make issues like this linger for too long. We could also perhaps have a label like discussion, and we should close any issue or discussion that becomes stale. I suspect that's one of the reasons for @philnik777 closing this -- the last comment was 3 weeks old, so this was kind of losing momentum and at that point the next step might be a RFC.

So, in summary, I'm fine with this discussion continuing here if it's the most productive way of moving forward, but if it loses momentum, that's probably a signal that it's time for a concrete Discourse RFC where we can get consensus on an actual approach and get started.

Going back to technical bits. About

#pragma clang unsafe_buffer_usage begin(std::is_pointer_v<I>)

My understanding is that the compiler should flag the point at which a raw pointer is obtained from a data structure (way before it gets passed to on of our APIs like insert), which is going to be in the user code, not in libc++.

If we go for something like this:

#pragma clang unsafe_buffer_usage begin(std::is_pointer_v<I>)
template <std::input_iterator I>
[[clang::unsafe_buffer_usage(std::is_pointer_v<I>)]]
void insert(I begin, I end) {
    // body
}
#pragma clang unsafe_buffer_usage end

then that means that basically every single iterator-based API in the library needs to be annotated, since all these APIs accept pointers, which are a kind of iterator. That raises a flag in my mind, since there ought to be a mechanism that doesn't require something so mechanical and repetitive.

If we made the assumption that iterator-based APIs are in fact safe (since we can harden iterators), we would only have to flag uses of raw pointer arithmetic in the compiler, no? We'd mark such arithmetic inside libc++ hardened iterators as "trust me that's safe", and the compiler would do all the rest. Where am I going astray here (I'm certain this approach has been considered before)?

danakj commented 1 month ago

libc++ is system headers so unsafe-buffers warning does not fire in the headers. We need the warning to happen on the call to the libc++ function when the caller is passing something unbounded (I think the discussion here is assuming iterators are bounded, so a pointer).

ldionne commented 1 month ago

libc++ is system headers so unsafe-buffers warning does not fire in the headers.

Yes, but the user code that creates the pointer in the first place is not in a system header, right?

[...] (I think the discussion here is assuming iterators are bounded, so a pointer).

I think there's a typo here. You either mean "iterators are not bounded, so a pointer", or "iterators are bounded, so not a pointer". Which one did you mean?

danakj commented 1 month ago

I mean that the discussion is assuming all iterators are bounded, so the warning should fire when passing a pointer to a function that will do arithmetic on it within. That's the [[clang::unsafe_buffer_usage(std::is_pointer_v<It>)]] meaning.

It feels a bit wishful to think that code using libc++ will be unable to create a pointer pair that would go OOB without hitting a warning, thanks to the vast ecosystem of pointer code. The function that would go OOB if given bad pointers is responsible for marking its API as such and generating a warning.

Introduced a new function attribute __attribute__((unsafe_buffer_usage)) to be worn by functions containing buffer operations that could cause out of bounds memory accesses. It emits warnings at call sites to such functions when the flag -Wunsafe-buffer-usage is enabled. -- https://reviews.llvm.org/D138940

ldionne commented 1 month ago

It feels a bit wishful to think that code using libc++ will be unable to create a pointer pair that would go OOB without hitting a warning, thanks to the vast ecosystem of pointer code.

Definitely a bit wishful, but then again what's the benefit to the user code if the only unsafe usage of raw pointers that we flag are the ones in the standard library, but none of their own code using raw pointers unsafely gets flagged (unless they too adopt the attribute)?

Stepping back, I think there's ambiguity in

functions containing buffer operations that could cause out of bounds memory accesses

Basically every function that accesses a buffer via raw pointers can lead to an OOB access if it gets passed invalid information (e.g. an invalid length, an invalid pointer, etc). Is that really useful to flag though? At that point, shouldn't we be flagging the access to the buffer itself, and then we're back to my suggestion above?

Another (much more targeted) take on this would be that we only want to flag APIs that are fundamentally unsafe, for example, the 3-legged std::equal algorithm:

template <class InputIterator1, class InputIterator2>
bool equal(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2);

That API should basically not be used since a 4-iterators overload now exists:

template <class InputIterator1, class InputIterator2>
bool equal(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, InputIterator2 last2);

The 3-legged variant is a clear example of an unsafe API, no matter what kind of iterators we have.

I think I'm not convinced (yet) about the value of applying this attribute to all Standard library APIs that accept iterators:

We won't warn when these APIs are used with custom iterators that are unsafe (except pointers). That also includes libc++'s own unbounded iterators.
The user code will likely still have plenty of uses of raw pointers that go un-diagnosed
What is the intended user-side fix to this diagnostic?
Do we consider ranges-based APIs to be safe? (presumably yes)

It could be that I'm not seeing the big picture. I'll note, however, that if we were to do this, it could probably tie into a solution for https://github.com/llvm/llvm-project/issues/78771 since every algorithm will likely have to be taught about the concept of "iterator unwrapping" to solve that issue, and this could be a place where we diagnose that we actually unwrapped from something unsafe. Just a thought.

llvm / llvm-project

[[clang::unsafe_buffer_usage]] in libc++ #107904