__sanitizer_start_switch_fiber() unmaps per-thread "fake" stack

stsp commented 1 month ago

It seems like these days detect_stack_use_after_return breaks fiber switching. When detect_stack_use_after_return is enabled, asan malloc's the "fake" stack and puts the locals there together with redzones. That process is (partially) documented here: https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn That very same stack ptr is put into the first argument of __sanitizer_start_switch_fiber(). If detect_stack_use_after_return is disabled, then the "fake stack" machinery is not used, so __sanitizer_start_switch_fiber() always puts NULL into its first arg.

Now the problem is, the fake-stack is per-thread, not per-fiber. When some fiber exits, we put NULL into the first arg of __sanitizer_start_switch_fiber(), and that unmaps the entire per-thread fake-stack: https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#166 Which, as noted above, contains current locals and redzones. So all crashes.

Probably __sanitizer_start_switch_fiber() should allocate and free its own fake stacks, and not touch the per-thread one?

stsp commented 1 month ago

Things seem to work right as long as __sanitizer_start_switch_fiber() never gets NULL as the first argument. So maybe it should just be re-documented?

ioquatix commented 1 month ago

If you think it's mainly a documentation issue, why not submit a PR for the documentation?

stsp commented 1 month ago

That will break all existing users, and also will make the first arg completely redundant, as the stack then never changes within tid.

Also I wonder if it is actually correct. Coroutine switch changes the current stack position (AFAICS %r13 is used to address fake stack), but I am not sure if the fake stack allocator is prepared for such switches. So thinking about this a bit more, I really suspect the documentation change is not the way forward. If such documentation change is done, then also disabling detect_stack_use_after_return for fibers is needed. But it might be better to allocate per-fiber fake-stacks.

ioquatix commented 1 month ago

I wrote a note here, maybe it will help you: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L830-L839

stsp commented 1 month ago

Yes, that helped me getting rid of the WARNING: ASan is ignoring requested __asan_handle_no_return, just as you wrote in your note.

Unfortunately this has nothing to do with the reported problem: unless detect_stack_use_after_return is disabled, you can't pass NULL as the first argument of __sanitizer_start_switch_fiber(). And your suggestion is about not passing NULLs to second and third args, not first. Many thanks for the help though, that problem was also real, plus I've found bugs in my own code while integrating your solution.

stsp commented 1 month ago

of __sanitizer_start_switch_fiber(). And your suggestion is about not passing NULLs to second and third args

... of __sanitizer_finish_switch_fiber(), not __sanitizer_start_switch_fiber().

stsp commented 1 month ago

In fact, you seem to have the potential NULL call here: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1577 Could you please check 2 things:

this is actually called with NULL
when it isn't called with NULL, it sets old_fiber->context.fake_stack to non-NULL?

ioquatix commented 1 month ago

I can take a look.

ioquatix commented 1 month ago

I'll try to confirm this with real data from CRuby, but my understanding is as follows:

this is actually called with NULL

Yes, at the end of the life of the fiber, we call it with NULL to indicate that the fake stack can be freed.

when it isn't called with NULL, it sets old_fiber->context.fake_stack to non-NULL?

Yes, that seems correct, it should save the current fake stack, but I'll confirm it by logging the output.

KJTsanaktsidis commented 1 month ago

OK so I think the way this works is...

__sanitizer_start_switch_fiber unmaps the fake stack in AsanThread::fake_stack_ if NULL is passed as the first parameter
__sanitizer_finish_switch_fiber changes the fake stack associated with AsanThread::fake_stack_.

So the fake stack that is unmapped in __sanitizer_start_switch_fiber is replaced in __sanitizer_finish_switch_finish.

So what you're supposed to do when switching fibers:

IF the fiber being switched from is dead, then call: __sanitizer_start_switch_fiber(NULL, to_fiber->stack_base, new_fiber->stack_size)
Otherwise, if the fiber being switched from will be returned to eventually, then call: __sanitizer_start_switch_fiber(&from_fiber->fake_stack_ptr, to_fiber->stack_base, to_fiber->stack_size). That will save the current fake stack in &from_fiber->fiber_stack_ptr. It will also set the stack bounds for the switched-to fiber to stack_base/stack_size.
Then, after you switch out %rsp/%rip,
the new fiber should call __sanitizer__finish_switch_fiber(to_fiber->fake_stack_ptr, &from_fiber->stack_base, &from_fiber->stack_size). This will switch AsanFiber::fake_stack_ to the to-fiber's stack pointer, and retrieves the current stack base/size in &from_fiber.
- Although this only has to happen at one "time" (when a fiber is switched to), this is two locations in code. Once in the new fiber creation path (will be called the first time a new fiber is switched to), and once at the end of the %rsp/%rip restore code (every subsequent time a fiber is switche dto).
- Note that we don't need to save &from_fiber->stack_base more than once for a fiber (it won't change), so it's OK to pass NULL, NULL in the second place.

I'm far from an expert on this but this is what the Ruby code is doing and I think it's correct.

stsp commented 1 month ago

__sanitizer_finish_switch_fiber changes the fake stack associated with AsanThread::fakestack.

Changes to what? The problem is that fake_stack is per-thread. So once unmapped, it gets changed to the same unmapped stack. https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#166 This is the libasan code, its very easy to follow. You can see where the per-thread stack is destroyed, and that its not re-allocated by the fiber-switching routines.

I'm far from an expert on this but this is what the Ruby code is doing and I think it's correct.

Its correct by matching the libasan documentation. But libasan itself doesn't match its own documentation.

Yes, that seems correct, it should save the current fake stack, but I'll confirm it by logging the output.

Thanks! If you can confirm it is actually non-NULL (which would already be strange), please also confirm that every coroutine sets it to the different value. This will mean every coroutine has its own fake-stack, just as documented. Which will mean some other version of libasan, not the one I am pointing to the sources of.

KJTsanaktsidis commented 1 month ago

Ahh, i think i see where the confusion is - the allocation of fake stacks in ASAN is lazy.

You can see where the per-thread stack is destroyed, and that its not re-allocated by the fiber-switching routines

~Not quite - the current_fake_stack being destroyed there is not a threadlocal variable or anything like that - it's the member variable AsanThread::_fake_stack.~ I mean it was threadlocal but the destruction is conditional on it being NULL, as you say.

When we start a new fiber in Ruby, we actually call __sanitizer_finish_switch_fiber(NULL, stack_base, stack_len) (https://github.com/ruby/ruby/blob/fc495951b128a4256ba28be560584d2a58530f21/cont.c#L839). to->fake_stack is actually always set to NULL on fiber create here: https://github.com/ruby/ruby/blob/fc495951b128a4256ba28be560584d2a58530f21/coroutine/arm64/Context.h#L76

__sanitizer_finish_start_fiber duly sets the current threadlocal fake stack to NULL - https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#163 - and that isn't undone by __sanitizer_finish_switch_fiber on coroutine start because we pass NULL and so don't hit this block: https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#176

So immediately after fiber creation, the thread doesn't yet have a fake stack.

TL;DR of this: You're right - we do pass NULL to the first invocation of __sanitizer_finish_switch_fiber in every fiber.

HOWEVER, as part of dispatching the first ruby code onto the new fiber, we call asan_get_thread_fake_stack_handle(), via this stacktrace:

(gdb) bt
#0  0x00005555556d2340 in __asan::SetTLSFakeStack(__asan::FakeStack*) ()
#1  0x000055555576bfcc in __asan::AsanThread::AsyncSignalSafeLazyInitFakeStack() ()
#2  0x0000555555955b22 in asan_get_thread_fake_stack_handle () at ../internal/sanitizers.h:241
#3  ec_switch (th=0x7fffc7600000, fiber=0x7fffcca1ff3c, fiber@entry=0x516000199b80) at ../cont.c:800
#4  fiber_restore_thread (th=0x7fffc7600000, fiber=0x7fffcca1ff3c, fiber@entry=0x516000199b80) at ../cont.c:820
#5  0x0000555555955ccb in fiber_entry (from=<optimized out>, to=<optimized out>) at ../cont.c:848
#6  0x0000000000000000 in ?? ()

AsyncSignalSafeLazyInitFakeStack does what it says - if the thread's fake stack is NULL, it initializes a new one. So that is how the fake stack for a new fiber is created.

stsp commented 1 month ago

__sanitizer_finish_start_fiber duly sets the current threadlocal fake stack to NULL

You have a typo here, meant to say __sanitizer_start_switch_fiber.

TL;DR of this: You're right - we do pass NULL to the first invocation of __sanitizer_finish_switch_fiber in every fiber.

This wasn't my point, as I was talking about passing NULL to __sanitizer_start_switch_fiber, not to __sanitizer_finish_switch_fiber. While I note your clever trick about calling __asan_get_current_fake_stack() on a fiber start, this doesn't solve the reported problem. Please note 2 things:

fake-stack is still per-thread. Once you unmap and re-create in, you unmap and re-create the per-thread stack. It doesn't become per-fiber.
When detect_stack_use_after_return is enabled, you actually work on a per-thread fake-stack! It doesn't contain the return addresses, but it does contain your locals.

This means that once __sanitizer_start_switch_fiber(NULL, ...) returns, every access to locals will crash. The only thing you can do, is to immediately switch to another coroutine, without touching any locals. This is possible because the return addresses are located on a normal stack, not fake-stack, so you can still do the calls. But this works only when you switch to a new fiber, that quickly re-creates the fake stack with your __asan_get_current_fake_stack(). This scenario actually works, I checked it. What still doesn't work is the switch to non-new coroutine. Non-new coroutine doesn't have the chance to apply your trick and re-create the fake-stack. And even if it could, that still won't help because it already has the locals on an unmapped fake-stack.

So to summarize. Your trick allowed me to get a bit further, if I:

don't touch any locals after __sanitizer_start_switch_fiber(NULL, ...) and quickly jump to another fiber instead.
that "another fiber" is a new fiber that doesn't yet have any locals (because its new, it doesn't have anything), so it can re-create the fake stack.

Obviously this is not an acceptable solution.

KJTsanaktsidis commented 1 month ago

fake-stack is still per-thread. Once you unmap and re-create in, you unmap and re-create the per-thread stack. It doesn't become per-fiber.

this is not correct. A new fiber starts with a NULL fake stack, and gains a fake stack lazily. When you switch away from the fiber, a pointer to its fake stack is saved in the first parameter to __sanitize_start_stack_switch.

Perhaps this is best summarised as: __sanitiser_start_stack_switch will either:

transfer ownership of the current fibers fake stack to the caller, with the first parameter being an out param to receive it, or
Deallovate the current fibers fake stack it NULL is passed in
but in either case, it will set the thread local fake stack pointer to NULL

When resuming a fiber with __sanitizer_finish_stack_switch

for the first time, pass NULL as the first argument; a fake stack will be lazily created
Subsequent times, using the pointer saved from the out-param of __sanitizer_start_stack_switch

stsp commented 1 month ago

OK, I applied your suggestion to my repo: https://github.com/dosemu2/dosemu2/commit/e65243720a92938f64180ed43df1a8e73dbcdf55 just to make sure we are on the same ground. Now up to the discussion.

A new fiber starts with a NULL fake stack, and gains a fake stack lazily.

Not quite, as you set NULL fake stack by calling __sanitizer_finish_stack_switch(NULL, ...); explicitly. So the beginning of a fiber still works on an old fake-stack, but that is not a problem, as with your technique it is never unmapped. I have realized that now, after fixing a problem in my code. :) Still, its not a reported problem, but you helped me fixing another one, thank you.

for the first time, pass NULL as the first argument; a fake stack will be lazily created

No, it is created only because you explicitly call __asan_get_current_fake_stack(), as you showed in your stack-trace. This is an undocumented trick. I added it to my repo so that we can continue this quest, but please note that undocumented tricks are normally not counted as a solutions.

Why this trick doesn't save the world, is because the stack is re-created too late. After the call to __sanitiser_start_stack_switch(NULL, ...); you are already out of your locals. You must be very careful to not touch any locals and quickly jump to another fiber. If you do so - everything works.

So 2 problems remain:

we have a completely undocumented "bug fixer" which is the need to explicitly call __asan_get_current_fake_stack(); after every stack switch.
even this bug-fixer doesn't allow us touching any locals between the call to __sanitizer_start_switch_fiber(NULL, ...); and swapcontext().

So let me say, your trick is very clever and it can even solve the problem fully, but I am not fond of either of 2 limitations above.

stsp commented 1 month ago

Note that I am not applying the "don't touch any locals between __sanitizer_start_switch_fiber(NULL, ...); and swapcontext()" trick to my repo, because I consider that a bit too much. So your trick being in my repo, currently doesn't help at all. But it allowed me to make sure things are actually working the way you describe, and locally I can produce a fully working setup now, which wasn't possible w/o your trick.

KJTsanaktsidis commented 1 month ago

your trick is very clever and it can even solve the problem fully

hm, you’re giving me too much credit here, I added this call for entirely unrelated reasons (the GC needs the info). Is a fake stack not lazily generated when calling into an instrumented function in a fiber for the first time too?

doesn't allow us touching any locals between the call to __sanitizer_start_switch_fiber(NULL, ...); and swapcontext().

this is a guess accidentally not a problem for us in Ruby because the bulk of our switching machinery (coroutine_transfer) is implemented in assembly. So we don’t actually have really any C code between the start and finish switches. Perhaps you should use an attribute of some kind on your stack switching code to disable Asan instrumentation on that function? (We should probably do that in CRuby too for the function that calls sanitizer_start_stack_switch too)

stsp commented 1 month ago

Is a fake stack not lazily generated when calling into an instrumented function in a fiber for the first time too?

OMG... Well, yes it does. :) Once you properly set the stack to NULL, it gets re-allocated automatically. I made this fix: https://github.com/dosemu2/dosemu2/commit/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa while playing with your __asan_get_current_fake_stack(); trick, but I didn't realize that this fix alone is enough, so I mistakenly applied also your trick. :( Force-pushed it away now, its not needed. So only the problem of touching locals stays...

So we don’t actually have really any C code between the start and finish switches.

No, this code: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1576-L1581 calls to sanitizer and then to swapcontext(). Its obviously that you use no locals in between. But add some, and you'll finally reproduce my bug report. :)

stsp commented 1 month ago

To understand why is this important, please see this code of mine: https://github.com/dosemu2/dosemu2/blob/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa/src/base/lib/libpcl/pcl.c#L72 This exact line crashes. I later use the fake_stack_save var here: https://github.com/dosemu2/dosemu2/blob/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa/src/base/lib/libpcl/pcl.c#L82 While I can get the fake-stack-ptr from the context structure, which would avoid the problem, I follow the asan docs precisely: https://chromium.googlesource.com/chromiumos/third_party/compiler-rt/+/google/stable/include/sanitizer/common_interface_defs.h#184

In most cases, this void* can be stored on the stack just before
switching.

So asan doc itself suggests storing that into a local, rather than via other means. And that suggestion breaks.

stsp commented 1 month ago

OK, actually I do realize that even more precise reading would reveal that in case of a NULL, nothing is actually stored in a local... I just use the NULL which I stored there manually. So in that case no one suggests me to use it. So I can't even blame the asan's suggestion here...

stsp commented 1 month ago

Do you think this should be re-classified to a documentation problem, so that the doc just say to not touch any locals after __sanitizer_start_switch_fiber(NULL, ...);? IMHO its quite a bad thing to document though.

KJTsanaktsidis commented 1 month ago

No, this code: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1576-L1581 calls to sanitizer and then to swapcontext(). Its obviously that you use no locals in between. But add some, and you'll finally reproduce my bug report. :)

Yes, I fully agree - this is only working in Ruby at the moment by accident.

I suppose we should mark fiber_setcontext and fiber_entry in Ruby with __attribute__((no_sanitize("address"))). That I think would be the more correct thing to do - can't have ASAN enabled in the code which deals with switching the ASAN fake stacks around.

Do you think this should be re-classified to a documentation problem

Yes - I would like to contribute some documentation improvements to ASAN & Fibers but I actually don't know how. It's a github wiki and I don' t know how to propose changes. I think the docs should...

Explain __sanitizer_start_switch_fiber and __sanitizer_finish_switch_fiber in terms of "either handing off ownership of the fake stack, or destroying it" like I did in my last message - IMHO that's the clearest way to think about it.
Tell you to disable ASAN in the funtions which call these functions (and any other functions between them & swapcontext or your assembly swapping routine)
Explain where in code you will put these calls, not just where in time (i.e. call __sanitizer_start_switch_fiber before swapcontext, with a conditional to handle the dead case, one call to __sanitizer_finish_switch_fiber in your fiber entrypoint to handle the new fiber case, and one call to __sanitizer_finish_switch_fiber after swapcontext to handle the resuming-existing-fiber case).

I mostly had to figure this stuff out for Ruby by trial and error and looking at the LLVM sources.

KJTsanaktsidis commented 1 month ago

FWIW I had a look at your code in dosemu, it looks correct to me now, other than that I think you need __attribute__((no_sanitize("address")) on co_switch_context and maybe co_runner.

Thanks for opening this issue, it's clearly very under-documented and It was useful for me to have a closer look at how this all works in Ruby too :)

stsp commented 1 month ago

Thank you very much! __attribute__((no_sanitize("address"))) is indeed a very nice trick that I now applied. With that trick, this problem is certainly doc-only.

Another thing worth documenting, is the one from https://github.com/google/sanitizers/issues/1760#issuecomment-2141728709 where @ioquatix says how the stack for the main coroutine should be retrieved. I.e. that __sanitizer_finish_switch_fiber() should never have NULLs in the second and third args.

stsp commented 1 month ago

There are a few more things to add. I referenced 2 commits here which I did in a separate branch. One commit just removes the fake-stack ptr! After our discussion it appears redundant, because on a coroutine entry you are supposed to invalidate the stack by passing NULL to __sanitizer_finish_switch_fiber(). But I can just pass NULL explicitly, and no need for the saved ptr at all.

Another patch adds no-sanitize attribute to the coroutine entry func. This is because the fake-stack is re-initialized here. Yes, the old one is not unmapped, so we do not observe any crash. But still the asan's view of the fake stack mismatches to the reality, so to avoid any problems I suggest adding an attribute also there.

Could you please review these 2 observations?

ioquatix commented 1 month ago

For my own understanding:

Another patch adds no-sanitize attribute to the coroutine entry func.

Does it only apply to the function but not the coroutine co->func(co->data);?

But I can just pass NULL explicitly, and no need for the saved ptr at all.

Isn't this only when you want to free the internal fake stack?

stsp commented 1 month ago

I made another very important observation (now also referenced here), which is that @ioquatix suggestion to not have NULLs in the second and third args of __sanitizer_finish_switch_fiber() are only applied to the start of the coroutine, but not to the return of some coroutine. Because we never return from the main coroutine, but we can start some from the main one.

So it appears:

on coroutine entry you should call __sanitizer_finish_switch_fiber(NULL, &ptr, &len)
on coroutine return you should call __sanitizer_finish_switch_fiber(fake_save, NULL, NULL)

So basically they "complement" each other, and I'd rather add a separate function for starting a fiber.

If either (or all) of my observations are true, then this API should be documented entirely differently, making my observations explicit.

Does it only apply to the function but not the coroutine co->func(co->data);?

Yes, because co->func() is called later, on a new fake-stack already. Whereas co_runner() is on an old one. Its not unmapped only by a very deep magic in the @KJTsanaktsidis 's logic. :)

Isn't this only when you want to free the internal fake stack?

No, here I am talking about __sanitizer_finish_switch_fiber(), not __sanitizer_start_switch_fiber(). We call __sanitizer_finish_switch_fiber() on a fiber start, and there we MUST pass NULL as a first arg. Unlike __sanitizer_start_switch_fiber() it doesn't unmap the fake-stack, only invalidates it to allocate a new one later. We always need to invalidate and switch the stack on fiber entry, which leads to this "always-NULL" observation.

stsp commented 1 month ago

Because we never return from the main coroutine,

A vague explanation. We can return from main coroutine to some other if that "other" called the main coroutine at some point. But that can't happen before we called to that "other" coroutine from the main one. So to catch the stack of the main coroutine, it is enough to do that on "other" coroutine entry.

google / sanitizers

__sanitizer_start_switch_fiber() unmaps per-thread "fake" stack #1760