Open stsp opened 1 month ago
Things seem to work right as long as
__sanitizer_start_switch_fiber()
never
gets NULL as the first argument.
So maybe it should just be re-documented?
If you think it's mainly a documentation issue, why not submit a PR for the documentation?
That will break all existing users, and also will make the first arg completely redundant, as the stack then never changes within tid.
Also I wonder if it is actually correct.
Coroutine switch changes the current
stack position (AFAICS %r13
is used
to address fake stack), but I am not
sure if the fake stack allocator is
prepared for such switches. So thinking
about this a bit more, I really suspect
the documentation change is not the
way forward. If such documentation
change is done, then also disabling
detect_stack_use_after_return
for
fibers is needed.
But it might be better to allocate
per-fiber fake-stacks.
I wrote a note here, maybe it will help you: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L830-L839
Yes, that helped me getting rid of
the WARNING: ASan is ignoring requested __asan_handle_no_return
,
just as you wrote in your note.
Unfortunately this has nothing to
do with the reported problem: unless
detect_stack_use_after_return
is disabled,
you can't pass NULL as the first argument
of __sanitizer_start_switch_fiber()
.
And your suggestion is about not passing
NULLs to second and third args, not first.
Many thanks for the help though, that
problem was also real, plus I've found
bugs in my own code while integrating
your solution.
of __sanitizer_start_switch_fiber(). And your suggestion is about not passing NULLs to second and third args
... of __sanitizer_finish_switch_fiber()
,
not __sanitizer_start_switch_fiber()
.
In fact, you seem to have the potential NULL call here: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1577 Could you please check 2 things:
old_fiber->context.fake_stack
to non-NULL?I can take a look.
I'll try to confirm this with real data from CRuby, but my understanding is as follows:
this is actually called with NULL
Yes, at the end of the life of the fiber, we call it with NULL
to indicate that the fake stack can be freed.
when it isn't called with NULL, it sets
old_fiber->context.fake_stack
to non-NULL
?
Yes, that seems correct, it should save the current fake stack, but I'll confirm it by logging the output.
OK so I think the way this works is...
__sanitizer_start_switch_fiber
unmaps the fake stack in AsanThread::fake_stack_
if NULL is passed as the first parameter__sanitizer_finish_switch_fiber
changes the fake stack associated with AsanThread::fake_stack_
.So the fake stack that is unmapped in __sanitizer_start_switch_fiber
is replaced in __sanitizer_finish_switch_finish
.
So what you're supposed to do when switching fibers:
__sanitizer_start_switch_fiber(NULL, to_fiber->stack_base, new_fiber->stack_size)
__sanitizer_start_switch_fiber(&from_fiber->fake_stack_ptr, to_fiber->stack_base, to_fiber->stack_size)
. That will save the current fake stack in &from_fiber->fiber_stack_ptr
. It will also set the stack bounds for the switched-to fiber to stack_base
/stack_size
. %rsp
/%rip
,__sanitizer__finish_switch_fiber(to_fiber->fake_stack_ptr, &from_fiber->stack_base, &from_fiber->stack_size)
. This will switch AsanFiber::fake_stack_
to the to-fiber's stack pointer, and retrieves the current stack base/size in &from_fiber
.
%rsp
/%rip
restore code (every subsequent time a fiber is switche dto). &from_fiber->stack_base
more than once for a fiber (it won't change), so it's OK to pass NULL, NULL
in the second place.I'm far from an expert on this but this is what the Ruby code is doing and I think it's correct.
__sanitizer_finish_switch_fiber changes the fake stack associated with AsanThread::fakestack.
Changes to what? The problem is that fake_stack is per-thread. So once unmapped, it gets changed to the same unmapped stack. https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#166 This is the libasan code, its very easy to follow. You can see where the per-thread stack is destroyed, and that its not re-allocated by the fiber-switching routines.
I'm far from an expert on this but this is what the Ruby code is doing and I think it's correct.
Its correct by matching the libasan documentation. But libasan itself doesn't match its own documentation.
Yes, that seems correct, it should save the current fake stack, but I'll confirm it by logging the output.
Thanks! If you can confirm it is actually non-NULL (which would already be strange), please also confirm that every coroutine sets it to the different value. This will mean every coroutine has its own fake-stack, just as documented. Which will mean some other version of libasan, not the one I am pointing to the sources of.
Ahh, i think i see where the confusion is - the allocation of fake stacks in ASAN is lazy.
You can see where the per-thread stack is destroyed, and that its not re-allocated by the fiber-switching routines
~Not quite - the current_fake_stack
being destroyed there is not a threadlocal variable or anything like that - it's the member variable AsanThread::_fake_stack
.~ I mean it was threadlocal but the destruction is conditional on it being NULL, as you say.
When we start a new fiber in Ruby, we actually call __sanitizer_finish_switch_fiber(NULL, stack_base, stack_len)
(https://github.com/ruby/ruby/blob/fc495951b128a4256ba28be560584d2a58530f21/cont.c#L839). to->fake_stack
is actually always set to NULL on fiber create here: https://github.com/ruby/ruby/blob/fc495951b128a4256ba28be560584d2a58530f21/coroutine/arm64/Context.h#L76
__sanitizer_finish_start_fiber
duly sets the current threadlocal fake stack to NULL - https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#163 - and that isn't undone by __sanitizer_finish_switch_fiber
on coroutine start because we pass NULL and so don't hit this block: https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#176
So immediately after fiber creation, the thread doesn't yet have a fake stack.
TL;DR of this: You're right - we do pass NULL to the first invocation of __sanitizer_finish_switch_fiber
in every fiber.
HOWEVER, as part of dispatching the first ruby code onto the new fiber, we call asan_get_thread_fake_stack_handle()
, via this stacktrace:
(gdb) bt
#0 0x00005555556d2340 in __asan::SetTLSFakeStack(__asan::FakeStack*) ()
#1 0x000055555576bfcc in __asan::AsanThread::AsyncSignalSafeLazyInitFakeStack() ()
#2 0x0000555555955b22 in asan_get_thread_fake_stack_handle () at ../internal/sanitizers.h:241
#3 ec_switch (th=0x7fffc7600000, fiber=0x7fffcca1ff3c, fiber@entry=0x516000199b80) at ../cont.c:800
#4 fiber_restore_thread (th=0x7fffc7600000, fiber=0x7fffcca1ff3c, fiber@entry=0x516000199b80) at ../cont.c:820
#5 0x0000555555955ccb in fiber_entry (from=<optimized out>, to=<optimized out>) at ../cont.c:848
#6 0x0000000000000000 in ?? ()
AsyncSignalSafeLazyInitFakeStack
does what it says - if the thread's fake stack is NULL, it initializes a new one. So that is how the fake stack for a new fiber is created.
__sanitizer_finish_start_fiber duly sets the current threadlocal fake stack to NULL
You have a typo here, meant to say
__sanitizer_start_switch_fiber
.
TL;DR of this: You're right - we do pass NULL to the first invocation of __sanitizer_finish_switch_fiber in every fiber.
This wasn't my point, as I was talking about
passing NULL to __sanitizer_start_switch_fiber
,
not to __sanitizer_finish_switch_fiber
.
While I note your clever trick about calling
__asan_get_current_fake_stack()
on a fiber
start, this doesn't solve the reported problem.
Please note 2 things:
detect_stack_use_after_return
is enabled,
you actually work on a per-thread fake-stack!
It doesn't contain the return addresses, but it
does contain your locals.This means that once __sanitizer_start_switch_fiber(NULL, ...)
returns, every access to locals will crash. The only
thing you can do, is to immediately switch to
another coroutine, without touching any locals.
This is possible because the return addresses
are located on a normal stack, not fake-stack,
so you can still do the calls.
But this works only when you switch to a new
fiber, that quickly re-creates the fake stack with
your __asan_get_current_fake_stack()
. This
scenario actually works, I checked it. What still
doesn't work is the switch to non-new coroutine.
Non-new coroutine doesn't have the chance to
apply your trick and re-create the fake-stack.
And even if it could, that still won't help because
it already has the locals on an unmapped
fake-stack.
So to summarize. Your trick allowed me to get a bit further, if I:
__sanitizer_start_switch_fiber(NULL, ...)
and quickly jump to another fiber instead.Obviously this is not an acceptable solution.
fake-stack is still per-thread. Once you unmap and re-create in, you unmap and re-create the per-thread stack. It doesn't become per-fiber.
this is not correct. A new fiber starts with a NULL fake stack, and gains a fake stack lazily. When you switch away from the fiber, a pointer to its fake stack is saved in the first parameter to __sanitize_start_stack_switch
.
Perhaps this is best summarised as: __sanitiser_start_stack_switch
will either:
When resuming a fiber with __sanitizer_finish_stack_switch
__sanitizer_start_stack_switch
OK, I applied your suggestion to my repo: https://github.com/dosemu2/dosemu2/commit/e65243720a92938f64180ed43df1a8e73dbcdf55 just to make sure we are on the same ground. Now up to the discussion.
A new fiber starts with a NULL fake stack, and gains a fake stack lazily.
Not quite, as you set NULL fake stack by
calling __sanitizer_finish_stack_switch(NULL, ...);
explicitly.
So the beginning of a fiber still works on
an old fake-stack, but that is not a problem,
as with your technique it is never unmapped.
I have realized that now, after fixing a problem
in my code. :)
Still, its not a reported problem, but you
helped me fixing another one, thank you.
for the first time, pass NULL as the first argument; a fake stack will be lazily created
No, it is created only because you
explicitly call __asan_get_current_fake_stack()
,
as you showed in your stack-trace.
This is an undocumented trick.
I added it to my repo so that we can continue
this quest, but please note that undocumented
tricks are normally not counted as a solutions.
Why this trick doesn't save the world, is because
the stack is re-created too late. After the call
to __sanitiser_start_stack_switch(NULL, ...);
you are already out of your locals. You must
be very careful to not touch any locals and
quickly jump to another fiber.
If you do so - everything works.
So 2 problems remain:
__asan_get_current_fake_stack();
after every
stack switch.__sanitizer_start_switch_fiber(NULL, ...);
and swapcontext()
.So let me say, your trick is very clever and it can even solve the problem fully, but I am not fond of either of 2 limitations above.
Note that I am not applying the
"don't touch any locals between
__sanitizer_start_switch_fiber(NULL, ...);
and swapcontext()
" trick to my repo,
because I consider that a bit too much.
So your trick being in my repo, currently
doesn't help at all. But it allowed me to
make sure things are actually working the
way you describe, and locally I can produce
a fully working setup now, which wasn't
possible w/o your trick.
your trick is very clever and it can even solve the problem fully
hm, you’re giving me too much credit here, I added this call for entirely unrelated reasons (the GC needs the info). Is a fake stack not lazily generated when calling into an instrumented function in a fiber for the first time too?
doesn't allow us touching any locals between the call to __sanitizer_start_switch_fiber(NULL, ...); and swapcontext().
this is a guess accidentally not a problem for us in Ruby because the bulk of our switching machinery (coroutine_transfer) is implemented in assembly. So we don’t actually have really any C code between the start and finish switches. Perhaps you should use an attribute of some kind on your stack switching code to disable Asan instrumentation on that function? (We should probably do that in CRuby too for the function that calls sanitizer_start_stack_switch too)
Is a fake stack not lazily generated when calling into an instrumented function in a fiber for the first time too?
OMG...
Well, yes it does. :)
Once you properly set the stack to NULL,
it gets re-allocated automatically. I made
this fix:
https://github.com/dosemu2/dosemu2/commit/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa
while playing with your __asan_get_current_fake_stack();
trick, but I didn't realize that this fix alone
is enough, so I mistakenly applied also your
trick. :( Force-pushed it away now, its not needed.
So only the problem of touching locals stays...
So we don’t actually have really any C code between the start and finish switches.
No, this code: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1576-L1581 calls to sanitizer and then to swapcontext(). Its obviously that you use no locals in between. But add some, and you'll finally reproduce my bug report. :)
To understand why is this important,
please see this code of mine:
https://github.com/dosemu2/dosemu2/blob/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa/src/base/lib/libpcl/pcl.c#L72
This exact line crashes.
I later use the fake_stack_save
var here:
https://github.com/dosemu2/dosemu2/blob/03b92988d9a5a798c6e98e7e5e7b0ddbb673b6fa/src/base/lib/libpcl/pcl.c#L82
While I can get the fake-stack-ptr from
the context structure, which would avoid
the problem, I follow the asan docs precisely:
https://chromium.googlesource.com/chromiumos/third_party/compiler-rt/+/google/stable/include/sanitizer/common_interface_defs.h#184
In most cases, this void* can be stored on the stack just before
switching.
So asan doc itself suggests storing that into a local, rather than via other means. And that suggestion breaks.
OK, actually I do realize that even more precise reading would reveal that in case of a NULL, nothing is actually stored in a local... I just use the NULL which I stored there manually. So in that case no one suggests me to use it. So I can't even blame the asan's suggestion here...
Do you think this should be re-classified
to a documentation problem, so that the
doc just say to not touch any locals after
__sanitizer_start_switch_fiber(NULL, ...);
?
IMHO its quite a bad thing to document though.
No, this code: https://github.com/ruby/ruby/blob/a15e4d405ba6cafbe2f63921bd771b1241049841/cont.c#L1576-L1581 calls to sanitizer and then to swapcontext(). Its obviously that you use no locals in between. But add some, and you'll finally reproduce my bug report. :)
Yes, I fully agree - this is only working in Ruby at the moment by accident.
I suppose we should mark fiber_setcontext
and fiber_entry
in Ruby with __attribute__((no_sanitize("address")))
. That I think would be the more correct thing to do - can't have ASAN enabled in the code which deals with switching the ASAN fake stacks around.
Do you think this should be re-classified to a documentation problem
Yes - I would like to contribute some documentation improvements to ASAN & Fibers but I actually don't know how. It's a github wiki and I don' t know how to propose changes. I think the docs should...
__sanitizer_start_switch_fiber
and __sanitizer_finish_switch_fiber
in terms of "either handing off ownership of the fake stack, or destroying it" like I did in my last message - IMHO that's the clearest way to think about it.__sanitizer_start_switch_fiber
before swapcontext, with a conditional to handle the dead case, one call to __sanitizer_finish_switch_fiber
in your fiber entrypoint to handle the new fiber case, and one call to __sanitizer_finish_switch_fiber
after swapcontext to handle the resuming-existing-fiber case).I mostly had to figure this stuff out for Ruby by trial and error and looking at the LLVM sources.
FWIW I had a look at your code in dosemu, it looks correct to me now, other than that I think you need __attribute__((no_sanitize("address"))
on co_switch_context
and maybe co_runner
.
Thanks for opening this issue, it's clearly very under-documented and It was useful for me to have a closer look at how this all works in Ruby too :)
Thank you very much!
__attribute__((no_sanitize("address")))
is indeed a very nice trick that I now applied.
With that trick, this problem is certainly
doc-only.
Another thing worth documenting, is the
one from https://github.com/google/sanitizers/issues/1760#issuecomment-2141728709
where @ioquatix says how the stack for
the main coroutine should be retrieved.
I.e. that __sanitizer_finish_switch_fiber()
should never have NULLs in the second
and third args.
There are a few more things to add.
I referenced 2 commits here which I
did in a separate branch.
One commit just removes the fake-stack
ptr! After our discussion it appears
redundant, because on a coroutine entry
you are supposed to invalidate the stack
by passing NULL to __sanitizer_finish_switch_fiber()
.
But I can just pass NULL explicitly, and no
need for the saved ptr at all.
Another patch adds no-sanitize attribute to the coroutine entry func. This is because the fake-stack is re-initialized here. Yes, the old one is not unmapped, so we do not observe any crash. But still the asan's view of the fake stack mismatches to the reality, so to avoid any problems I suggest adding an attribute also there.
Could you please review these 2 observations?
For my own understanding:
Another patch adds no-sanitize attribute to the coroutine entry func.
Does it only apply to the function but not the coroutine co->func(co->data);
?
But I can just pass NULL explicitly, and no need for the saved ptr at all.
Isn't this only when you want to free the internal fake stack?
I made another very important observation
(now also referenced here), which is that
@ioquatix suggestion to not have NULLs in
the second and third args of
__sanitizer_finish_switch_fiber()
are only
applied to the start of the coroutine, but
not to the return of some coroutine. Because
we never return from the main coroutine,
but we can start some from the main one.
So it appears:
__sanitizer_finish_switch_fiber(NULL, &ptr, &len)
__sanitizer_finish_switch_fiber(fake_save, NULL, NULL)
So basically they "complement" each other, and I'd rather add a separate function for starting a fiber.
If either (or all) of my observations are true, then this API should be documented entirely differently, making my observations explicit.
Does it only apply to the function but not the coroutine co->func(co->data);?
Yes, because co->func() is called later, on a
new fake-stack already. Whereas co_runner()
is on an old one. Its not unmapped only by
a very deep magic in the @KJTsanaktsidis 's logic. :)
Isn't this only when you want to free the internal fake stack?
No, here I am talking about
__sanitizer_finish_switch_fiber()
, not
__sanitizer_start_switch_fiber()
.
We call __sanitizer_finish_switch_fiber()
on a fiber start, and there we MUST pass
NULL as a first arg. Unlike
__sanitizer_start_switch_fiber()
it
doesn't unmap the fake-stack, only invalidates
it to allocate a new one later. We always need
to invalidate and switch the stack on fiber
entry, which leads to this "always-NULL"
observation.
Because we never return from the main coroutine,
A vague explanation. We can return from main coroutine to some other if that "other" called the main coroutine at some point. But that can't happen before we called to that "other" coroutine from the main one. So to catch the stack of the main coroutine, it is enough to do that on "other" coroutine entry.
It seems like these days detect_stack_use_after_return breaks fiber switching. When detect_stack_use_after_return is enabled, asan malloc's the "fake" stack and puts the locals there together with redzones. That process is (partially) documented here: https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn That very same stack ptr is put into the first argument of __sanitizer_start_switch_fiber(). If detect_stack_use_after_return is disabled, then the "fake stack" machinery is not used, so __sanitizer_start_switch_fiber() always puts NULL into its first arg.
Now the problem is, the fake-stack is per-thread, not per-fiber. When some fiber exits, we put NULL into the first arg of __sanitizer_start_switch_fiber(), and that unmaps the entire per-thread fake-stack: https://gnu.googlesource.com/gcc/+/refs/heads/trunk/libsanitizer/asan/asan_thread.cpp#166 Which, as noted above, contains current locals and redzones. So all crashes.
Probably __sanitizer_start_switch_fiber() should allocate and free its own fake stacks, and not touch the per-thread one?