ivmai / bdwgc

The Boehm-Demers-Weiser conservative C/C++ Garbage Collector (bdwgc, also known as bdw-gc, boehm-gc, libgc)
https://www.hboehm.info/gc/
Other
2.98k stars 407 forks source link

New kernel API proposal (NetBSD) #207

Open krytarowski opened 6 years ago

krytarowski commented 6 years ago

I'm researching an addition of a new kernel API in NetBSD. My motivation is to support the StopTheWorld() operation that is used in few types of G/C (including bdwgc) and in LLVM LSan.

pinspect(2) - process self-introspection

int
pinspect(int request, void *addr, int data);

Operations:

This interface resembles a little bit stripped down ptrace(2) for self-debugging.

Does this make sense? The current implementation of StopTheWorld() in LSan for Linux is awful: vfork(2)/clone(2) + ptrace(2) + editing shared data.

I would like to reuse the same interface for bdwgc as well. I expect that the kernel API will be immune to threading issues (as long as we have a correct kernel), independent from pthread(3), invisible to the world (like ps(1), top(1)), not-conflicting with debuggers (ptrace(2)), simple to use.. and hopefully quick for multiple threads.

I intend to craft something that will be maximally useful and if possible open for other use-cases (suggestions?) of process self-introspecting.

https://github.com/llvm-mirror/compiler-rt/blob/8d93d94612443a03d2e211d944e3097f66ba75d5/lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cc

ivmai commented 6 years ago

+ @wangp @cpu82 @hboehm @magv @Constellation

krytarowski commented 6 years ago

Ping? I think I will add a ptrace wrapper for C and C++ programs.

krytarowski commented 5 years ago

@thorpej ^

thorpej commented 5 years ago

API looks completely reasonable. I would definitely suggest using the standard context structure, and would also suggest adding a size_t length argument.

thorpej commented 5 years ago

Wait, what's the distinction between "request" and "data"? what is the "data" argument for?

thorpej commented 5 years ago

So, the implementation of this in the kernel should be fairly straightforward... 1- Lock the processes's list of LWPs, Mark the process as inspect-pending, and the inspector as curlwp. 2- Send a "pinspect" IPI to all CPUs that have one of the target process's LWPs running. 3- IPI simply returns, and the LWPs all pause in userret() using a kcondvar. 4- When inspector resumes them, it clears the "inspector" field and wakes everyone up.

That's kind of hand-wavy, but the basic idea should work.

krytarowski commented 5 years ago

API looks completely reasonable. I would definitely suggest using the standard context structure, and would also suggest adding a size_t length argument.

ucontext_t has a shortcoming that it does not contain FPU registers (at least on x86) does not contain all FPU registers (at least on 64-bit x86 with XSAVE..).

But maybe it's not needed? Looking at LLVM it's not used. Also in libgc it's not used. So we should be safe.

Wait, what's the distinction between "request" and "data"? what is the "data" argument for?

It's the same as in ptrace(2).

request = operation like enable/disable/getregs data = either unused or auxiliary like LWP ID for getcontext (similar to getregs for ptrace(2))

I can add size_t but on the other hand we have no issues with lack of it in ptrace(2). It's also missing in getcontext(2). So I would skip it. The struct does not change its size on upgrades on the same NetBSD port.


What do you think about this API in pthread(3) for end-user:

int pthread_stop_world_np(void);
int pthread_start_world_np(void);
int pthread_getcontext_np(ucontext_t *ucp);

CC: @ivmai Do you think that the above pthread(3) extensions would integrate nicely into bdwgc? Are there any other missing operations or operations that could be done better/differently?

As far as I can read in the documentation it should do the trick.

The essential functionality consists of:

GC_stop_world - Stops all threads which may access the garbage collected heap, other than the caller; GC_start_world - Restart other threads; GC_push_all_stacks - Push the contents of all thread stacks (or, at least, of pointer-containing regions in the thread stacks) onto the mark stack.

-- https://github.com/ivmai/bdwgc/blob/7c063c84039c95bd065a35732311646c69c3026e/doc/porting.md

So a potential implementation would look like:

GC_INNER void GC_stop_world(void)
{
pthread_stop_world_np();
}

GC_INNER void GC_start_world(void)
{
pthread_start_world_np();
}

GC_INNER void GC_push_all_stacks(void)
{
/*
It looks like we need to push data region between `stack_base`
and `stack ptr`.

1. for stack_base either pick from ucontext->uc_stack.ss_sp,
store the value on a thread's creation or grab it dynamically with
pthread_attr_get_np(3) and pthread_attr_getstackaddr(3).

2. for stack_ptr call pthread_getcontext_np for each pinspected
thread and read SP register (with `_UC_MACHINE_SP`;
we might want to do some redzone gymnastics on the kernel-level
when needed)

3. transfer stack_base - stack_ptr to destination

4. repeat for each thread

5. current thread/main thread handle individually

 */
}

In general the current implementations are complex with many per-OS/CPU workarounds.

With the proposed API we won't use any signals in the implementation and thus making it much nicer for using under a debugger. Something that also matters is a potential performance win.


LLVM LSan

According to my understanding the only missing API feature is the list of suspended threads.

int pthread_start_world_np(void); -> int pthread_start_world_np(pthread_t *threads, size_t *num);

Whenever != NULL, allocate and return a buffer with a pthread_t array with num elements.

Probably LLVM could use low-level API pinspect for this purpose as pthread(3) level is not necessary.

Maybe @dvyukov might have some insight? I'm looking for feedback!

Anyway regardless of LLVM, getting API to list the threads and their number is useful. Performing such operation out of pinspect(2) context would be too racy to implement in pthread. Linux uses pthread_kill_other_threads_np for such things and it looks like a gross hack.

Another interesting API that would be sometimes useful is lwpid_t pthead_getthreadid_np(pthread_t) (but with a distinguishing name as FreeBSD has a clash here for a little bit different prototype - maybe pthread_get_threadid_np?). At times we want to run sysctl(3) against pthread_t to read some properties of a thread.

@thorpej what do you think?


So, the implementation of this in the kernel should be fairly straightforward... 1- Lock the processes's list of LWPs, Mark the process as inspect-pending, and the inspector as curlwp. 2- Send a "pinspect" IPI to all CPUs that have one of the target process's LWPs running. 3- IPI simply returns, and the LWPs all pause in userret() using a kcondvar. 4- When inspector resumes them, it clears the "inspector" field and wakes everyone up.

That's kind of hand-wavy, but the basic idea should work.

I will give it a try! Originally I was thinking about changing some scheduler code.. but blocking in userret() is probably better/simpler.

ivmai commented 5 years ago

Do you think that the above pthread(3) extensions would integrate nicely into bdwgc?

Seems yes. This is even simpler than GC_OPENBSD_UTHREADS case.

repeat for each thread

We should repeat only for each registered thread (the one registered by GC_register_my_thread_inner).

dvyukov commented 5 years ago

Yes, I can confirm StopTheWorld for linux is awful :) Compiler-rt can definitely use raw syscalls. We even prefer that because we intercept pthread, and we generally don't want recursion.

int pthread_start_world_np(void); -> int pthread_start_world_np(pthread_t threads, size_t num);

You mean pthread_stop_world_np, right? Or it does not make sense to me.

Lsan and GC's will want the list, so there should be a way to obtain it. It does not have to be part of this API, if it's already available elsewhere. But we need to make sure that that other API works with pinspect (e.g. doesn't deadlock).

An interesting question is if we want/can support multiple thread doing this at the same time (e.g. bdwgc verified by tsan, don't tell me it's not checked with tsan). I guess it can be resolved by suspending threads on LWP list lock? For linux it would mean that the lock should be a semaphore rather than a low-level spinlock.

Another interesting possibility is modifying the contexts :) Does not have to be in version 1. But it opens some very interesting possibilities.

This intentionally does not support suspending another process? For these cases users are supposed to use ptrace?

Is there anything else in particular you want my feedback on?

krytarowski commented 5 years ago

Do you think that the above pthread(3) extensions would integrate nicely into bdwgc?

Seems yes. This is even simpler than GC_OPENBSD_UTHREADS case.

Perfect!

repeat for each thread

We should repeat only for each registered thread (the one registered by GC_register_my_thread_inner).

This is right.

Yes, I can confirm StopTheWorld for linux is awful :) Compiler-rt can definitely use raw syscalls. We even prefer that because we intercept pthread, and we generally don't want recursion.

The mechanism with all the callbacks and iterators is pretty complicated. I'm trying to figure out if the syscall API is complete. Just making sure. I will go for raw syscalls in the LSan case.

int pthread_start_world_np(void); -> int pthread_start_world_np(pthread_t threads, size_t num);

You mean pthread_stop_world_np, right? Or it does not make sense to me.

Right. 'stop_world' is for starting the Process self INtro SPECtion operation.

Lsan and GC's will want the list, so there should be a way to obtain it. It does not have to be part of this API, if it's already available elsewhere.

It looks like GC keeps track of its own list and LSan (or sanitizers in general) does not need to operate on the pthread(3) level so I can pick native (LWP) threads with sysctl(3).

I think listing and counting pthread(3) threads is in general useful operation, sometimes we would perform this operation from a shared library and we don't control all the threads in a program.

But we need to make sure that that other API works with pinspect (e.g. doesn't deadlock).

There are no other APIs for it on NetBSD. There used to be libpthread_dbg(3) with this property, but it was designed for Solaris-style threads scheduled in userspace (M:N threading model). I've removed it few years back when I started to work on ptrace(2). Today we support 1:1 threading model and can prompt the kernel for most properties of a thread (like signal masking).

I think that the only interesting properties of libpthread_dbg(3) today are: listing threads with a callback mechanism (we can get more convenient array with pthread_stop_world_np()) and associating pthread_t (POSIX thread entity) with lwpid_t (native thread).

An interesting question is if we want/can support multiple thread doing this at the same time (e.g. bdwgc verified by tsan, don't tell me it's not checked with tsan). I guess it can be resolved by suspending threads on LWP list lock? For linux it would mean that the lock should be a semaphore rather than a low-level spinlock.

This will be handled in the kernel. All threads (LWP) in the kernel will access a bit field private to a process and use appropriate synchronization mechanism. This will ensure that no two LWPs are allowed to perform pinspect(2) in the same time, always one of them will win and the other will wait for its completion.

Another interesting possibility is modifying the contexts :) Does not have to be in version 1. But it opens some very interesting possibilities.

Please elaborate. Do you mean to implement PI_SETCONTEXT (ucontext)? What would be the use-case for this? I think that it could be used for implementing coroutines.. but in general they don't work nicely in C with POSIX interfaces. I'm not sure if it would be useful for other languages like golang or erlang.

Implementing PI_SETCONTEXT in one go is be pretty simple.

In general I'm searching for real-life use-cases where to put the new API now in existing software. Is there a room for it in sanitizers? I would love to know that there it is, so I can add it.

This intentionally does not support suspending another process? For these cases users are supposed to use ptrace?

In general yes. This is only for the current process.

Frida.re is a use-case for calling StopTheWorld for a remote process... however they also need to be able to remotely:

https://github.com/frida/frida/issues/429

This means that probably the best way to achieve this is to ptrace(2), rewrite .text segment of the target process and if needed call pinspect(2) as a part of injected code.

Is there anything else in particular you want my feedback on?

I'm drafting a kernel + libpthread implementation. Once I will get it functional and reviewed by @thorpej, I will write proof-of-concept implementation in bdwgc and LSan.

Thank you for your feedback!

dvyukov commented 5 years ago

Please elaborate. Do you mean to implement PI_SETCONTEXT (ucontext)?

Yes. But I don't have anything concrete in mind, it was more stimulate others to think :) So let's ignore this for now. I was thinking about cases like code hotpatching where one may want to divert threads from a particular section of code. Or parking subset of threads for a longer period, but otherwise resuming process operation; i.e. patching PC for subset of threads and then doing starting the world. Or doing other weird things for maybe user-space rcu or accessing TLS of other threads. Go runtime currently does everything synchronously (via compiler-emitted code) and does not resort to signals/ptrace and similar things. Tight loops are still not preemptable in Go (https://github.com/golang/go/issues/10958), 2 alternatives were considered: emitting checks on all back edges or using signals. Explicit checks in loops turn out to be quite expensive, so that can shift the choice to signals. But that's only for 1 thread. But then if this facility is in place (stack/register maps for every instruction), then maybe it will make sense to use it (immediate async preemption) for GC as well. I don't know, I did not follow the discussion recently. If you want more concrete answer, ask on the issue or on golang-dev@.

krytarowski commented 5 years ago

Yes. But I don't have anything concrete in mind, it was more stimulate others to think :) So let's ignore this for now.

It looks interesting but changing state of other threads is rather risky and I can envision some pitfalls. Probably I would need to enforce all LWPs to wait in userret() so they won't change their register frame context anymore, but some syscalls are noninterruptably blocking.. so it might lock up.

I need real life user first for this optimization or simplification. Coroutine-like code doesn't match the POSIX APIs in C, and this isn't trivial in non-C runtimes either (rust gave up and abandoned it).

For golang affairs I'm not that familiar with its runtime implementation, but I would prefer with our current resources to not diverge from other upstream platforms. Unless we will get someone dedicated to hack it, we mostly keep it alive for NetBSD. However potential issues in implementation complexity stand I think. I would prefer to keep setcontext(2) for current thread only.

dvyukov commented 5 years ago

Ack. Then never mind.

krytarowski commented 5 years ago

Ack. Then never mind.

Thank you for the input! Such crazy things with live patching, aot, jit etc are easiest with bytecode runtimes. (I know about c++20 plans to include them... but well we can defer discussing it for some time.)

So the proof that pinspect(2) works is to get libgc code example running with compiler instrumentation and runtime for LSan.

@thorpej also mentioned cases of robust features in pthread(3) to take into account, I will wait for his comments on it and keep working on the currently designed API.

krytarowski commented 5 years ago

@thorpej Today I have got a workstation outage and after attempts to repair it myself I had to finally go with it to a repair service, hopefully it will be back tomorrow.

I have got a snapshot of my work in git branch from yesterday. It contains a kernel implementation for pinspect(2).

https://github.com/krytarowski/src/blob/pinspect/sys/kern/sys_pinspect_common.c#L65-L117

Basically I need assistance how to get pinspect_enable() to work correctly with the IPIs. My usage of kernel preemption and SPLs was speculative.

Please review and tell me how to change it.

Here is userret():

https://github.com/krytarowski/src/blob/pinspect/sys/sys/userret.h#L92-L96

This code is still only build tested and boot tested. My pthread(3) implementation was incomplete and I got some more insights on the implementation of pthread_stop_world_np().

So far my thoughts:

My implementation of pthread(3) didn't land GitHub and was still kept locally on the machine that is in outage, but my kernel initial version should be fine to push to proof reading. I was also wondering whether it would be better to take rwlock in pinspect(2) instead of using a kcondvar... I'm all ears.

krytarowski commented 5 years ago

It's too late for -9 and there are still other kernel issues detected with ptrace(2). I don't want to add an interface that might share fragility.

For now I will upstream ptrace(2)-based solution for LSan (that we have locally) and once done with debuggers implement pinspect(2).