golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.11k stars 17.68k forks source link

runtime: crash in hybrid barrier initializing C memory #19928

Open rsc opened 7 years ago

rsc commented 7 years ago

Received report via email of code crashing in write barrier finding a bad pointer (pointing to a free span). Code looks like:

keys := (*[1 << 20]*C.char)(unsafe.Pointer(C.malloc(C.size_t(uintptr(n)) * C.size_t(unsafe.Sizeof((*C.char)(nil))))))[:0:n]
values := (*[1 << 20]*C.char)(unsafe.Pointer(C.malloc(C.size_t(uintptr(n)) * C.size_t(unsafe.Sizeof((*C.char)(nil))))))[:0:n]
for key, vals := range req.Header {
    for _, value := range vals {
        keys = append(keys, C.CString(key))
        values = append(values, C.CString(value))
    }
}

The problem appears to be that C.malloc returns uninitialized memory. Then the initialization of that memory during append overwrites uninitialized pointer slots, and the new hybrid write barrier tries to follow the uninitialized pointers as if they were real pointers.

Prior to the hybrid barrier, the old data being overwritten for off-heap memory was never considered. Now it is. This will invalidate code like the above. It took a long time to notice, so apparently this is rare.

One workaround is to make C.malloc return zeroed memory, which would have a performance cost people might not like (especially since it doesn't seem to matter to most people?). Another workaround is to tell people to use C.calloc for allocating C data structures with pointers, or maybe rewriting just the C.malloc calls that mention pointer sizes.

Really the best fix would be for the GC not to follow these pointers. @aclements says there are some cases where do want the write barrier to follow pointers in non-managed-memory, specifically non-global off-heap runtime structures, but maybe we can distinguish these from other kinds of memory (like C memory) somehow.

If there is a simple fix and we receiver any reports of external problems along these lines, we can consider including the fix in Go 1.8.2.

rsc commented 7 years ago

Also, @ianlancetaylor, do you think this code should work as far as the Go pointer rules are concerned? There are no Go-managed data structures anywhere in sight, so it seems to me like they should.

rsc commented 7 years ago

/cc @RLH

aclements commented 7 years ago

Really the best fix would be for the GC not to follow these pointers.

One downside of this is that it will increase the cost of the write barrier, since we'll have to add a check for if the slot is in the heap or a global.

@aclements says there are some cases where do want the write barrier to follow pointers in non-managed-memory, specifically non-global off-heap runtime structures, but maybe we can distinguish these from other kinds of memory (like C memory) somehow.

I'm not sure off the top of my head how many of these there are. It's possible we could manually annotate the writes to have write barriers (with a write barrier that bypasses the slot check), though that seems error-prone, especially going forward. Or we could compile all write barriers in the runtime to the slot-check-bypassing write barrier, using the knowledge that the runtime never writes a pointer to an uninitialized structure.

ianlancetaylor commented 7 years ago

I agree that according to the current Go pointer rules this code is OK. Programs aren't permitted to store a Go pointer in C memory, and indeed that is not happening. Apparently the memory happens to contain a bit pattern that looks like a Go pointer, but it is not actually a Go pointer.

Changing C.malloc to return zeroed memory would fix this case but it wouldn't fix all cases of the general problem. It's currently permitted to call C code that allocates an array on the stack, fails to initialize it, and then calls a Go function with a pointer that the Go function uses to fill in elements. The Go function is not permitted to store Go pointers into the array, but it is permitted to store C pointers. That would invoke the write barrier on uninitialized memory from the C stack, which could potentially contain bit patterns that look like Go pointers.

We could extend the Go pointer rules to say that C memory that has a pointer type must be explicitly initialized if it is visible from Go. To make that simpler we could modify C.malloc to zero out memory, but that would only be necessary, and need only be done, when invoked with a pointer type. I don't see how to check that rule with cgocheck=1 but we could check it with cgocheck=2 (which does cgo checking at every write barrier).

aclements commented 7 years ago

To summarize in-person discussion: we're all concerned about increasing the write barrier cost for something that appears to almost never be a problem in practice. So, we're leaning toward tweaking the rules to say something like: if Go writes to pointers in the memory, it has to be initialized (this is already the case for pointer reads; this would extend it to pointer writes) and possibly extending the cgo interface to make the rules easy to stay in (perhaps providing C.new that takes a type and zeroes memory). But we'll wait to see if this comes up again before diving into a rule change.

bcmills commented 7 years ago

We could probably get at least some mileage out of whether the pointer has a C type or a Go type.

It seems reasonable to require that all memory accessed through Go types must be initialized (since Go values are normally zero-inititalized), but much less so for C types (since C out-parameters are often left uninitialized in idiomatic C code). With https://github.com/golang/go/issues/16623 it might be more plausible for the compiler to be aware of whether the type of a given slot is C or Go.

We could vet the C memory for Go pointers at the time of conversion to Go types (e.g. in a C.AsSlice builtin per https://github.com/golang/go/issues/13656#issuecomment-291947849) by pre-scanning all of the pointer slots to ensure that they are valid.

Then we could apply the more expensive write barrier only to pointers of C types, for which assignments are relatively rare (and often somewhat expensive anyway because they tend to correlate with transitions between Go and C stacks).

ianlancetaylor commented 7 years ago

@bcmills Perhaps, but I'm somewhat reluctant to further complicate the cgo pointer rules by adding rules that depend on the type used to access the memory, especially since when using cgo that type often varies between C code and Go code. For example, C code will often use void* where Go code will use unsafe.Pointer, so what is the correct handling when C code allocates an array of void* and calls a Go function that receives a *unsafe.Pointer and converts it to a []unsafe.Pointer? I'm not saying we can't write a rule but I'm concerned that we can't write a rule that people can actually understand and use.

bcmills commented 7 years ago

I think we would have to declare that unsafe.Pointer (by itself or with any number of pointer indirections) is a potentially non-Go type (that is, needs the more expensive write barrier). But it might be prudent to use a more robust write barrier for unsafe.Pointer anyway, since its use in application code increases the risk of subtle pointer errors.

So the rules would be, roughly:

Note that the code in Russ's example would be invalid under this rule: *[1 << 20]*C.char is not a Go type, but []*C.char is. (We would presumably detect it as invalid while evaluating the slice expression if GODEBUG=cgocheck is set to a sufficiently high threshold.)

I will leave it up to you to decide whether that rule is too complicated.

rsc commented 7 years ago

The current type system does not distinguish between C memory and Go memory and cannot be reinterpreted to do so.

bcmills commented 7 years ago

@rsc I don't think anyone is proposing that the type system should distinguish between C memory and Go memory. The rule I described above constrains initialization, not object location.

It's quite possible that there is a better choice of phrases than "Go type", but it wasn't obvious to me. The general idea is that a value of a "Go type" is always initialized by a Go statement or expression, whereas a value of a "non-Go type" may or may not be.

The current type system does distinguish between pointers and non-pointers, particularly with the use of unsafe.Pointer and across cgo boundaries. In the case of C out-parameters we have a third kind of data ("non-pointer stored in a pointer type"), and the proposed rule is an attempt to fit that into the existing type system (by allowing non-pointers to be stored in pointer slots of "non-Go" types).

broady commented 7 years ago

Bumping to Go 1.8.4 (or maybe Go 1.9)

rsc commented 7 years ago

Are we just not going to fix this for Go 1.8, @aclements?

aclements commented 7 years ago

Since we haven't received any external reports of this being a problem and there are ways to work around it, I would lean toward not fixing this for 1.8.

It's also still unclear to me what the fix would be. We decided in April that we didn't want to make the write barrier more expensive (and complicate the runtime) to detect whether the slot is outside Go-managed memory, especially without evidence that this is actually a problem. So that leaves changing the rules (and maybe making C.malloc zero memory and/or introducing C.new).

Either way, doesn't seem like a 1.8 issue, so I'm re-milestoning it to 1.10.

bcmills commented 7 years ago

So that leaves changing the rules (and maybe making C.malloc zero memory and/or introducing C.new).

I'm more concerned about the implications for passed-in buffers than for the Go C.malloc wrapper itself: it's easy enough for Go callers to use calloc explicitly if they know they need to, but if we're exporting a Go API to C, it won't necessarily be obvious what kind of transitive zero-initialization requirements that would impose on C callers.

odeke-em commented 6 years ago

Hello @aclements, how's it going? Neither have we received any reports, in between the Go1.9 to Go1.11 cycles. Perhaps do you think we should bump this for Go1.12 or later?

ianlancetaylor commented 6 years ago

Another likely example of this problem in #26288.

ianlancetaylor commented 6 years ago

26288 is in fact exactly the kind of case @bcmills discussed in https://github.com/golang/go/issues/19928#issuecomment-336998313 .

ianlancetaylor commented 6 years ago

Perhaps we could partially alleviate, if not actually fix, this problem by adding an additional check step before we complain about a bad pointer. We can check where we found the bad pointer: if we found it in memory that is not on the Go heap, then we just discard it.

RLH commented 6 years ago

The write barrier currently discards the slot before this check is done. We would have to add the slot to the write barrier buffer or perhaps do the check earlier. Either way comes with some per-write barrier cost.

chfast commented 6 years ago

Please at least add a not about this issue to the cgo documentation.

aclements commented 6 years ago

The write barrier currently discards the slot before this check is done. We would have to add the slot to the write barrier buffer or perhaps do the check earlier. Either way comes with some per-write barrier cost.

Thinking about this a bit more, here's a possibility: rather than recording the "old" and "new" in the write barrier buffer, record "old" and "slot". This may actually make the write barrier fast path slightly faster since it un-constrains one register around all pointer writes. Then, during flushing, we read the "new" value from the recorded slot. This is likely to be slightly slower since it's less likely to be in cache at this point (though only slightly). We have to resolve this address to an object anyway, so if this resolution fails, we further scrutinize the slot address. If the slot address is outside Go memory, we also ignore the corresponding "old" pointer. This leans on the rule that you're not allowed to write a Go pointer into C memory.

One remaining loose end would be nil pointers. Is Go code allowed to write a nil pointer into C memory? If so, do we need to scrutinize the slot address when we see a nil write?

bcmills commented 6 years ago

Is Go code allowed to write a nil pointer into C memory?

Yes.

If so, do we need to scrutinize the slot address when we see a nil write?

Not sure.

aclements commented 6 years ago

To be a little more concrete, here's pseudo-code for what I'm proposing:

have old, slot from write barrier buffer
new := *slot
newObj := findObject(new)
if newObj == nil {
    if findObject(slot) == nil && slot is not in Go data segment {
        Non-Go pointer stored into non-Go memory. Ignore new and old.
        continue
    }
}
shade(newObj)
shade(findObject(old))

Notably, this never inspects old or tries to guess whether it looks like a garbage value. Pointer writes that could lead to a garbage old are discarded on the basis of writing a non-Go pointer (the new value) into non-Go memory (the slot address). We don't even try to shade old.

Unfortunately, while this avoids the slow-path slot check for most non-nil non-C pointer writes, it still triggers that path for writes of nil to Go memory, which are obviously very common, as well as writes of pointers to Go globals into Go heap memory.

gopherbot commented 6 years ago

Change https://golang.org/cl/123658 mentions this issue: cmd/cgo: add note about bug writing C pointers to uninitialized C memory

chfast commented 6 years ago

Please note that a C function struct S f() is transformed by the compiler to struct S* f(struct S*) and actually called as

struct S _;
f(&_);

Users don't have control over it, so this means that returning C structs from "exported" Go functions can also cause troubles.

ianlancetaylor commented 6 years ago

@chfast I'm likely missing something but I don't see how that can cause a problem. The transformation that you describe is done implicitly by the C compiler. The problems described here only arise in Go code. The Go code is going to return a struct in the usual way for Go code, with the usual zero initialization of the result as neeed. The fact that the C code will then copy that into uninitialized C memory doesn't matter.

chfast commented 6 years ago

Sorry, I incorrectly assumed that "exported" Go functions have C ABI.

ianlancetaylor commented 6 years ago

@chfast Behind the scenes, C code that calls an exported Go function actually calls a C wrapper, generated by cgo, that calls the Go function using the Go ABI and translates the results into the C ABI.

rsc commented 6 years ago

Nothing more here for Go 1.11.

pdefrain commented 2 years ago

We have an implementation that panics, Go 1.18, frequently (panics within minutes of any restart when loaded). The process is C++, RHEL/8, intel compiler, that loads a Golang shared object. The interface between C++ and Go is one that passes a byte*/[]byte back and forth using a Go channel for each direction. Messages are marshalled into buffers which are all allocated from C.malloc and un-marshaled into message objects in the other language. There is no other interaction between C++ and Go. Messages queued from C++ to Go run on a C++ thread that marshals the message, then un-marshals it into a Go message and puts it into a Go channel for goroutines to take. Messages from Go to C++ are queued on a Go channel by a goroutine, and taken off that channel using a C++ thread that dequeues the Go message from the channel, marshals it into a byte[], then un-marshals it into a C message. Marshaling is used as a means to avoid type converting - unsafe.Pointer is used along with a byte size for the un-marshaling functions - the unsafe.Pointer always being C.malloc memory.

Many variations on this theme have been tried, as well as variations on which heap to use for the byte pointers. The Go channels have even been removed and replace with C.malloc shared memory. Always the same problem occurs.

(gdb) bt

0 runtime.raise () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/sys_linux_amd64.s:165

1 0x00007f0617a20a25 in runtime.dieFromSignal (sig=6) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/signal_unix.go:769

2 0x00007f0617a20c78 in runtime.crash () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/signal_unix.go:861

3 0x00007f0617a0b5b1 in runtime.fatalthrow.func1 () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/panic.go:1257

4 0x00007f0617a0b530 in runtime.fatalthrow () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/panic.go:1250

5 0x00007f0617a0b2f1 in runtime.throw (s=...) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/panic.go:1198

6 0x00007f0617a2ea96 in runtime.gentraceback (pc0=139664142949814, sp0=824634060688, lr0=, gp=, skip=0, pcbuf=0x0, max=2147483647, callback=

{void (runtime.stkframe *, void *, bool *)} 0x7f060d824a18, v=0x0, flags=0) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/traceback.go:274

7 0x00007f06179f4e57 in runtime.scanstack (gp=0xc000000b60, gcw=0xc000045698) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:748

8 0x00007f06179f3d91 in runtime.markroot.func1 () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:232

9 0x00007f06179f3b50 in runtime.markroot (gcw=0xc000045698, i=22) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:205

10 0x00007f06179f59b9 in runtime.gcDrain (gcw=0xc000045698, flags=3) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:1013

11 0x00007f06179f2a05 in runtime.gcBgMarkWorker.func2 () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgc.go:1269

12 0x00007f0617a39146 in runtime.systemstack () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/asm_amd64.s:383

13 0x0000000000000000 in ?? ()

The SIGABRT comes from a GC thread. The GC code is generating a traceback in what appears to be a stack scan and the traceback code panics when it fails to identify a function pointer. I don’t know exactly why the panic generating gentraceback is in the “middle” of the GC code but it appears related to resolving defer operations (it’s not generating a traceback for logging, but, to identify what’s going on in the call stack it’s searching). It appears the GC code may be misusing frame.lr in stack frame 7 along the path executing (which ends in the fatal panic).

                    flr = findfunc(frame.lr)
                    if !flr.valid() {
                            // This happens if you get a profiling interrupt at just the wrong time.
                            // In that context it is okay to stop early.
                            // But if callback is set, we're doing a garbage collection and must
                            // get everything, so crash loudly.
                            doPrint := printing
                            if doPrint && gp.m.incgo && f.funcID == funcID_sigpanic {
                                    // We can inject sigpanic
                                    // calls directly into C code,
                                    // in which case we'll see a C
                                    // return PC. Don't complain.
                                    doPrint = false
                            }
                           if callback != nil || doPrint {
                                    print("runtime: unexpected return pc for ", funcname(f), " called from ", hex(frame.lr), "\n")
                                    tracebackHexdump(gp.stack, &frame, lrPtr)
                            }
                            if callback != nil {
                                    throw("unknown caller pc")

When the panic occurs, I find that frame.lr = 2, or 3, or 0 -- generally a small integer value. It could be there is code not setting frame.lr = 0 before the code gets here to avoid the valid() check.

I've found that if I modify the above code, to detect what appears to be an invalid pointer in frame.lr, and not check it, that our code runs fine w/o any memory growth for long periods of time. But still, it eventually panics, possibly because the invalid pointer check is only an approximation as opposed to being a proper fix.

ianlancetaylor commented 2 years ago

@pdefrain Please open a new issue for your problem rather than adding on to this rather old issue. Your problem does not sound the same as the problem described in this issue, and I'm not even sure that this issue is still broken.

In your new issue please include reproduction instructions if possible. Thanks.

mjd95 commented 6 months ago

@pdefrain Please open a new issue for your problem rather than adding on to this rather old issue. Your problem does not sound the same as the problem described in this issue, and I'm not even sure that this issue is still broken.

Just wanted to confirm that this one is still an issue, at least in Go 1.22.1. We were hitting this with some code similar to @rsc 's original example, and worked around by switching C.malloc to C.calloc.