proposal: runtime: add a pointer check code to gcWriteBarrier for helping debugging zombie pointers

octobered commented 2 years ago

## Background Zombie pointer problem is very difficult to debug. Now GoGC will only report this issue in scanning objects, even though Go GC will panic here, information is not enough for debugging (reportZombies will print this span, but won't tell developer how this pointer collected to gcw, developer will have to find out where this memory address comes from.). Currently there are two ways that could put pointers to gcw: `greyobject` and `gcWriteBarrier` (and its cx, dx... derivation). `greyobject` already has `debug.gccheckmark` flag to validate if this pointer is ok, however `gcWriteBarrier` has no corresponding way to do the validation. For example, a misuse `atomic.StorePointer` won't have any errors reported, if developers don't pay attention to the correctness of this value. (Espcially when this field is an ID or some other int64 fields have no real world meaning) [code snippet](https://go.dev/play/p/J-HJCiZvlFg) ```go package main import ( "fmt" "sync/atomic" "unsafe" ) type struct1 struct { Field int64 } type struct1ptr *struct1 var s1 struct1ptr = &struct1{ 1, } func main() { f1 := struct1{ Field: 0, } atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(s1)), unsafe.Pointer(&f1)) fmt.Println(s1.Field) } ``` Moreover, if this function is working under heavy load, this wrong assign may leads to a zombie pointer which is hard to debug, [link](https://go.dev/play/p/tUVc81fYRAo) As this panic will only tell developers which `mspan` contains zombie pointer, but no information about where this pointer is allocated. ```go package main import ( "fmt" "runtime" "sync" "sync/atomic" "time" "unsafe" ) var current time.Time func main() { var wg sync.WaitGroup for i := 0; i < 1000; i++ { wg.Add(2) go func() { defer wg.Done() rc := time.Now() atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(¤t)), unsafe.Pointer(&rc)) }() go func() { defer wg.Done() fmt.Printf("%v\n", current.Unix()) }() runtime.GC() } wg.Wait() fmt.Println("===over===") } ``` Another case is unsafe type conversion. [playground](https://go.dev/play/p/4DGlXgSj0YZ) Same `UserInfo` from different RPCs may leads developer to use unsafe.Pointer to elimate new allocation performance loss and avoid writing too long code for mapping field. But after IDL files updated, `UserInfoFromARPC` and `UserInfoFromBRPC` may be different, and there is a high probability that the newly added fields will be of pointer type, since new fields are often optional. ```go package main import ( "fmt" "runtime" "sync" "unsafe" ) type UserInfoFromARPC struct { ID int64 Age int64 } type UserInfoFromBRPC struct { ID int64 Age int64 Extra *string // this field was introduced after UnsafeQueryB is finished, and generated by some idl generator } func Query() *UserInfoFromARPC { return &UserInfoFromARPC{ ID: 100, Age: 100, } } func UnsafeQueryB() *UserInfoFromBRPC { return (*UserInfoFromBRPC)(unsafe.Pointer(Query())) } func main() { var wg sync.WaitGroup for i := 0; i < 1000; i++ { wg.Add(1) go func() { fmt.Println(UnsafeQueryB()) wg.Done() }() runtime.GC() } wg.Wait() } ``` ## Proposal ### Idea in brief Add a macro-controlled code section in gcWriteBarrier, gcWriteBarrier will try to check pointer validation before store it. If the pointer is illegal, panic here. In general, the panic stack will have the frame of the wrong "allocation" happened function. Developer could use this to find their bugs easier. ### Why choose to modify gcWriteBarrier ZombiePointers will only happened when this wrong pointer has been esacped to heap. If a pointer doesn't escaped to heap, it won't cause zombie pointer at least. ### Why choose to using a macro instead of a flag As `gcWriteBarrier` is a frequently called function, using macro will avoid changing regular code. Only when developers find there is a zombie pointer in their program and it's reproducible, they could compile codes with `-+` and this flag to recompile runtime and enable these tracking code, and waiting for panic to find out where the bug is. ### Cons 1. Controlling compiled code by c macro isn't a common way in go > but maybe least performance overhead? PS: should we modify `reportZombies` to make it not panic when `invalidptr=0` is set?

ianlancetaylor commented 2 years ago

Go doesn't have macros, so it's not clear to me exactly what you are suggesting.

gcWriteBarrier is run very frequently, so I agree that we should not add a run-time check for whether to check that the pointer is valid.

Perhaps we could do this with build tags, but I have a feeling that few people would such a feature in practice.

CC @golang/runtime

ericlagergren commented 2 years ago

The first example seems like something that vet should catch. It is rather easy to either forget that the API is actually func StorePointer(dst **T, t *T) or to typo the unwieldily (*unsafe.Pointer)(...) expression. I make this mistake basically every time I use the API. A vet check would catch the problem at its source.

I think that it might be possible for vet to catch the second case. It would a false positive if UserInfoFromARPC were allocated off Go's heap (like C.malloc or syscall.Mmap), but that could be worked around by using //go:notinheap. But I don't know.

If vet could catch these then there wouldn't be a need for runtime checks.

cherrymui commented 2 years ago

FWIW, in the past for debugging I tried to set the write barrier buffer size to 1, so it will always go through the slow path, which has at least some check e.g. in findObject, and it was helpful. Perhaps we can make a GODEBUG mode that set the the buffer size to 1.

ianlancetaylor commented 2 years ago

I think that you can actually get the effect of a buffer size of 1 today by using GODEBUG=cgocheck=2. Though that's hardly intuitive.

guodongli-google commented 2 years ago

A vet checker looks plausible: it can strip the unsafe.Pointer() conversions and obtain the original types, and check whether the types of the two arguments:

atomic.StorePointer(T1, T2)

Valid cases include:

T1 = *T2
T1 = *T3, T2 is compatible with T3.

The check can easily go transitively over the types, e.g. when T2 and T3 are structs with fields of named types. So the checker may narrow down the patterns it will report, e.g. only report the case when T1 = T2, which covers the examples:

atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(s1)), unsafe.Pointer(&f1))

and

atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&current)), unsafe.Pointer(&rc))

The frequency of these patterns in real code is a question though.

octobered commented 2 years ago

The first example seems like something that vet should catch. It is rather easy to either forget that the API is actually func StorePointer(dst **T, t *T) or to typo the unwieldily (*unsafe.Pointer)(...) expression. I make this mistake basically every time I use the API. A vet check would catch the problem at its source.

I think that it might be possible for vet to catch the second case. It would a false positive if UserInfoFromARPC were allocated off Go's heap (like C.malloc or syscall.Mmap), but that could be worked around by using //go:notinheap. But I don't know.

If vet could catch these then there wouldn't be a need for runtime checks.

I get this point, vet would be a less intrusive way. Actually I'm trying to developing a plugin under golangci-lint so that if anyone meets this problem, they could use it to debug. ( I think these codes are not very common, should I add it go go vet instead of golangci-lint?

However the actual problem I concern it's that two examples I gave are just coming from my personal knowledge, I don't know other cases that could trigger this.

I think that you can actually get the effect of a buffer size of 1 today by using GODEBUG=cgocheck=2. Though that's hardly intuitive.

Well they are kinda different actually, I think you mean when cgocheck=2, cgoCheckWriteBarrier will be called. cgoCheckWriteBarrier seems will not check it dst and src are both Go pointers, as this check depends on if this pointer smaller than span limit, which they are "in span" very likely. (These zombie pointers are smaller than span, but they are larger than freeindex and their allocBits are unset. They could be in span and zombie.)

Go doesn't have macros, so it's not clear to me exactly what you are suggesting.

gcWriteBarrier is run very frequently, so I agree that we should not add a run-time check for whether to check that the pointer is valid.

Perhaps we could do this with build tags, but I have a feeling that few people would such a feature in practice.

CC @golang/runtime

I do find out go rarely use macros, however gcWriteBarrier is assemble, I think I might could introduce a macro like GOAMD64_v2/3/4 and it's controlled by flags? (I haven't read related code, it's just my guess😂

octobered commented 2 years ago

and another thing I think maybe should be pointed out is invalidptr can not fully disable zombie pointer check. invalidptr will only affect if this pointer is collected by findObject, if this pointer is marked by gcWriteBarrier, this will just panic in reportZombies, which is not controlled by invalidptr.

I am not very sure but is it ok to make reportZombies controlled by invalidptr?

randall77 commented 2 years ago

Another "fix" for this problem is to just wait for #50860 to be accepted and implemented. Then we can have people move to using the generic (*Pointer[T]).Store which doesn't have the same confusion that the current StorePointer has.

golang / go

proposal: runtime: add a pointer check code to gcWriteBarrier for helping debugging zombie pointers #51276