golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.87k stars 17.65k forks source link

proposal: expose the `internal/runtime/sys.Prefetch` intrinsic #68769

Open rip-create-your-account opened 2 months ago

rip-create-your-account commented 2 months ago

Proposal Details

I'm working on a concurrent hash map and with me having experience in using them in highly concurrent contexts with high throughput requirements I knew to add batch Put and Get methods to it. Such API design exposes the very obvious opportunity for prefetching the data.

To experiment with the impact of internal/runtime/sys.Prefetch on the batch Get call I modified runtime/panic.go to expose the intrinsic. go tool objdump then confirmed that a PREFETCHT0 instruction is emitted into the binary as expected.

// in runtime/panic.go
func Prefetch(addr uintptr) {
    sys.Prefetch(addr)
}

// and using it like this
for _, bucket := range initial_buckets_for_keys {
    runtime.Prefetch(uintptr(unsafe.Pointer(bucket)))
}

Looking up 100k distinct entries from a hash map with 5000k entries became significantly faster.

old: BenchmarkLookupBatches/size=100000-8                      63.14 million_lookups/s
new: BenchmarkLookupBatches/size=100000-8                      96.64 million_lookups/s

Which looks like a worthwhile optimization for me.

I personally like how for me this runtime.Prefetch Just Works™ so I propose adding just it. I also propose to not tell of it's addition in the change log so people don't use it willy-nilly. Just imagine the damage that all the blog spam will cause in the overall ecosystem, possibly leaking into other language communities too.

Jorropo commented 2 months ago

This signature has less pitfalls:

func Prefetch[E any, P ~*E | ~unsafe.Pointer](P)

Does this have any memory model interactions ? For example does anything happens if we call prefetch concurrently of writes to the same address ?


I also propose to not tell of it's addition in the change log

:-1:

so people don't use it willy-nilly.

:+1: I think it would be more effective if we put it in unsafe.Prefetch.


Lastly I'm not sure how easily portable to other architectures that would be.

ianlancetaylor commented 2 months ago

Does this have any memory model interactions ? For example does anything happens if we call prefetch concurrently of writes to the same address ?

There are no memory model interactions here. A prefetch by definition does not change the meaning of a program, it only affects the execution time.

On an architecture that doesn't have a prefetch instruction, the proposed function would be a no-op.

It's worth noting that amd64 has at least six different kinds of prefetch instructions. The GCC compiler __builtin_prefetch intrinsic permits specifying up to three arguments: the address, a read/write flag, and locality. The read/write flag is 0 for a read, 1 for a write. The locality ranges from 0, meaning no temporal locality and there is no reason to cache the data at all after the actual memory access, to 3, meaning high locality and the value should be left in all caches if at all possible.

Jorropo commented 2 months ago

There are no memory model interactions here.

I thought it would be able to generate pagefaults which can be turned into callbacks with signals but after going over the docs at least on amd64 it does not.

On an architecture that doesn't have a prefetch instruction, the proposed function would be a no-op.

I'm more worried about architectures that don't implement it the same way. This looks like the kind of things that could improve on a particular CPU, but pessimize on an other one.

rip-create-your-account commented 2 months ago

@Jorropo

This looks like the kind of things that could improve on a particular CPU, but pessimize on an other one.

This is easily solved by only using it when it's useful to use. General purpose data structures are unlikely to need this. I only found one place in the Go project that actually uses sys.Prefetch currently and that is a very specialized data structure. Meanwhile plenty of recent literature shows that in huge hash maps it is useful especially when mixed with batch operations.

I thought it would be able to generate pagefaults which can be turned into callbacks with signals but after going over the docs at least on amd64 it does not.

In my use case the bucket pointers for which I'm prefetching are under certain rare conditions nil. Prefetch(uintptr(0)) must work without "affecting program behavior."

@ianlancetaylor

It's worth noting that amd64 has at least six different kinds of prefetch instructions...

The current situation is that there's a small set of {me, myself, I} who are asking for the exposed prefetch instrinsic and We happen to only need PREFETCHT0. And what We are working on is just a research experiment so in a sense a waste of time to begin with. Overcomplicating this proposal with the full prefetch capabilities of AMD64 and ARM64 could have unforeseen consequences. Interests of Ours are in having just enough low friction machinery to prototype data structures and algorithms of interest that then can be ported over and polished in more eloquent languages.

rip-create-your-account commented 2 months ago

Bordering off-topic but apparently some mutex implementations use PREFETCHW when spinning. AMD64 manual does say that

"... However, prefetching write data takes longer than prefetching read data if the processor must wait for another caching master to first write-back its modified copy of the requested data to memory before the prefetch request is satisfied."

which one could read as stalling until the other CPU core is done with the cache-line, possibly yielding the CPU time to a virtual core when hyper-threading is enabled. It also sounds useful in my use case considering that all of my buckets happen to carry a sync.Mutex and the very first thing that I do with buckets is to mu.Lock() it. But I can also see ways in how it could go wrong compared to the smooth brained PREFETCHT0. The result of a biased coin flip tells me that the addition of PrefetchInAnticipationOfWriteThatWillComeSoonEnoughJustAFewLinesBelow() is in place.