golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.61k stars 17.61k forks source link

proposal: sync: add GetLocal() method to `sync.Pool` in order to return only P-local objects #65104

Open valyala opened 9 months ago

valyala commented 9 months ago

Proposal Details

Currently sync.Pool.Get() works in the following way:

  1. It returns P-local object if there is one.
  2. Otherwise it steals an object from other P workers if there are no P-local objects.

This logic works great on systems with small number of CPU cores, but it may significantly slow down on systems with many CPU cores because of the following reasons:

It looks like that the way to go is to disallow stealing objects from other P workers. But this may slow down some valid use cases for sync.Pool, when Pool.Get() and Pool.Put() are called for the same pool from different sets of goroutines, which run on different sets of P. Disallowing stealing objects from other P workers will result in excess memory allocations at Pool.Get() call side, since P-local cache will be always empty.

Given these facts, it would be great to add GetLocal() method to sync.Pool, which must return nil if the object isn't found in P-local cache, without an attempt to steal the object from other P caches. Then the GetLocal() method can be used instead of Get() in performance-critical code, which calls Put() for the object returned from the pool on the same goroutine (and P) most of the time.

Update: the current implementation of sync.Pool.Put() already puts the object only to P-local storage, but this isn't documented and can be changed in the future, so it may be worth adding PutLocal() method in order to make the API more consistent.

Summary

Jorropo commented 9 months ago

To me it looks like a worst version of #18802 because it unnecessarily tie this logic to sync.Pool and it's implementation.

valyala commented 9 months ago

To me it looks like a worst version of https://github.com/golang/go/issues/18802 because it unnecessarily tie this logic to sync.Pool and it's implementation.

The current proposal is aimed at solving a single practical task - to have an ability to efficiently use sync.Pool on systems with big number of CPU cores, when Pool.Put() is called at the same goroutine as Pool.Get() after performing some CPU-bound task. It doesn't require changing the semantics of sync.Pool. It just adds an additional method - GetLocal() with very clear purpose - to return P-local object, if the pool contains it, otherwise to return nil.

The #18802 proposal, on the other hand, has completely different purpose - to provide access to per-CPU objects. This proposal has no clear vision yet, not saying about the implementation. For example, as I understood from the #18802 proposal, it wants creating some magic sharded object, which will have exactly one shard per P. On the other hand, the sync.Pool allows storing multiple local objects at the same P. It also allows dropping the stored objects at any time, so the user must expect that sync.Pool.Get() or sync.Pool.GetLocal() may return nil.

Jorropo commented 9 months ago

@valyala thx for the explanation in what way it is different. To me they still seem extremely similar, I could create GetLocal with #18802 by storing a pointer to a slice in the sharded value. (so can store more than one value).

About your proposal I don't think returning nil feel sync.Poolish, imo it should call sync.Pool.New.

qiulaidongfeng commented 9 months ago

Note that if sync.Pool also supports stealing objects from other P cache, a goroutine Local store or P store may be required, to prevent one goroutine from getting objects in a P-local and another goroutine from stealing objects from that P-local., so that {Get,Put}Local does not require atomic operation synchronization, because this way, the same block of memory will not have unlocked or not have atomic synchronized reads and writes.

prattmic commented 9 months ago

cc @aclements @mknyszek

aclements commented 9 months ago

You've identified an interesting issue, but I don't think putting the problem on the user by adding an API is the right answer. I think it would be really unclear to users when to call Get versus GetLocal, and would likely depend on the properties of the system their code is running on.

Instead, I think we could improve sync.Pool's implementation without exposing this complexity to users. For example, Get is always allowed to return a new object, so we could certainly implement a policy where it steals less aggressively (or even never!) from distant caches. I think the main barrier to doing something like that would be that we don't have any sort of CPU/memory affinity right now, and I would worry that doing it strictly based on local/non-local P would interfere too much with existing uses. (Besides, stealing from other CPUs that share an LLC is still pretty cheap compared to stealing from a remote node.)

valyala commented 9 months ago

I don't think putting the problem on the user by adding an API is the right answer. I think it would be really unclear to users when to call Get versus GetLocal, and would likely depend on the properties of the system their code is running on.

The sync.Pool isn't a trivial concept. It already requires some additional qualification for the efficient usage. For example:

So I don't think that adding sync.Pool.GetLocal and sync.Pool.PutLocal will complicate developer's life, who is already aware of potential issues of the sync.Pool mentioned above. On the other hand, it will allow optimizing the cases when the object returned to the pool on the same goroutine where it has been obtained from the pool after performing some CPU-bound work like in the code below:

var p sync.Pool

func f() {
    v := p.Get()
    doSomeCPUBoundWork(v)
    p.Put(v)
}

Then this case can be optimized to:

var p sync.Pool

func f() {
  v := p.GetLocal()
  doSomeCPUBoundWork(v)
  p.PutLocal(v)
}

The code with GetLocal / PutLocal is clear enough to understand what actually it does if you are familiar with the sync.Pool concept.

Jorropo commented 9 months ago

Is there any situation where someone would mix GetLocal and Get calls to the same pool ?

valyala commented 9 months ago

Is there any situation where someone would mix GetLocal and Get calls to the same pool ?

In theory such situations can occur, while they are pretty rare. For example, if the pool is simultaneously used in CPU-bound and IO-bound cases:

var p sync.Pool

func f1() {
  v := p.GetLocal()
  doSomeCPUBoundWork(v)
  p.PutLocal(v)
}

func f2() {
  v := p.Get()
  doSomeIOBoundWork(v)
  p.Put(v)
}

The Get() and Put() can still be used in worker concept, where the object is obtained from the pool in one gorotuine and is returned to the pool in another goroutine:

func f() {
  v := p.Get()
  workerCh <- v
}

func worker() {
  for v := range workerCh {
    doSomeWork(v)
    p.Put(v)
  }
}

var workerCh = func() chan any {
  ch := make(chan any)
  go worker()
  return ch
}()
valyala commented 9 months ago

After closer look at the current sync.Pool implementation, it looks like the IO-bound case also improves when using Pool.GetLocal instead of Pool.Get:

func f() {
  v := p.GetLocal()
  doSomeIOBoundWork(v)
  p.PutLocal(v)
}

var p sync.Pool

If this code is executed concurrently by all the P workers, then it is faster to get P-local object instead of stealing an object from other P caches in the pool - there are high chances that P-local cache already contains an object put there via PutLocal() call. The only valid remaining case for Get() is the worker pattern described above. But even this pattern can be switched to GetLocal() if GOMAXPROCS concurrent workers read from workerCh - this increases chances that an object already exists in the P-local cache.

So, probably, it is good idea to disable stealing objects from other P caches at sync.Pool by default, and, instead, add an ability to explicitly enable object stealing for some rare cases, by setting AllowStealing option at sync.Pool:


var p = &sync.Pool{
  // Explicitly allow stealing an object from other P caches
  AllowStealing: true,
}

func f() {
  v := p.Get()
  workCh <- v
}

var workCh = func() chan any {
  ch := make(chan any)
  go func() {
    for v := range ch {
      doSomeWork(v)
      p.Put(v)
    }
  }()
  return ch
}()

If the AllowStealing option isn't set for this case, then it will degrade to new object allocation at f(). Probably, this is not so bad, since it may still work faster if Go runtime prefers allocating objects from P-local memory. Then the initial access to the allocated object will be faster than the initial access to the object stolen from other P caches in sync.Pool.

valyala commented 9 months ago

It looks like the following idea may automatically cover the worker pattern case without the need to introduce AllowStealing option:

  1. The sync.Pool must contain per-P counter of local cache misses.

Then Get() can be implemented in the following way:

  1. If an object exists in P-local cache, then it is immediately returned and the per-P cache miss counter is reset if it doesn't exceed some threshold N.
  2. If an object isn't found in the local cache, then per-P counter of cache misses is incremented.
  3. If the per-P counter doesn't exceed some threshold N, then nil is returned.
  4. Otherwise an attempt to steal an object from other P caches is performed. The per-P cache miss counter isn't reset even if an object is successfully stolen from other P cache.

This algorithm prefers returning P-local objects and returning nil up until N repeated cache misses, in the hope that the user code returns an object to P-local cache via sync.Pool.Put() soon. If N repeated cache misses are reached, then we can assume that pool is used in some kind of worker pattern. So fall back to object stealing for all the subsequent calls to Get().

The question is which N threshold should be chosen:

someview commented 8 months ago

Good idea. We have met some similiar problem. When put rate is slower than get rate, the sync.pool would hold much memory until next gc. This method v := p.GetLocal() give us more opportunity to control memory usage for pool.