proposal: sync: add PLocalCache

valyala commented 3 months ago

Proposal Details

The issue

High-performance code which scales linearly with the number of CPU cores usually needs per-CPU caches for holding some per-CPU state in order to avoid costly inter-CPU synchronization. The state can be re-computed at any time, but the computation may take additional CPU time and other resources, so it is more efficient to cache the computed state per each CPU core and then re-use it.

The sync.Pool can be used as per-CPU cache, but it has the following issues in this context:

sync.Pool.Get() tries stealing an object from other CPUs if the object is missing in the current P. This leads to costly inter-CPU synchronization. The cost of this synchronization increases with the number of available CPU cores.
sync.Pool.Put() may store multiple objects at the same P. This leads to excess memory usage when at most one object is needed per P. sync.Pool.Put() also triggers expensive inter-CPU synchronization if P already contains an object.
sync.Pool may drop cached objects at every GC cycle, so the caller needs to spend additional CPU time for re-creating the object.

The solution

To add sync.PLocalCache struct with the following API:

// PLocalCache caches per-P objects.
//
// Every P may have at most one object at any moment.
//
// PLocalCache is useful for implementing linearly scalable
// CPU-bound algorithms on multi-CPU architectures.
type PLocalCache struct { ... }

// Get returns P-local object from c.
//
// nil is returned if there is no cached object for the current P.
//
// It is guaranteed that the returned object can be accessed only by the current goroutine,
// so there is no need in synchronizing access to it.
//
// Return the object to the cache via Put() if it needs to be re-used next time.
func (c *PLocalCache) Get() any { ... }

// Put puts v to P-local storage at c.
//
// v mustn't be used after Put() call.
//
// There is no guarantee that the subsequent Get() call returns v.
func (c *PLocalCache) Put(v any) { ... }

Implementation details

sync.PLocalCache may be implemented in the way similar to sync.Pool, but without the following abilities:

Stealing objects from other Ps. If the object is missing in P-local storage, then just return nil.
Ability to put multiple objects per every P-local storage. If Put() is called on the storage with already existing P-local object, then just ignore the new object.
Periodic cleanup on every GC cycle. It is guaranteed that the number of cached objects in sync.PLocalCache doesn't exceed GOMAXPROCS, e.g. it is bounded, and it is expected that the user continuously accesses the cached objects. So there is little sense in periodic cleanup of the cache. All the cached objects will be removed after the corresponding sync.PLocalCache is destroyed by garbage collector.

The property of having at most one P-local object in the cache narrows down the applicability of the Get() ... Put() to CPU-bound code without context switches (e.g. without IO, expensive syscalls and CGO calls). This minimizes chances of context switch during the execution of the code between Get() and Put(), so the cached objects will be successfully re-used by this code. For example, it is great to use sync.PLocalCache for scalable random number generator with per-P (aka per-CPU) state. It is also great to use sync.PLocalCache for various CPU-bound parsers, encoders and compressors with some non-trivial state, which can be cached in the P-local cache.

On the other hand, if the chances of context switch between Get() and Put() calls are high, then this increases chances that Get() will return nil most of the time. This forces the user's code to spend additional CPU time on object re-creation. The re-created object will be dropped most of the time on Put() call, since there are high chances that there is another P-local object is put in the cache by concurrently running goroutines. In such cases it is better to use sync.Pool instead of sync.PLocalCache.

Example usage

type ScalableParser struct {
  state sync.PLocalCache
}

func (p *ScalableParser) Parse(s string) result {
  v := p.state.Get()
  if v == nil {
    v = newScalableParserState()
  }
  ps := v.(*scalableParserState)
  r := ps.Parse(s)
  p.state.Put(v)
  return r
}

See also https://github.com/golang/go/issues/65104 . Now I think it is better to provide a separate entity with clear semantics than complicating the semantics of sync.Pool and/or trying to efficiently cover multiple different cases with sync.Pool.

Generics

It may be good providing generic-based sync.PLocalCache[T any], but it is OK to provide non-generic implementation to be consistent with sync.Pool.

gabyhelp commented 3 months ago

Related Issues and Documentation

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

ianlancetaylor commented 3 months ago

Sounds like a dup of #8281.

mknyszek commented 2 months ago

This idea is interesting, but I think it stops just short of being able to cover many more use-cases, without much of an increase in complexity.

For example, with a hook into where two values are in conflict (similar to the Merge method here: https://github.com/golang/go/issues/18802#issuecomment-1884012504), and iteration over all the values, we can support additional use-cases like scalable counters and logging (locally cache buffers, and on conflict, just flush one of them).

Some other notes:

I don't think we should make the concept of a P part of any public API; it's a concept that's internal to the runtime. I don't have a better name off the top of my head, but I'm sure there's a good one out there.
I don't think there's a reason not to use generics for new APIs where it's appropriate (and I do think it's appropriate here). You can always make any your type parameter if that's what you want.

I think this proposal is closely related to #18802 actually, and not far from where I wanted to explore next with being able to access local values without synchronization.

valyala commented 2 months ago

For example, with a hook into where two values are in conflict (similar to the Merge method here: https://github.com/golang/go/issues/18802#issuecomment-1884012504), and iteration over all the values, we can support additional use-cases like scalable counters and logging (locally cache buffers, and on conflict, just flush one of them).

I'd prefer leaving p-local cache as simple as possible, so it solves the original issue described above - to provide an efficient and cpu-scalable mechanism for caching the state of various CPU-bound parsers and encoders. It is expected that the state may be lost at any time (for example, when GOMAXPROCS changes, when the goroutine is re-scheduled to other P or when somebody forgets returning the state to the cache), so it could be re-constructed when needed.

Other use cases like https://github.com/golang/go/issues/8281 and https://github.com/golang/go/issues/18802 should be covered separately, since they are more complicated and they have no clear solution yet.

golang / go