golang / go

The Go programming language
https://go.dev

proposal: sync: support for sharded values #18802

Open aclements opened 7 years ago

aclements commented 7 years ago

Per-CPU sharded values are a useful and common way to reduce contention on shared write-mostly values. However, this technique is currently difficult or impossible to use in Go (though there have been attempts, such as @jonhoo's https://github.com/jonhoo/drwmutex and @bcmills' https://go-review.googlesource.com/#/c/35676/).

We propose providing an API for creating and working with sharded values. Sharding would be encapsulated in a type, say sync.Sharded, that would have Get() interface{}, Put(interface{}), and Do(func(interface{})) methods. Get and Put would always have to be paired to make Do possible. (This is actually the same API that was proposed in https://github.com/golang/go/issues/8281#issuecomment-66096418 and rejected, but perhaps we have a better understanding of the issues now.) This idea came out of off-and-on discussions between at least @rsc, @hyangah, @RLH, @bcmills, @Sajmani, and myself.
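For concreteness, the proposed surface might look something like this (the method set is the one described above; the stub bodies and any internal fields are placeholders, not part of the proposal):

// Sharded holds one value per shard (roughly, per P/CPU). Sketch only.
type Sharded struct {
    // internal per-shard storage omitted
}

// Get checks out the current shard's value, or nil if nothing has been Put.
func (s *Sharded) Get() interface{} { panic("sketch only") }

// Put returns a value to the current shard; calls to Get and Put are paired.
func (s *Sharded) Put(v interface{}) { panic("sketch only") }

// Do calls f with each shard's value.
func (s *Sharded) Do(f func(interface{})) { panic("sketch only") }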

This is a counter-proposal to various proposals to expose the current thread/P ID as a way to implement sharded values (#8281, #18590). These have been turned down as exposing low-level implementation details, tying Go to an API that may be inappropriate or difficult to support in the future, being difficult to use correctly (since the ID may change at any time), being difficult to specify, and as being broadly susceptible to abuse.

There are several dimensions to the design of such an API.

Get and Put can be blocking or non-blocking:

Do could be consistent or inconsistent:

It may be that we can't make this decision at the API level and have to provide both forms of Do.

I think this is a good base API, but I can think of a few reasonable extensions:

aclements commented 7 years ago

My own inclination is towards the non-blocking API with a bounded overflow list. A blocking API seems antithetical to the goal of reducing contention and may lead to performance anomalies if a goroutine or OS thread is descheduled while it has a shard checked out, and a non-blocking API with a required combiner may prevent certain use cases (e.g., large structures, or uses that never read the whole sharded value). It also devolves to the blocking API if the bound is 0.

ianlancetaylor commented 7 years ago

The proposal as written is rather abstract. I think it would help to examine the specific use cases that people have for such a thing.

For example, it's clear that one use case is collecting metrics. Presumably the idea is that you have some sort of server, and it wants to log various metrics for each request. The metrics only need to be accumulated when they are reported, and reporting happens much less often than collection. Using a lock;update;unlock sequence will lead to lock contention. But (let's say) we need the metrics to be accurate. So the idea of sharding for this case is a lock;update;unlock sequence with a sharded lock, and an accumulate step that does lock;collect;zero;unlock for each sharded metric. That gives us the values we need while minimizing lock contention.

One way to implement this use case is for the sync.Sharded to require a combiner method as you describe. Conceptually, then:

func (s *Sharded) Get() interface{} {
    // Check the current shard's value out, leaving nil in its place.
    s.LockCurrentShard()
    r := s.CurrentShardValue()
    s.SetCurrentShardValue(nil)
    s.UnlockCurrentShard()
    return r
}

func (s *Sharded) Put(v interface{}) {
    // Return v to the current shard; if the shard already holds a value,
    // combine the two.
    s.LockCurrentShard()
    defer s.UnlockCurrentShard()
    c := s.CurrentShardValue()
    if c == nil {
        s.SetCurrentShardValue(v)
    } else {
        m := s.Combine(c, v) // Combine function defined by user.
        s.SetCurrentShardValue(m)
    }
}

For typical metrics the Do method does not need to be consistent. However, it's not hard to have a consistent Do as long as the function passed to Do does not use the sync.Sharded value itself.

With this outline, we see that there is no need for sync.Sharded to maintain an overflow list. Any case that wants to use an overflow list will do so in the Combine function. Obviously the Combine function must not use the sync.Sharded value, as that may lead to deadlock, but otherwise it can do whatever it likes.

What other uses are there for sync.Sharded, and what sorts of implementation do they suggest?

bcmills commented 7 years ago

I had been considering a somewhat narrower API containing only Get and one of {Range, Do, ForEach}, bounding the number of distinct values to the number of threads in the program. The calling code would provide a func() interface{} at construction time to use when Get is called on a thread without an existing value.

The semantics would be similar to the non-blocking proposal: Get returns the current shard's value (but does not guarantee exclusiveness), and Range iterates over all existing values.

Because of the lack of exclusiveness, application code would still have to use atomic and/or sync to manipulate the values, but if the value is uncontended and usually owned by the same core's cache, the overhead of that application-side synchronization would be relatively small (compared to locking overhead for non-sharded values).
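For illustration, a counter written against that narrower surface might look like this (the constructor and exact method names are hypothetical, and the application does its own atomic synchronization):

// Hypothetical narrower API: Get returns the current shard's value without
// exclusive ownership, so the counter synchronizes with atomics.
type counter struct {
    shards *sync.Sharded // constructed with an allocator: func() interface{} { return new(int64) }
}

func (c *counter) Add(n int64) {
    p := c.shards.Get().(*int64) // per-shard *int64, usually in this core's cache
    atomic.AddInt64(p, n)
}

func (c *counter) Count() (total int64) {
    c.shards.Range(func(v interface{}) {
        total += atomic.LoadInt64(v.(*int64))
    })
    return total
}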

That approach has a few advantages over the alternatives in the current proposal.

  1. There is no "overflow list" to manage. The number of values is strictly bounded to the number of threads, and the value for a given thread cannot accidentally migrate away or be dropped.
  2. Application code using atomic values (as for the stats-counter use case in #8281) would not have to deal with lock-ordering (as it would with a blocking Get and Put).
  3. There is no possibility of deadlock (or overallocation of values) due to a missing Put in application code. This is perhaps less significant if we can make trivial defer usage less expensive (#14939) and/or add a vet check (along the lines of lostcancel), but it seems simpler to avoid the problem in the first place.

It has one disadvantage that I'm aware of:

  1. Application code must include its own synchronization code.

Are there other tradeoffs for or against the narrower Get/Range API?

bcmills commented 7 years ago

[Using the "current" value of each shard even if it's checked out] requires that shard values be immutable, but it makes Do non-blocking.

It doesn't even require immutability: "externally synchronized" and/or "atomic" would suffice, although "externally synchronized" carries the risk of lock-ordering issues.

bcmills commented 7 years ago

One way to implement [consistent counting] is for the sync.Sharded to require a combiner method as you describe.

Anything that reduces values seems tricky to get right: you'd have to ensure that Do iterates in an order such that Combine cannot move a value that Do has already iterated over into one that it has yet to encounter (and vice-versa), otherwise you risk under- or double-counting that value.

I don't immediately see how to provide that property for Do in the general case without reintroducing a cross-thread contention point in Put, but it may be possible.

ianlancetaylor commented 7 years ago

For a consistent Do, first lock all the shards, then run the function on each value, then unlock all the shards. For an inconsistent Do, it doesn't matter.
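Concretely, something like this (the internal shard layout here is invented purely for illustration):

// Sketch of a consistent Do: lock every shard, apply f, then unlock them all.
// An inconsistent Do could instead lock and visit one shard at a time.
func (s *Sharded) doConsistent(f func(interface{})) {
    for i := range s.shards {
        s.shards[i].mu.Lock()
    }
    for i := range s.shards {
        f(s.shards[i].value)
    }
    for i := range s.shards {
        s.shards[i].mu.Unlock()
    }
}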

bcmills commented 7 years ago

For a consistent Do, first lock all the shards, then run the function on each value, then unlock all the shards.

That essentially makes Do a stop-the-world operation: it not only blocks all of those threads until Do completes, but also invalidates the cache lines containing the per-shard locks in each of the local CPU caches.

Ideally, Do should produce much less interference in the steady state: it should only acquire/invalidate locks that are not in the fast path of Get and Put. If the values are read using atomic, that doesn't need to invalidate any cache lines at all: the core processing Do might need to wait to receive an up-to-date value, but since there is no write to the cross-core data the existing cached value doesn't need to be invalidated.

I guess that means I'm in favor of an inconsistent Do, provided that we don't discover a very compelling use-case for making it consistent.

funny-falcon commented 7 years ago

For some usages the number of allocated "values" must be strictly bounded, i.e. the number of allocated values should not change. And preferably, values should be allocated at a predictable time, for example at container (Sharded) creation. For that kind of usage, an interface with Put is not useful.

Probably it should be a separate container:

// NewFixSharded preallocates all values by calling the alloc function, and returns a new FixSharded.
// FixSharded never changes its size, i.e. it never allocates a new value after construction.
func NewFixSharded(alloc func() interface{}) *FixSharded {}
// NewFixShardedN preallocates exactly n values by calling the alloc function, and returns a new FixSharded.
func NewFixShardedN(n int, alloc func() interface{}) *FixSharded {}
func (a *FixSharded) Get() interface{} {}

If the size never changes, there is no need for Do, ForEach, or locks. Application code must include its own synchronization code.

Rationale: GOMAXPROCS changes rarely (almost never), so dynamic allocation is excessive.

I could be mistaken about GOMAXPROCS being constant.

ianlancetaylor commented 7 years ago

@bcmills Well, as I said earlier, I think we need to look at specific use cases. For the specific use case I was discussing, I assert that the cost of a consistent Do is irrelevant, because it is run very infrequently.

What specific use case do you have in mind?

bcmills commented 7 years ago

@ianlancetaylor I'm specifically thinking about counting (as in #8281) and CPU-local caching (e.g. buffering unconditional stores to a shared map, a potential optimization avenue for #18177).

funny-falcon commented 7 years ago

I'm thinking about stat-collectors and high-performance RPC.

ianlancetaylor commented 7 years ago

@bcmills For counting, it seems to me you would use an inconsistent Do. If you need to avoid inconsistency while still using an inconsistent Do, have the combiner store the additional counts elsewhere and not modify the previously stored value. Presumably the combiner is only called in rare cases, so the speed of that rare case is not too important. You could even mitigate that cost by stacking sync.Sharded values.

I don't actually see how to write a consistent Do that does not disturb the fast path of Get and Put at all.

One approach for buffering stores to a shared map would be a Do that removes all the values, replacing them with nil. Come to think of it, that would work for counters also. But it does interfere with the fast path.
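A rough sketch of that, assuming a hypothetical Do variant (takeEach below) that hands each value to the callback and leaves nil in its place:

type bufferedMap struct {
    mu     sync.Mutex
    shared map[string]int
    shards *sync.Sharded // each shard buffers a map[string]int of pending stores
}

// Flush merges and clears the per-shard buffers. takeEach is invented for
// this sketch: it removes each shard's value, replacing it with nil.
func (b *bufferedMap) Flush() {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.shards.takeEach(func(v interface{}) {
        if v == nil {
            return
        }
        for k, n := range v.(map[string]int) {
            b.shared[k] += n
        }
    })
}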

@funny-falcon Can you expand on what you mean by "high-performance RPC"? I don't see why you need a global distributed value for RPC.

balasanjay commented 7 years ago

Perhaps stating the obvious, but one slightly tricky thing is deciding when to GC stale per-thread values after GOMAXPROCS is decreased.

For some use-cases (e.g. distributed mutexes), they will presumably have a reference keeping the stale values alive.

For others (e.g. counters), you'd need to keep the value around until it's been accumulated.

Also, in the pony category: if I want a distributed int64 counter, its shards would need sufficient padding to avoid false sharing, but if I allocate multiple such counters, they could end up instantiated within each other's padding, so to speak. I think this could maybe be built in user space on top of a more low-level API, but if it's possible for the API to provide it directly, that'd be great.

funny-falcon commented 7 years ago

@ianlancetaylor I maintain a connector to an in-memory transactional database capable of serving more than 1M requests per second.

To be able to send requests at that rate, and to scale smoothly with CPU cores, I have to shard the connector's internal data structures. (I also need to build a custom hash table and custom timers, but sharding is the basis of the improvement.) Without sharding there is too much lock contention.

If there were shard-to-CPU alignment (even a non-strict one), it would help further reduce lock contention and improve CPU-cache utilization.

As I understand it, most users don't change GOMAXPROCS on the fly, so I'd prefer a fixed number of preallocated shards, because then I can easily map responses from the server back to a shard.

I still think a simple low-level "ProcHint" API (as proposed in https://github.com/golang/go/issues/18590#issuecomment-271718304) would be sufficient. But if we want the API to look "higher level", I'd be satisfied with FixSharded.

funny-falcon commented 7 years ago

Link to improved ProcHint api proposal: https://github.com/golang/go/issues/18590#issuecomment-271834288

funny-falcon commented 7 years ago

Excuse me for going a bit off-topic: how often is GOMAXPROCS changed at runtime in production workloads? What patterns exist for such changes?

rsc commented 7 years ago

Programs might change GOMAXPROCS in response to getting more or less of a machine as co-tenancy changes.

Sajmani commented 7 years ago

I'll document some concrete use case examples:

1) Scalar counters incremented in Go server request handlers

func serveRequest(...) {
  requestCounter.Add(1)
}
func serveCounter(w responseWriter, ...) {
  w.Print(requestCounter.Count())
}

I believe with @aclements 's API we would implement this as:

func (c *counter) Add(n int) {
  if v, ok := c.shards.Get().(int); ok {
    v += n
    c.shards.Put(v)
  } else {
    c.shards.Put(n)
  }
}
func (c *counter) Count() (v int) {
  c.shards.Do(func(shard interface{}) {
    v += shard.(int)
  })
  return v
}

2) Non-scalar (map) metrics in Go server request handlers

func serveRequest(...) {
  user := userFromRequest(req)
  userRequestCounter.Add(1, user)
}
func serveCounter(w responseWriter, ...) {
  w.Print(userRequestCounter.Map())
}

I believe with @aclements 's API we would implement this as:

func (c *mapCounter) Add(n int, key string) {
  if m, ok := c.shards.Get().(map[string]int); ok {
    m[key] += n
    c.shards.Put(m)
  } else {
    c.shards.Put(map[string]int{key: n})
  }
}
func (c *mapCounter) Map() map[string]int {
  m := make(map[string]int)
  c.shards.Do(func(shard interface{}) {
    for key, count := range shard.(map[string]int) {
      m[key] += count
    }
  })
  return m
}

@aclements does that all look right?

bradfitz commented 7 years ago

we would implement this as

if v, ok := c.shards.Get().(int); ok {
  v += n
  c.shards.Put(v)
} else {
  c.shards.Put(n)
}

... allocating an integer for every increment? (Storing an int in an interface{} causes an allocation.)

bcmills commented 7 years ago

And my experience with sync.Pool is that calling Put with the type-asserted value tends to introduce an extra allocation too (for the interface{} value).

So we'd probably actually want to write it as:

p := c.shards.Get()
if p == nil {
  p = new(int)
}
*(p.(*int)) += n
c.shards.Put(p)

(Or else we'll want to fix the extra allocation for the Put call through some compiler optimization.)

jonhoo commented 7 years ago

I wonder if this could also be used to build a better multi-core RWLock similar to my drwmutex. From the proposal thus far, it sounds like it might be tricky to implement something like "as a writer, take all locks, and disallow new locks to be added while you hold those locks".

bcmills commented 7 years ago

@jonhoo

it sounds like it might be tricky to implement something like "as a writer, take all locks, and disallow new locks to be added while you hold those locks".

Tricky but possible, I think. You can add a sync.Mutex to be acquired in the exclusive path and when adding new locks, but you have to be careful about lock-ordering.

The harder part is that if you want to satisfy the existing sync.RWMutex API you have to handle the possibility that the RUnlock call occurs on a different thread from the RLock. One thing you could do is keep a locked bool on each of the sharded values and add a slow-path fallback for the case where your thread's read-lock isn't the one you locked.

A sketch with the blocking version of Get and Put:


type readerLock struct {
  locked bool
  sync.Mutex
}

func (m *RWMutex) RLock() {
  i := m.readers.Get()
  l, _ := i.(*readerLock)
  if l != nil && !l.locked {
    // Fast path: lock this shard's readerLock, then put it back so that a
    // writer (or RUnlock's slow path) can find it via Do.
    l.locked = true
    l.Lock()
    m.readers.Put(l)
    return
  }
  if l != nil {
    m.readers.Put(i) // Already locked by another reader: put it back and allocate a new one.
  }

  l = &readerLock{locked: true}
  l.Lock()
  m.add.Lock() // Serialize with writers scanning the set of reader locks.
  m.readers.Put(l)
  m.add.Unlock()
}

func (m *RWMutex) RUnlock() {
  i := m.readers.Get()
  l, _ := i.(*readerLock)
  if l != nil && l.locked {
    // Fast path: this shard holds the lock taken in RLock.
    l.locked = false
    l.Unlock()
    m.readers.Put(l)
    return
  }
  if l != nil {
    m.readers.Put(i)
  }
  // Slow path: RUnlock ran on a different thread than RLock; search all shards.
  unlocked := false
  m.readers.Do(func(i interface{}) {
    if unlocked {
      return
    }
    l := i.(*readerLock)
    if l.locked {
      l.locked = false
      l.Unlock()
      unlocked = true
    }
  })
}
funny-falcon commented 7 years ago

The technique used by @valyala to improve timers in https://go-review.googlesource.com/#/c/34784/ shows exactly why runtime.ProcHint() (or Pid()) is useful for high-performance multi-core programming.

I agree with @valyala that ProcMaxHint is most likely not needed. It is enough to have a fixed number of "shards".

So even if you decide to stick with sync.Sharded, please also add sync.FixSharded with preallocated "values".

bcmills commented 7 years ago

That approach looks like it would be just as easy to implement (and perhaps with less cross-core contention) using Sharded.

Note that there's nothing stopping a user of the proposed Sharded API from using a fixed set of values with it. FixSharded is redundant.

funny-falcon commented 7 years ago

@bcmills, well, yes: if the allocator function returns pointers to preallocated values, then it behaves like FixSharded. You are right.

robpike commented 7 years ago

Please don't call this sharding, which is either an obscure English term or a term of art for distributed computing. Neither fits. It's a per-CPU thing, so call it something like perCPU or percpu.

Sajmani commented 7 years ago

I find sharding to be a useful term here since the operations on these shards are similar to those we see in parallel data processing systems like mapreduce: shards are partitions of a larger dataset; they can be operated on in parallel; they can be combined into fewer shards; and they can be reduced into a final value. But this may just be my distributed systems background influencing the way I think about this problem. If the only meaningful implementation of this functionality is one value per CPU, then naming it PerCPU is fine.


robpike commented 7 years ago

If the functional idea is that it should be per CPU or per thread, then its name should reflect that. Sharding says nothing about the axis on which the item is split, but as I understand this proposal its point is precisely to achieve the particular storage properties of per-CPU/thread memory, and its name should reflect that precision.

rsc commented 7 years ago

It's actually not per-CPU or per-thread. It's just split up enough to avoid contention across the goroutines that are using it. Exactly how it is split up depends on internal details as well as how many goroutines need one at the same time. Maybe SplitValue?

rsc commented 7 years ago

Per @aclements, this proposal is stuck because no one can agree on what problem is most important to solve. Putting on hold until that's resolved.

funny-falcon commented 7 years ago

oooouugggghhhhh :-( I'm a sad panda. Then maybe just expose runtime.Pid()? Everyone who needs this feature could then solve their problem in their own way. I know, you will not agree :-(

Sajmani commented 7 years ago

@rsc @aclements I just want to be able to count fast. The critical use case is any global state like metrics or other counters that are incremented frequently, from many goroutines, and read infrequently. Go code can create sharded counters already using an array, but there's no locality between the goroutine and the shard it selects. The key evaluation for whether this work is successful is increased RPC throughput in services that maintain per-request metrics.

bradfitz commented 7 years ago

Assigning to @aclements to post a summary here.

rhysh commented 7 years ago

I ran into a similar problem recently, and also solved it by abusing sync.Pool to estimate the current P's id to reduce contention. My use-case is a bit different from the ones listed so far, but my approach to address it has been similar.

We have some high-volume RPC servers that need to emit some information about each request as UDP packets. It's not practical to create a new UDP socket for each request, and sharing a single socket for the whole process results in a lot of CPU cycles spent in internal/poll.(*fdMutex).rwlock. Writing to a single shared []byte buffer doesn't work either because there'll be a lot of time spent in sync.(*Mutex).Lock (which got much worse after upgrading to Go 1.9). Using locking rather than channels means there's no way for a would-be user to time out and give up if they've been waiting too long. For my use-case, writing directly to a UDP connection is best so we don't have to worry about losing data if the process exits quickly or crashes.

For my use case:

  1. Creating a new resource takes time (could require a DNS lookup if we're not careful), and can result in an error.
  2. The resource needs to be cleaned up if the "pool" size changes.
  3. Using the resource can result in an error.
  4. Would-be users may want to give up if they'd need to wait too long.
  5. Users would prefer access to any resource if it can happen quickly, rather than waiting for the resource "assigned" to this P. Locality certainly improves performance, but forcing locality leads to high variance in the performance when there are many concurrent users.

My workaround is to create 2*GOMAXPROCS UDP connections, and to call an API like this:

func(ctx context.Context, fn func(n int) error) error

The provided closure is called once (unless the context expires), and while it runs has exclusive access to the resource at offset n. (Or the resource could be passed directly via interface{}.)
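A minimal sketch of that shape, with a fixed set of resources guarded by one-slot channel semaphores (the names and the selection policy here are illustrative, not the actual code):

type resourcePool struct {
    sems []chan struct{} // one-slot semaphore per resource
}

func newResourcePool(n int) *resourcePool {
    p := &resourcePool{sems: make([]chan struct{}, n)}
    for i := range p.sems {
        p.sems[i] = make(chan struct{}, 1)
        p.sems[i] <- struct{}{} // resource i starts out available
    }
    return p
}

// withResource runs fn with exclusive access to resource n, or gives up if
// ctx expires before any resource frees up.
func (p *resourcePool) withResource(ctx context.Context, fn func(n int) error) error {
    // First pass: try every slot without blocking. A per-P hint could pick a
    // better starting index for locality.
    for i := range p.sems {
        select {
        case <-p.sems[i]:
            err := fn(i)
            p.sems[i] <- struct{}{} // release
            return err
        default:
        }
    }
    // All busy: block on one slot or give up when the context expires.
    select {
    case <-p.sems[0]:
        err := fn(0)
        p.sems[0] <- struct{}{}
        return err
    case <-ctx.Done():
        return ctx.Err()
    }
}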

My workaround doesn't implement my entire wishlist (it doesn't steal resources from other Ps, it doesn't deal with variable GOMAXPROCS), but it did make my package completely disappear from the execution tracer's synchronous blocking profile, and it doesn't suffer from sync.Pool's tendency to create a zillion resources in response to a burst in demand.

funny-falcon commented 7 years ago

@rhysh How did you solve cleaning sync.Pool on GC?

bcmills commented 7 years ago

@funny-falcon You can work around it by (ab)using finalizers, as illustrated in https://golang.org/cl/35676.

ianlancetaylor commented 7 years ago

@rhysh What do you think of #17520?

rhysh commented 7 years ago

@funny-falcon The sync.Pool entries are structs that include an index and a channel. The channel is used as a semaphore for handing off the exclusive right to use the resource, and the index describes which resource to use. The sync.Pool gets cleared at every GC, but we can easily reconstruct new equivalent values to put in the pool; an atomically-accessed counter will give us a new index, mod the (fixed) number of resources we have. This means there's a bit of garbage, but it doesn't require finalizers.

@ianlancetaylor Faster UDP writes would help my application a lot. Is it reasonable to expect a single UDP connection to support the needs of a whole server, both as far as Go and the OS are concerned?

ianlancetaylor commented 7 years ago

@rhysh For Go, each access to the UDP connection would require an atomic compare-and-swap, so sharing a single UDP connection among many goroutines will tend to share that memory location across many cores. I would think that would be fairly inconsequential compared to the cost of actually writing to the socket, but I don't actually know.

For the OS, I have no idea.

cespare commented 7 years ago

It would be great to reach consensus about what problem is most important to solve so that this proposal can be unstuck. Maybe adding our experience will help.

At our company we have the exact issue that @Sajmani describes in https://github.com/golang/go/issues/18802#issuecomment-294529939: we need servers running on many-cpu systems to be able to increment counters with low overhead.

In the meantime, I'm perusing the buffet of hacky/unsafe workarounds.

Sajmani commented 7 years ago

Hi Caleb, Do you have any benchmarks that show the precise nature of your issue and the workarounds? Per Russ's earlier comment, "this proposal is stuck because no one can agree on what problem is most important to solve." If you have a benchmark, that will at least identify one important problem to solve (and allow us to evaluate a solution). Thanks, S


bcmills commented 6 years ago

Another use-case for per-thread / sharded values relates to https://github.com/golang/go/issues/20387.

Libraries may need to use their own rand.Source to avoid fragile assumptions about the seeding (or lack thereof) of the default Source, but individual rand.Source instances require locking, which could introduce a lot of contention in rand-heavy code.
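As a sketch (the Sharded constructor here is hypothetical), each shard could own a separately locked rand.Source; with a non-exclusive Get the per-shard mutex is still needed, but it would rarely be contended:

type lockedSource struct {
    mu  sync.Mutex
    src rand.Source
}

// shardedRand holds one lockedSource per shard. NewSharded is hypothetical.
var shardedRand = sync.NewSharded(func() interface{} {
    return &lockedSource{src: rand.NewSource(time.Now().UnixNano())} // seeding is illustrative only
})

func Int63() int64 {
    ls := shardedRand.Get().(*lockedSource)
    ls.mu.Lock()
    n := ls.src.Int63()
    ls.mu.Unlock()
    return n
}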

cespare commented 6 years ago

@Sajmani Here's a demo: https://github.com/cespare/misc/blob/master/fastcount/fastcount.go.

It's not really all that interesting -- it basically demonstrates what we all know about contention on multi-core CPUs. But maybe it will give us something concrete to discuss.

As one example from our internal code, we have a program doing performance-critical CPU-bound work. It uses atomic counters for recording details about certain operations. The program currently runs on 36-core machines and we may move it to larger hardware soon.

With this program, we need to be careful what we count, because adding counters into the fast path is surprisingly slow (hundreds of ns). In the past, adding counters in the critical path has caused unacceptable throughput regressions.

My demo program shows this sort of simple counter implemented in three different ways:

  1. with a sync.Mutex
  2. as an int64 modified via atomic.AddInt64 and friends
  3. as a per-CPU sharded set of int64s where we use the RDTSCP instruction from assembly to get the current core ID

Given a flavor of counter, it simply spins up as many goroutines as it has available CPUs and tries to count as fast as possible. If we restrict it to a single CPU, they're all pretty cheap:

$ numactl -C 0 ./fastcount -type mutex
num workers: 1
57466000 incs/sec; 57466000 incs/sec/worker; avg latency 17ns
57712000 incs/sec; 57712000 incs/sec/worker; avg latency 17ns
57410000 incs/sec; 57410000 incs/sec/worker; avg latency 17ns
^C
$ numactl -C 0 ./fastcount -type atomic
num workers: 1
135590000 incs/sec; 135590000 incs/sec/worker; avg latency 7ns
136088000 incs/sec; 136088000 incs/sec/worker; avg latency 7ns
135568000 incs/sec; 135568000 incs/sec/worker; avg latency 7ns
^C
$ numactl -C 0 ./fastcount -type rdtscp
num workers: 1
45226000 incs/sec; 45226000 incs/sec/worker; avg latency 22ns
45424000 incs/sec; 45424000 incs/sec/worker; avg latency 22ns
45178000 incs/sec; 45178000 incs/sec/worker; avg latency 22ns
^C

If we run on all 72 cores on my test server, mutexes and atomics scale terribly:

$ ./fastcount -type mutex
num workers: 72
10575990 incs/sec; 146889 incs/sec/worker; avg latency 6.807µs
10628817 incs/sec; 147622 incs/sec/worker; avg latency 6.774µs
10651535 incs/sec; 147938 incs/sec/worker; avg latency 6.759µs
^C
$ ./fastcount -type atomic
num workers: 72
46358625 incs/sec; 643870 incs/sec/worker; avg latency 1.553µs
46468854 incs/sec; 645401 incs/sec/worker; avg latency 1.549µs
46242229 incs/sec; 642253 incs/sec/worker; avg latency 1.557µs
^C
$ ./fastcount -type rdtscp
num workers: 72
2936379161 incs/sec; 40783044 incs/sec/worker; avg latency 24ns
2998082182 incs/sec; 41640030 incs/sec/worker; avg latency 24ns
2971579024 incs/sec; 41271931 incs/sec/worker; avg latency 24ns
^C

The mutex increment was 398x slower and the atomic increment was 222x slower.

Here's a chart which compares the latency of these three counter types when restricted to various numbers of CPUs. Note the log10 scale; Google sheets didn't want to draw a bar for the value of 7 in that first group for some reason.

[chart: increment latency for the three counter types across varying CPU counts, log10 scale]

Also, note that my RDTSCP-based implementation has quite a bit of overhead which a runtime-integrated solution doesn't have. RDTSCP is an expensive instruction and the assembly function cannot be inlined. However, the runtime can quickly look up the current P's ID. I expect that a sharded counter with the runtime's help like that should have an increment latency of <10ns on my hardware.
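For reference, a sharded counter built on such a hint might look roughly like this (procHint below stands in for a hypothetical runtime-provided shard index; the padding avoids false sharing):

type shard struct {
    n int64
    _ [56]byte // pad to a 64-byte cache line to avoid false sharing
}

type Counter struct {
    shards []shard // fixed at construction, e.g. one per P
}

func (c *Counter) Add(n int64) {
    // procHint is hypothetical: an approximate index for the current P.
    atomic.AddInt64(&c.shards[procHint()%len(c.shards)].n, n)
}

func (c *Counter) Value() (total int64) {
    for i := range c.shards {
        total += atomic.LoadInt64(&c.shards[i].n)
    }
    return total
}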

dvyukov commented 6 years ago

/sub

There was also this proposal with counters, pools and scalable mutexes: https://codereview.appspot.com/4850045/

balasanjay commented 6 years ago

I spent some time thinking about this issue this week, and I eventually concluded that @bcmills's initial comment is spot on.

That API addresses all of the use-cases mentioned here, it should be relatively straightforward to implement, and the implementation should be very fast (I counted 4 atomic loads in the fast path once it's warm).

I wrote up a more long-form proposal and will send it out on Monday (I left it on a work machine that appears to have dropped off the network).

gopherbot commented 6 years ago

Change https://golang.org/cl/118135 mentions this issue: design: add 18802-percpu-sharded.md

balasanjay commented 6 years ago

Alright, sent a concrete proposal doc to the proposal repo. Happy to move it elsewhere, if people prefer.

balasanjay commented 6 years ago

Proposal available here: https://github.com/golang/proposal/blob/master/design/18802-percpu-sharded.md.

Comments welcome.

philhofer commented 6 years ago

So, Paul McKenney's "Is Parallel Programming Hard, And, If So, What Can You Do About It?" has an entire chapter called "counting," and it outlines five or six different use cases for concurrent counters and a very different implementation for each use case. And that's just counters. So when @Sajmani says "I just want to count fast," it could mean one of a number of things in practicality.

I don't think there is a design that covers a wide number of use cases while simultaneously satisfying the Go team's desire to hide footguns from Go users. For instance, if you had access to a goroutine "id" and the ability to turn preemption on or off, you'd probably have enough to implement RCU lists and trees, concurrent counters of various flavors, and more. But I have a lot of trouble imagining this implementation of Go ever exposing those features to users, even in spite of the fact that folks like @cespare have to resort to home-cooked solutions that are equally dangerous.

@balasanjay 's design is basically McKenney's "array-based statistical counters" (the relaxed variety). The "order independent accumulators" section really wants a faster channel implementation (which I remember @dvyukov implementing and then deciding not to merge), and the "RWLock" section really wants RCU.

I guess what I'm saying is that I think the design should just be atomic.Counter, and then we should come up with separate solutions for the "RWLock" and "order independent accumulator" cases.

funny-falcon commented 6 years ago

@philhofer, no. I need 'sharded' for an RPC client: there are N reusable byte buffers to serialize requests into, and I need to contend on them as little as possible. Without 'sharded' I need:

Now I use "poor man's sharding", i.e. just an atomic counter indexing into an array of shards. But a real "sharded" would be much better, because it would lead to better byte-buffer utilization as well as less contention.