
proposal: add container/queue #27935


christianrpetrin commented 5 years ago

Proposal: Built in support for high performance unbounded queue

Author: Christian Petrin.

Last updated: November 26, 2018

Discussion at: https://github.com/golang/go/issues/27935

Design document at https://github.com/golang/proposal/blob/master/design/27935-unbounded-queue-package.md

Abstract

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

Queues are a very old, well-established, and well-known concept in computer science, yet Go doesn't provide a specialized, safe-to-use, performant, and issue-free unbounded queue implementation.

Buffered channels provide an excellent option to be used as a queue, but buffered channels are bounded and so don't scale to support very large data sets. The same applies to the standard ring package.
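To illustrate the bound, a small sketch (illustrative only, not from the proposal) shows a buffered channel behaving as a fixed-capacity FIFO queue: once the buffer is full, a plain send blocks the producer.

```go
package main

import "fmt"

func main() {
	// A buffered channel works as a bounded FIFO queue with capacity 2.
	q := make(chan int, 2)
	q <- 1
	q <- 2

	// The buffer is now full: a plain send (q <- 3) would block this
	// goroutine. A non-blocking attempt makes the bound visible.
	select {
	case q <- 3:
		fmt.Println("enqueued")
	default:
		fmt.Println("queue full, send would block")
	}

	fmt.Println(<-q, <-q) // 1 2 (FIFO order)
}
```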

The standard list package can be used as the underlying data structure for building unbounded queues, but the performance yielded by this linked-list-based implementation is not optimal.

Implementing a queue using slices as suggested here is a feasible approach, but the performance yielded by such an implementation can be abysmal in some high-load scenarios.
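A minimal version of such a slice-backed queue (my sketch, not taken from the linked suggestion) makes the cost visible: growth reallocates and copies every element, and dequeuing by re-slicing never reclaims the front of the backing array.

```go
package main

import "fmt"

// SliceQueue is a naive slice-backed FIFO queue.
type SliceQueue struct {
	items []int
}

// Enqueue appends a value. When len == cap, append allocates a larger
// backing array and copies all existing elements over -- an O(n) step.
func (q *SliceQueue) Enqueue(v int) {
	q.items = append(q.items, v)
}

// Dequeue removes the front value by re-slicing. The dequeued slot is
// not freed: the backing array's head stays allocated until the whole
// array is reallocated or becomes unreachable.
func (q *SliceQueue) Dequeue() (int, bool) {
	if len(q.items) == 0 {
		return 0, false
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v, true
}

func main() {
	var q SliceQueue
	for i := 1; i <= 3; i++ {
		q.Enqueue(i)
	}
	for {
		v, ok := q.Dequeue()
		if !ok {
			break
		}
		fmt.Println(v)
	}
}
```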

Background

Queues that grow dynamically have many uses. As an example, I'm working on a logging system called CloudLogger that sends all logged data to external log management systems, such as Stackdriver and CloudWatch. External logging systems typically rate limit how much data their service will accept for a given account and time frame. So in a scenario where the hosting application logs more data than the log management system will accept at a given moment, CloudLogger has to queue the extra logs and send them to the log management system at a pace the system will accept. As there's no telling how much data will have to be queued, since it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to use. Buffered channels are not ideal in this scenario, as they have a limit on how much data they will accept; once that limit has been reached, the producers (goroutines adding to the channel) start to block, making adding to the channel an "eventually" synchronous operation. A fully asynchronous operation is highly desirable in this scenario, as logging data should not significantly slow down the hosting application.

The above problem is one that, potentially, every system that calls another system faces. And in the cloud and microservices era, this is an extremely common scenario.

Due to the lack of support for built-in unbounded queues in Go, Go engineers are left to either: 1) research and use external packages, or 2) build their own queue implementation.

Both approaches are riddled with pitfalls.

Using external packages, especially in enterprise-level software, requires a lot of care, as using external, potentially untested, and hard-to-understand code can have unwanted consequences. This problem is made much worse by the fact that, currently, there's no well-established and widely disseminated open source Go queue implementation, according to this Stack Overflow discussion, this GitHub search for Go queues, and Awesome Go.

Building a queue, on the other hand, might sound like a compelling option, but building an efficient, high-performance, bug-free unbounded queue is a hard job that requires a pretty solid computer science foundation, as well as a good deal of time to research different design approaches, test different implementations, make sure the code is free of bugs and memory leaks, etc.

In the end, what Go engineers have been doing up to this point is building their own queues, which are for the most part inefficient and can have disastrous, yet hidden, performance and memory issues. As examples of poorly designed and/or implemented queues, the approaches suggested here and here (among many others) require a linear copy of the internal slice for resizing purposes. Some implementations also have memory issues, such as an ever-expanding internal slice and memory leaks.
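The memory-leak pitfall is subtle enough to deserve a sketch (mine, not from the linked answers): when elements are pointers, dequeuing by re-slicing leaves the dequeued pointer reachable through the backing array, so the garbage collector cannot reclaim the pointed-to value.

```go
package main

import "fmt"

// popLeaky dequeues by re-slicing only. The slice header moves forward,
// but the backing array -- and whatever the dequeued slot points to --
// stays reachable for as long as the array itself does.
func popLeaky(q []*int) (*int, []*int) {
	v := q[0]
	return v, q[1:]
}

// popSafe clears the dequeued slot first, so the garbage collector can
// reclaim the pointed-to value even while the backing array lives on.
func popSafe(q []*int) (*int, []*int) {
	v := q[0]
	q[0] = nil
	return v, q[1:]
}

func main() {
	q := make([]*int, 0, 4)
	for i := 1; i <= 3; i++ {
		v := i
		q = append(q, &v)
	}

	var got *int
	got, q = popSafe(q)
	fmt.Println(*got, len(q)) // first value, remaining length
	_, q = popLeaky(q)
	fmt.Println(len(q))
}
```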

Proposal

I propose to add a new package, "container/queue", to the standard library to support in-memory unbounded queues. The proposed queue implementation offers excellent performance and very low memory consumption when compared to three promising open source implementations (gammazero, phf, and juju); to using Go channels as a queue; to the standard list package as a queue; and to six other experimental queue implementations.

The proposed queue implementation offers the most balanced approach to performance under different loads, being significantly faster while still using less memory than every other queue implementation in the tests.

The closest data structure Go has to offer for building dynamically growing queues for large data sets is the standard list package. When comparing the proposed solution to using the list package as an unbounded queue (refer to "BenchmarkList"), the proposed solution is consistently faster while also displaying a much lower memory footprint.

Reasoning

There are two well-accepted approaches to implementing queues when it comes to the underlying data structure:

1) Using a linked list
2) Using an array

A linked list as the underlying data structure for an unbounded queue has the advantage of scaling efficiently when the underlying data structure needs to grow to accommodate more values. This is because the existing elements don't need to be repositioned or copied around when the queue needs to grow.

However, there are a few concerns with this approach:

1) The use of prev/next pointers for each value requires a good deal of extra memory
2) Because each "node" in the linked list can be allocated far away from the previous one, navigating through the list can be slow due to its bad memory locality properties
3) Adding new values always requires new memory allocations and pointers being set, hindering performance

On the other hand, using a slice as the underlying data structure for unbounded queues has the advantage of very good memory locality, making retrieval of values faster compared to linked lists. An "allocate more than needed right now" approach can also easily be implemented with slices.

However, when the slice needs to expand to accommodate new values, a well-adopted strategy is to allocate a new, larger slice, copy over all elements from the previous slice into the new one, and use the new one to add the new elements.

The problem with this approach is the obvious need to copy all the values from the older, smaller slice into the new one, yielding poor performance when the number of values that need copying is fairly large.

Another potential problem is a theoretical lower limit on how much data slices can hold: slices, like arrays, have to allocate their positions in sequential memory addresses, so the maximum number of items the queue would ever be able to hold is the maximum size a slice can be allocated at on that particular system at any given moment. Due to modern memory management techniques such as virtual memory and paging, this is a very hard scenario to corroborate through practical testing.

Nonetheless, this approach doesn't scale well with large data sets.

Having said that, there's a third, newer approach to implementing unbounded queues: use fixed size linked slices as the underlying data structure.

The fixed-size linked slices approach is a hybrid of the first two, providing the good memory locality arrays have alongside the efficient growing mechanism linked lists offer. It is also not limited by the maximum size a slice can be allocated at, being able to hold and deal efficiently with a theoretically much larger amount of data than pure slice-based implementations.
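As an illustration, a stripped-down linked-slices queue (my sketch with a tiny chunk size, not the proposed implementation) shows the idea: fixed-size chunks chained by a single next pointer, so growth never copies existing elements.

```go
package main

import "fmt"

const chunkSize = 4 // tiny for the example; the proposal uses 128

// node is one fixed-size chunk of the queue. Chunks are chained with a
// single next pointer, so growing the queue just links a new chunk at
// the tail -- no elements are ever copied.
type node struct {
	vals []int
	next *node
}

// chunkedQueue is a minimal linked-slices FIFO queue.
type chunkedQueue struct {
	head, tail *node
	headIdx    int // next position to dequeue within head.vals
}

func (q *chunkedQueue) push(v int) {
	if q.tail == nil || len(q.tail.vals) == cap(q.tail.vals) {
		n := &node{vals: make([]int, 0, chunkSize)}
		if q.tail == nil {
			q.head = n
		} else {
			q.tail.next = n
		}
		q.tail = n
	}
	q.tail.vals = append(q.tail.vals, v)
}

func (q *chunkedQueue) pop() (int, bool) {
	if q.head == nil || q.headIdx >= len(q.head.vals) {
		return 0, false
	}
	v := q.head.vals[q.headIdx]
	q.headIdx++
	if q.headIdx == cap(q.head.vals) { // chunk fully consumed
		q.head = q.head.next
		q.headIdx = 0
		if q.head == nil {
			q.tail = nil
		}
	}
	return v, true
}

func main() {
	var q chunkedQueue
	for i := 1; i <= 10; i++ {
		q.push(i)
	}
	for {
		v, ok := q.pop()
		if !ok {
			break
		}
		fmt.Print(v, " ")
	}
	fmt.Println()
}
```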

Rationale

Research

A first implementation of the new design was built.

The benchmark tests showed the new design was very promising, so I decided to research other possible queue designs and implementations with the goal of improving the first design and implementation.

As part of the research to identify the best possible queue designs and implementations, I implemented and probed a total of 7 experimental queue implementations. Below are a few of the most interesting ones.

Also as part of the research, I investigated and probed the open source queue implementations below.

The standard list package and buffered channels were probed as well.

Benchmark Results

Add and remove 100 items
Performance
![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-100-items-perf-main.jpg?raw=true "Benchmark tests")

Memory
![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-100-items-mem-main.jpg?raw=true "Benchmark tests")

Add and remove 100k items
Performance
![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-100k-items-perf-main.jpg?raw=true "Benchmark tests")

Memory
![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-100k-items-mem-main.jpg?raw=true "Benchmark tests")

Aggregated Results
Performance
![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-line-perf-main.jpg?raw=true "Benchmark tests")

Memory
![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-line-mem-main.jpg?raw=true "Benchmark tests")

Detailed, curated results can be found here

Aggregated, curated results can be found here

Given the above results, queueimpl7 (henceforth just "impl7") proved to be the most balanced implementation, being either faster than or very competitive with the alternatives in all test scenarios, from both a performance and a memory perspective.

Refer here for more details about the tests.

The benchmark tests can be found here.

Impl7 Design and Implementation

Impl7 was the result of the observation that some slice-based implementations, such as queueimpl1 and queueimpl2, offer phenomenal performance when the queue is used with small data sets.

For instance, comparing queueimpl3 (a very simple linked-slice implementation) with queueimpl1 (a very simple slice-based implementation), the results at adding 0 (init time only), 1, and 10 items are very favorable to impl1, from both a performance and a memory perspective.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl3.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%  472.53ns ± 7%   +6821.99%  (p=0.000 n=20+17)
/1-4         48.1ns ± 6%   492.4ns ± 5%    +924.66%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     695ns ± 8%     +30.57%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    2.50µs ± 4%     -21.69%  (p=0.000 n=18+19)
/1000-4      24.5µs ± 3%    23.6µs ± 2%      -3.33%  (p=0.000 n=19+19)
/10000-4      322µs ± 4%     238µs ± 1%     -26.02%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.3ms ±13%     -79.32%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B       2080.00B ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%   2080.0B ± 0%  +12900.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%     2152B ± 0%    +278.87%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    2.87kB ± 0%     -34.13%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    24.6kB ± 0%     -39.54%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     244kB ± 0%     -67.27%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%     -75.85%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           2.00 ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%    +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      11.0 ± 0%     -21.43%  (p=0.000 n=20+20)
/100-4          108 ± 0%       101 ± 0%      -6.48%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.01k ± 0%      +0.50%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%      +1.35%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%      +1.53%  (p=0.000 n=20+20)

Impl7 is a hybrid between a simple slice-based queue implementation for small data sets and the fixed-size linked slices approach for large data sets, an approach that scales really well, offering very good performance for small and large data sets alike.

The implementation starts by lazily creating the first slice to hold the first values added to the queue.

const (
    // firstSliceSize holds the size of the first slice.
    firstSliceSize = 1

    // maxFirstSliceSize holds the maximum size of the first slice.
    maxFirstSliceSize = 16

    // maxInternalSliceSize holds the maximum size of each internal slice.
    maxInternalSliceSize = 128
)

...

// Push adds a value to the queue.
// The complexity is amortized O(1).
func (q *Queueimpl7) Push(v interface{}) {
    if q.head == nil {
        h := newNode(firstSliceSize) // Returns a 1-sized slice.
        q.head = h
        q.tail = h
        q.lastSliceSize = maxFirstSliceSize
    } else if len(q.tail.v) >= q.lastSliceSize {
        n := newNode(maxInternalSliceSize) // Returns a 128-sized slice.
        q.tail.n = n
        q.tail = n
        q.lastSliceSize = maxInternalSliceSize
    }

    q.tail.v = append(q.tail.v, v)
    q.len++
}

...

// newNode returns an initialized node.
func newNode(capacity int) *Node {
    return &Node{
        v: make([]interface{}, 0, capacity),
    }
}

The very first slice is created with capacity 1. The implementation lets the builtin append function dynamically resize the slice up to 16 (maxFirstSliceSize) positions. After that, it switches to creating fixed-size, 128-position slices, which offer the best performance for data sets above 16 items.
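Pop is not shown in the excerpt above. A self-contained sketch consistent with the Push code (field names inferred from the excerpt; the real implementation may track the head position differently) could look like:

```go
package main

import "fmt"

const (
	firstSliceSize       = 1
	maxFirstSliceSize    = 16
	maxInternalSliceSize = 128
)

// Node is one linked slice, as in the excerpt above.
type Node struct {
	v []interface{}
	n *Node
}

func newNode(capacity int) *Node {
	return &Node{v: make([]interface{}, 0, capacity)}
}

// Queue mirrors the excerpt's structure; "s" tracks the position of the
// next value to pop within the head slice (an inferred detail).
type Queue struct {
	head, tail    *Node
	s             int
	len           int
	lastSliceSize int
}

// Push adds a value to the queue (adapted from the excerpt).
func (q *Queue) Push(v interface{}) {
	if q.head == nil {
		h := newNode(firstSliceSize)
		q.head, q.tail = h, h
		q.lastSliceSize = maxFirstSliceSize
	} else if len(q.tail.v) >= q.lastSliceSize {
		n := newNode(maxInternalSliceSize)
		q.tail.n = n
		q.tail = n
		q.lastSliceSize = maxInternalSliceSize
	}
	q.tail.v = append(q.tail.v, v)
	q.len++
}

// Pop removes and returns the front value, plus false when empty.
func (q *Queue) Pop() (interface{}, bool) {
	if q.len == 0 {
		return nil, false
	}
	v := q.head.v[q.s]
	q.head.v[q.s] = nil // release the value for the GC
	q.s++
	q.len--
	if q.s >= len(q.head.v) && q.head.n != nil {
		q.head = q.head.n // head slice drained; move to the next one
		q.s = 0
	}
	return v, true
}

func main() {
	var q Queue
	for i := 1; i <= 20; i++ {
		q.Push(i)
	}
	for v, ok := q.Pop(); ok; v, ok = q.Pop() {
		fmt.Print(v, " ")
	}
	fmt.Println()
}
```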

16 was chosen as the cutoff because it seems to provide the most balanced performance for small and large data sets, according to the array size benchmark tests. Above 16 items, growing the slice means allocating a new, larger one and copying all 16 elements from the previous slice into the new one. The append function's phenomenal performance can only compensate for the added copying of elements when the data set is very small: no more than 8 items in the benchmark tests. Above 8 items, the fixed-size slice approach is consistently faster and uses less memory, with 128-sized slices being allocated and linked together whenever the data structure needs to scale to accommodate new values.

Why 16? Why not 15 or 14?

The builtin append function, as of "go1.11 darwin/amd64", seems to double the slice size every time it needs to allocate a new one.

ts := make([]int, 0, 1)

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 1 item; output: 1

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 2 items; output: 2

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 3 items; output: 4

ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 5 items; output: 8

ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 9 items; output: 16

Since the append function will resize the slice from 8 to 16 positions, it makes sense to use all 16 already allocated positions before switching to the fixed size slices approach.

Design Considerations

Impl7 uses linked slices as its underlying data structure.

The reason for the choice comes from two main observations about slice-based queues:

1) When the queue needs to expand to accommodate new values, a new, larger slice needs to be allocated and used
2) Allocating and managing large slices is expensive, especially in an overloaded system with little available physical memory

To help clarify the scenario, below is what happens when a slice-based queue that already holds, say, one billion items needs to expand to accommodate a new item.

Slice based implementation

The same scenario for impl7 plays out like below.

Impl7

Impl7 never copies data around, but slice-based implementations do, and if the data set is large, it doesn't matter how fast the copying algorithm is: the copying has to be done and will take some time.

The decision to use linked slices was also the result of the observation that slices go to great lengths to provide predictable, indexed positions. A hash table, for instance, absolutely needs this property, but a queue doesn't. So impl7 completely gives up this property and focuses on what really matters: add to the end, retrieve from the head. No copying around and repositioning of elements is needed for that. So when a slice goes to great lengths to provide that functionality, the whole work of allocating new arrays and copying data around is wasted. None of it is necessary, and this work costs dearly for large data sets, as observed in the tests.

Impl7 Benchmark Results

Below compares impl7 with a few selected implementations.

The test names are formatted as shown below.

Examples:


Standard list used as a FIFO queue vs impl7.

benchstat rawresults/bench-list.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         34.9ns ± 1%     1.2ns ± 3%   -96.64%  (p=0.000 n=19+20)
/1-4         77.0ns ± 1%    68.3ns ± 1%   -11.21%  (p=0.000 n=20+20)
/10-4         574ns ± 0%     578ns ± 0%    +0.59%  (p=0.000 n=18+20)
/100-4       5.94µs ± 1%    3.07µs ± 0%   -48.28%  (p=0.000 n=19+18)
/1000-4      56.0µs ± 1%    25.8µs ± 1%   -53.92%  (p=0.000 n=20+20)
/10000-4      618µs ± 1%     260µs ± 1%   -57.99%  (p=0.000 n=20+18)
/100000-4    13.1ms ± 6%     3.1ms ± 3%   -76.50%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          48.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          96.0B ± 0%     48.0B ± 0%   -50.00%  (p=0.000 n=20+20)
/10-4          600B ± 0%      600B ± 0%      ~     (all equal)
/100-4       5.64kB ± 0%    3.40kB ± 0%   -39.72%  (p=0.000 n=20+20)
/1000-4      56.0kB ± 0%    25.2kB ± 0%   -55.10%  (p=0.000 n=20+20)
/10000-4      560kB ± 0%     243kB ± 0%   -56.65%  (p=0.000 n=20+20)
/100000-4    5.60MB ± 0%    2.43MB ± 0%   -56.66%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          20.0 ± 0%      15.0 ± 0%   -25.00%  (p=0.000 n=20+20)
/100-4          200 ± 0%       107 ± 0%   -46.50%  (p=0.000 n=20+20)
/1000-4       2.00k ± 0%     1.02k ± 0%   -48.95%  (p=0.000 n=20+20)
/10000-4      20.0k ± 0%     10.2k ± 0%   -49.20%  (p=0.000 n=20+20)
/100000-4      200k ± 0%      102k ± 0%   -49.22%  (p=0.000 n=20+20)

Impl7 is:


impl1 (simple slice-based queue implementation) vs impl7.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%    1.18ns ± 3%   -82.79%  (p=0.000 n=20+20)
/1-4         48.1ns ± 6%    68.3ns ± 1%   +42.23%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     578ns ± 0%    +8.55%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    3.07µs ± 0%    -3.74%  (p=0.000 n=18+18)
/1000-4      24.5µs ± 3%    25.8µs ± 1%    +5.51%  (p=0.000 n=19+20)
/10000-4      322µs ± 4%     260µs ± 1%   -19.23%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.1ms ± 3%   -80.60%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B          0.00B           ~     (all equal)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%      600B ± 0%    +5.63%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    3.40kB ± 0%   -22.02%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    25.2kB ± 0%   -38.25%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     243kB ± 0%   -67.47%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%   -75.84%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           0.00           ~     (all equal)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      15.0 ± 0%    +7.14%  (p=0.000 n=20+20)
/100-4          108 ± 0%       107 ± 0%    -0.93%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.02k ± 0%    +1.09%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.39%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

It's important to note that the performance and memory gains of impl7 grow roughly exponentially with the size of the data set, because slice-based implementations don't scale well, paying a higher and higher price, performance- and memory-wise, every time they need to grow to accommodate an ever-expanding data set.


phf (slice, ring based FIFO queue implementation) vs impl7.

benchstat rawresults/bench-phf.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         28.1ns ± 1%     1.2ns ± 3%   -95.83%  (p=0.000 n=20+20)
/1-4         42.5ns ± 1%    68.3ns ± 1%   +60.80%  (p=0.000 n=20+20)
/10-4         681ns ± 1%     578ns ± 0%   -15.11%  (p=0.000 n=18+20)
/100-4       4.55µs ± 1%    3.07µs ± 0%   -32.45%  (p=0.000 n=19+18)
/1000-4      35.5µs ± 1%    25.8µs ± 1%   -27.32%  (p=0.000 n=18+20)
/10000-4      349µs ± 2%     260µs ± 1%   -25.67%  (p=0.000 n=20+18)
/100000-4    11.7ms ±11%     3.1ms ± 3%   -73.77%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          16.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          696B ± 0%      600B ± 0%   -13.79%  (p=0.000 n=20+20)
/100-4       6.79kB ± 0%    3.40kB ± 0%   -49.94%  (p=0.000 n=20+20)
/1000-4      57.0kB ± 0%    25.2kB ± 0%   -55.86%  (p=0.000 n=20+20)
/10000-4      473kB ± 0%     243kB ± 0%   -48.68%  (p=0.000 n=20+20)
/100000-4    7.09MB ± 0%    2.43MB ± 0%   -65.77%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          15.0 ± 0%      15.0 ± 0%      ~     (all equal)
/100-4          111 ± 0%       107 ± 0%    -3.60%  (p=0.000 n=20+20)
/1000-4       1.02k ± 0%     1.02k ± 0%    +0.39%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.38%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:


Buffered channel vs impl7.

benchstat rawresults/bench-channel.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         30.2ns ± 1%     1.2ns ± 3%   -96.12%  (p=0.000 n=19+20)
/1-4         87.6ns ± 1%    68.3ns ± 1%   -22.00%  (p=0.000 n=19+20)
/10-4         704ns ± 1%     578ns ± 0%   -17.90%  (p=0.000 n=20+20)
/100-4       6.78µs ± 1%    3.07µs ± 0%   -54.70%  (p=0.000 n=20+18)
/1000-4      67.3µs ± 1%    25.8µs ± 1%   -61.65%  (p=0.000 n=20+20)
/10000-4      672µs ± 1%     260µs ± 1%   -61.36%  (p=0.000 n=19+18)
/100000-4    6.76ms ± 1%    3.07ms ± 3%   -54.61%  (p=0.000 n=19+20)

name       old alloc/op   new alloc/op   delta
/0-4          96.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4           112B ± 0%       48B ± 0%   -57.14%  (p=0.000 n=20+20)
/10-4          248B ± 0%      600B ± 0%  +141.94%  (p=0.000 n=20+20)
/100-4       1.69kB ± 0%    3.40kB ± 0%  +101.42%  (p=0.000 n=20+20)
/1000-4      16.2kB ± 0%    25.2kB ± 0%   +55.46%  (p=0.000 n=20+20)
/10000-4      162kB ± 0%     243kB ± 0%   +49.93%  (p=0.000 n=20+20)
/100000-4    1.60MB ± 0%    2.43MB ± 0%   +51.43%  (p=0.000 n=16+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          10.0 ± 0%      15.0 ± 0%   +50.00%  (p=0.000 n=20+20)
/100-4          100 ± 0%       107 ± 0%    +7.00%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.10%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.61%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.57%  (p=0.000 n=20+20)

Impl7 is:

The above is not really a fair comparison, as standard buffered channels don't scale (at all) and are meant for goroutine synchronization. Nonetheless, they can make for an excellent bounded FIFO queue option. Still, impl7 is consistently faster than channels across the board, although it uses considerably more memory.


Given its excellent performance under all scenarios, the hybrid approach of impl7 seems to be the ideal candidate for a high-performance, low-memory-footprint, general purpose FIFO queue.

For the above reasons, I propose to port impl7 to the standard library.

All raw benchmark results can be found here.

Internal Slice Size

Impl7 uses linked slices as its underlying data structure.

The size of the internal slice does influence performance and memory consumption significantly.

According to the internal slice size benchmark tests, larger internal slice sizes yield better performance and a lower memory footprint. However, the gains diminish dramatically as the slice size increases.

Below are a few interesting results from the benchmark tests.

BenchmarkMaxSubsequentSliceSize/1-4                20000         76836 ns/op       53967 B/op       2752 allocs/op
BenchmarkMaxSubsequentSliceSize/2-4                30000         59811 ns/op       40015 B/op       1880 allocs/op
BenchmarkMaxSubsequentSliceSize/4-4                30000         42925 ns/op       33039 B/op       1444 allocs/op
BenchmarkMaxSubsequentSliceSize/8-4                50000         36946 ns/op       29551 B/op       1226 allocs/op
BenchmarkMaxSubsequentSliceSize/16-4               50000         30597 ns/op       27951 B/op       1118 allocs/op
BenchmarkMaxSubsequentSliceSize/32-4               50000         28273 ns/op       27343 B/op       1064 allocs/op
BenchmarkMaxSubsequentSliceSize/64-4               50000         26969 ns/op       26895 B/op       1036 allocs/op
BenchmarkMaxSubsequentSliceSize/128-4              50000         27316 ns/op       26671 B/op       1022 allocs/op
BenchmarkMaxSubsequentSliceSize/256-4              50000         26221 ns/op       28623 B/op       1016 allocs/op
BenchmarkMaxSubsequentSliceSize/512-4              50000         25882 ns/op       28559 B/op       1012 allocs/op
BenchmarkMaxSubsequentSliceSize/1024-4             50000         25674 ns/op       28527 B/op       1010 allocs/op

Given that larger internal slices also mean potentially more allocated but unused memory in some scenarios, 128 seems to be the perfect balance between performance and the worst-case memory footprint.
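A benchmark of this shape can be reproduced with testing.Benchmark from a standalone program (a simplified sketch: a throwaway chunked push/pop stands in for the real queueimpl7 benchmark, and the chunk sizes shown are just examples).

```go
package main

import (
	"fmt"
	"testing"
)

// pushPop enqueues and dequeues n items through linked chunks of the
// given size -- the knob the table above varies. It returns the sum of
// the dequeued values as a correctness check.
func pushPop(n, chunkSize int) int {
	var chunks [][]int
	for i := 0; i < n; i++ {
		if len(chunks) == 0 || len(chunks[len(chunks)-1]) == chunkSize {
			chunks = append(chunks, make([]int, 0, chunkSize))
		}
		last := len(chunks) - 1
		chunks[last] = append(chunks[last], i)
	}
	sum, head, headIdx := 0, 0, 0
	for i := 0; i < n; i++ {
		sum += chunks[head][headIdx]
		headIdx++
		if headIdx == chunkSize {
			head++
			headIdx = 0
		}
	}
	return sum
}

func main() {
	// testing.Benchmark drives a benchmark function the same way
	// `go test -bench` would, without needing a _test.go file.
	for _, size := range []int{16, 128, 1024} {
		size := size
		r := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				pushPop(10000, size)
			}
		})
		fmt.Printf("chunk size %4d: %v\n", size, r)
	}
}
```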

Full results can be found here.

API

Impl7 implements the API methods below.

| Operation | Method |
| --- | --- |
| Add | `func (q *Queueimpl7) Push(v interface{})` |
| Remove | `func (q *Queueimpl7) Pop() (interface{}, bool)` |
| Size | `func (q *Queueimpl7) Len() int` |
| Return First | `func (q *Queueimpl7) Front() (interface{}, bool)` |

As nil values are considered valid queue values, similarly to the map data structure, "Front" and "Pop" return a second bool value to indicate whether the returned value is valid and whether the queue is empty.
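Usage would follow the comma-ok idiom familiar from maps. The sketch below mimics the proposed method set with a trivial slice-backed stand-in (miniQueue is mine, purely to make the example self-contained; the real implementation uses linked slices).

```go
package main

import "fmt"

// miniQueue mimics the proposed API surface (Push/Pop/Len/Front).
type miniQueue struct {
	vals []interface{}
}

func (q *miniQueue) Push(v interface{}) { q.vals = append(q.vals, v) }

func (q *miniQueue) Len() int { return len(q.vals) }

func (q *miniQueue) Front() (interface{}, bool) {
	if len(q.vals) == 0 {
		return nil, false
	}
	return q.vals[0], true
}

func (q *miniQueue) Pop() (interface{}, bool) {
	if len(q.vals) == 0 {
		return nil, false
	}
	v := q.vals[0]
	q.vals[0] = nil // let the GC reclaim the dequeued value
	q.vals = q.vals[1:]
	return v, true
}

func main() {
	var q miniQueue
	q.Push(nil) // nil is a valid value, as in maps
	q.Push("hello")

	// The second return value distinguishes "stored nil" from "empty".
	if v, ok := q.Pop(); ok {
		fmt.Println(v, ok) // <nil> true
	}
	for q.Len() > 0 {
		v, _ := q.Pop()
		fmt.Println(v)
	}
	_, ok := q.Front()
	fmt.Println(ok) // false: the queue is empty
}
```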

The reason for the above method names and signatures is the need to keep compatibility with existing Go data structures such as the list, ring, and heap packages.

Below are the method names used by the existing list, ring and heap Go data structures, as well as the new proposed queue.

| Operation | list | ring | heap | queue |
| --- | --- | --- | --- | --- |
| Add | PushFront/PushBack | Link | Push | Push |
| Remove | Remove | Unlink | Pop | Pop |
| Size | Len | Len | - | Len |
| Return First | Front | - | - | Front |

For comparison purposes, below are the method names C++, Java, and C# use in their queue implementations.

| Operation | C++ | Java | C# |
| --- | --- | --- | --- |
| Add | push | add/offer | Enqueue |
| Remove | pop | remove/poll | Dequeue |
| Size | size | - | Count |
| Return First | front | peek | Peek |

Drawbacks

The biggest drawback of the proposed implementation is the memory that may be allocated but left unused in its head and tail slices.

This scenario arises when exactly 17 items are added to the queue, causing the creation of a full-sized internal slice of 128 positions. Initially, only the first element in this new slice is used to store the added value; the other 127 elements are allocated but unused.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}
// Push 1 extra item that causes the creation of a new 128 sized slice to store this value.
q.Push(17)

// Pops the first 16 items to release the first slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Printf("%d bytes\n", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *127 /* head slice unused positions */))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 2040 bytes

The worst-case scenario arises when exactly 145 items are added to the queue and then 143 are removed. This leaves the queue holding a 128-sized slice as its head slice with only its last element in use, plus a separate 128-sized slice as its tail slice with only its first position in use.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}

// Push an additional 128 items to fill the first full sized slice (sized 128).
for i := 1; i <= 128; i++ {
   q.Push(i)
}

// Push 1 extra item that causes the creation of a new 128 sized slice to store this value,
// adding a total of 145 items to the queue.
q.Push(1)

// Pops the first 143 items to release the first dynamic slice (sized 16) and
// 127 items from the first full sized slice (sized 128).
for i := 1; i <= 143; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Printf("%d bytes\n", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *(127 /* head slice unused positions */ +127 /* tail slice unused positions */)))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 4072 bytes

The above code was run on Go version "go1.11 darwin/amd64".

Open Questions/Issues

Should this be a deque (double-ended queue) implementation instead? The deque could be used as a stack as well, but it would make more sense to have separate queue and stack implementations (like most mainstream languages have) instead of a deque that can be used as a stack (confusing). The stack is a very important computer science data structure as well, so I believe Go should have a specialized implementation for it too (provided the specialized implementation offers real value to users and not just a nicely named interface and methods).

Should "Pop" and "Front" return only the value instead of the value plus a second bool (which indicates whether the queue is empty)? The implication of the change is that adding nil values would no longer be valid, so "Pop" and "Front" would return nil when the queue is empty. Panicking should be avoided in libraries.

The 128-sized internal slice causes, in the worst-case scenario, 2040 bytes of memory to be allocated (on a 64-bit system) but not used. Switching to 64 means roughly half that memory would be used, with a slight ~2.89% performance drop (252813 ns vs 260137 ns). One could very reasonably argue that the extra performance is not worth the extra memory footprint. Should we change this value to 64, or maybe make it configurable?

Should we also provide an implementation that is safe for concurrent use? A specialized implementation that relies on atomic operations to update its internal indices and length could offer much better performance compared to a similar implementation that relies on a mutex.

With the impending release of generics, should we hold off releasing the new queue package until the new generics framework is released?

Should we implement support for the range keyword for the new queue? It could be done in a generic way so other data structures could also benefit from this feature. For now, IMO, this is a topic for another proposal/discussion.

Summary

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

I feel strongly this proposal should be accepted for the reasons below.

1) The proposed solution was well researched and probed, being dramatically and consistently faster than 6 other experimental queue implementations as well as 3 promising open source queue implementations, the standard list package, and buffered channels; it also consumes considerably less memory than every other queue implementation tested, except for buffered channels
2) The proposed solution uses a new, unique approach to building queues, yet its implementation is clean and extremely simple. The two main methods, "Push" and "Pop", are composed of only 16 and 19 lines of code (total), respectively. The proposed implementation also has proper tests with 100% test coverage and should require minimal maintenance moving forward
3) I'll implement any changes the Go community feels are needed for the proposed solution to be worthy of the standard library and the Go community


Update 11/26/2018.

Due to many suggestions to make the queue a deque and to deploy it as a proper external package, the deque package was built and deployed here. The proposal now is to add the deque package to the standard library instead of impl7. Refer here for details.

cznic commented 5 years ago

Unbound queues don't fit well on machines with finite memory.

christianrpetrin commented 5 years ago

@cznic I get your point, but I disagree with it for a number of reasons.

1) Modern devices, even mobile ones, come with gigabytes of memory available. Efficient data structures that allow engineers to use that much memory efficiently are very welcome
2) Go already offers a few unbounded data structures such as list, map and others. I don't understand why Go should stop now
3) In modern cloud computing, it is fairly easy to grow memory usage by distributing the growth across more than one device. The technique I'm using on CloudLogger, for instance, is to allow for infinite memory growth in a single VM, but configure Kubernetes to auto-scale based on memory. So when a node reaches a certain limit, say 50% of its used memory, Kubernetes will deploy a new pod and start to round robin the requests (which cause the memory usage to go up) between the old and new pods. This way I can effectively (and automatically, with no code changes) scale one device's memory into many others. Techniques like this are the bread and butter of modern cloud computing. An amazing language like Go should not restrict engineers from using such techniques

Maybe the term "unbounded" is a bit confusing. Maybe we should change from "unbounded queue" to "dynamically growing queue". Dynamically growing, which in the queue scenario means the same thing as unbounded, would probably be less confusing. Thoughts?

networkimprov commented 5 years ago

Agreed that the stdlib or golang.org/x repo would benefit from containers like queue and btree. It may become easier to land new containers once generics lands, tho that's a long way off yet.

Dynamic or "elastic" channel buffers have been proposed (and declined) in two forms: a channel buffer pool #20868, and unlimited channel buffer #20352. A workable alternative is one-goroutine-plus-two-channels as in this elastic channel example. It might be useful to include that in your benchmark set.

Container structures are typically unbounded/elastic, so that's unlikely to be a recurring objection here :-)

christianrpetrin commented 5 years ago

@networkimprov, thanks for posting the other channel proposals, and agreeing with the queue proposal! :-)

I looked into https://github.com/kylelemons/iq and both queue designs implemented there: the slice/ring based design as well as the dynamic channel.

I have benchmarked two high quality, similarly slice/ring based open source implementations: phf and gammazero.

I have also benchmarked channels as queue.

Impl7 is dramatically faster than all ring slice based implementations tested as well as channels.

Having said that, I tried to benchmark both implementations, but the design of both queues is so "unique" that I had trouble getting them to fit my benchmark tests.

I couldn't help but to notice the "Infinite Queue implementations in Go. These are bad news, you should always use bounded queues, but they are here for educational purposes." title on the repo.

Blanket statements like this that are based on faulty research really don't help. I wish the author had done a better job researching other design and implementation options before making such statements.

cznic commented 5 years ago

Blanket statements like this that are based on faulty research really don't help.

The above is a blanket statement because it claims some undefined research is faulty without providing any evidence ;-)

phf commented 5 years ago

Thanks for noticing my queue. I obviously agree that a queue should be in the standard library. Personally I believe that a deque is what you want in the standard library. It's usually not much more work to implement (or much more complex code-wise) but it's twice as useful and not (in principle) harder to make fast. The proposal sounds good and I wish it success. Oh, and congrats on beating my queue! :-)

wsc1 commented 5 years ago

Nice queue!

However, my feeling is that the state of affairs of containers in the standard library is such that a global assessment of what the standard library should provide for containers should come before considering inclusion based on the merits of a single implementation of a single data structure.

networkimprov commented 5 years ago

Re benchmarking kylelemons/iq, its two channels simulate a single standard channel (one is input, the other output). But you already knew that, so the issue must be the goroutine?

Re its claim, "you should always use bounded queues," that means channel queues. As channels are principally an "IGrC" mechanism, if the consumer goroutine stops, the producer/s must block on c <- x (iow feel "backpressure").

I gather the API isn't thread safe? Have you benchmarked a concurrent scenario, e.g. multiple pushers & single popper? That's where a comparison with kylelemons/iq is most relevant. (And that's how I'd use your queue.)

Related: @ianlancetaylor once voiced support for the idea of allocating a bounded channel buffer as-needed, and your list-of-slices technique might apply there. There's not a separate proposal for that as yet.

egonelbre commented 5 years ago

I remember https://github.com/karalabe/cookiejar/tree/master/collections having a decent implementation.

A good concurrent queue will need several things, i.e. lock-freedom on the happy path and non-spinning when blocking. For different concurrent situations (MPMC, SPMC, MPSC, SPSC) and for high performance, the code would also look different, since not having to manage concurrency at a particular producer/consumer helps to speed things up.

As for tests, you should also test the slowly increasing cases (i.e. 2 x Push, 1 x Pop or 3 x Push, 2 x Pop) and the slowly decreasing cases (i.e. fill the queue, then 1 x Push, 2 x Pop or 2 x Push, 3 x Pop). This document also doesn't include the stable cases (fill 1e6 items, or larger depending on the internal block size, then repeat 1 x Push, 2 x Pop for N). Large cases with 10e6 items are also missing. And the average might not be the best measurement for these; look at percentiles as well.

bcmills commented 5 years ago

Should this be a deque (double-ended queue) implementation instead?

I believe that the two-array deque is conventional in language libraries. It tends to provide better cache behavior than a linked-list approach, and a lower cost (particularly in the face of mixed push-front/push-back usage) than a single-array queue.

bcmills commented 5 years ago

Good concurrent queue will need several things i.e. lock-freedom in the happy bath, non-spinning when blocking.

“Lock-freedom” is not particularly meaningful for programs that run on multiple cores. The CPU's cache coherence protocol has many of the same properties as a lock.

bcmills commented 5 years ago

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

bcmills commented 5 years ago

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (https://github.com/golang/go/issues/15292).

uluyol commented 5 years ago

I agree that when used for concurrency, unbounded queues are a bad fit. However, unbounded queues are useful for sequential work, e.g. when doing a breadth first search. We use slices that can grow arbitrarily, maps that can grow arbitrarily, and stacks that can grow arbitrarily; why can't the programmer be trusted to use them when appropriate?

I don't understand what the push back is on the idea (though I agree that it should wait on generics). An unbounded queue is not an unbounded channel.

christianrpetrin commented 5 years ago

Thanks @egonelbre for pointing out to this https://github.com/karalabe/cookiejar/blob/master/collections/queue/queue.go queue implementation. This is an interesting design based on a dynamically growing circular slice of blocks.

I have added this queue to the bench tests. Feel free to clone the repo and run the tests to validate below results by yourself.

benchstat rawresults/bench-cookiejar.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         9.47µs ± 0%    0.00µs ± 3%   -99.99%  (p=0.000 n=19+20)
/1-4         9.52µs ± 1%    0.07µs ± 1%   -99.28%  (p=0.000 n=18+20)
/10-4        9.76µs ± 1%    0.58µs ± 0%   -94.08%  (p=0.000 n=19+20)
/100-4       11.6µs ± 1%     3.1µs ± 0%   -73.54%  (p=0.000 n=20+18)
/1000-4      29.6µs ± 2%    25.8µs ± 1%   -12.71%  (p=0.000 n=20+20)
/10000-4      231µs ± 1%     260µs ± 1%   +12.56%  (p=0.000 n=19+18)
/100000-4    3.21ms ±13%    3.07ms ± 3%    -4.41%  (p=0.000 n=18+20)

name       old alloc/op   new alloc/op   delta
/0-4         65.6kB ± 0%     0.0kB       -100.00%  (p=0.000 n=20+20)
/1-4         65.6kB ± 0%     0.0kB ± 0%   -99.93%  (p=0.000 n=20+20)
/10-4        65.6kB ± 0%     0.6kB ± 0%   -99.09%  (p=0.000 n=20+20)
/100-4       66.4kB ± 0%     3.4kB ± 0%   -94.88%  (p=0.000 n=20+20)
/1000-4      73.6kB ± 0%    25.2kB ± 0%   -65.80%  (p=0.000 n=20+20)
/10000-4      277kB ± 0%     243kB ± 0%   -12.29%  (p=0.000 n=20+20)
/100000-4    2.45MB ± 0%    2.43MB ± 0%    -0.79%  (p=0.000 n=20+20)

name       old allocs/op  new allocs/op  delta
/0-4           2.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          11.0 ± 0%      15.0 ± 0%   +36.36%  (p=0.000 n=20+20)
/100-4          101 ± 0%       107 ± 0%    +5.94%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.00%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.52%  (p=0.000 n=20+20)

This queue does have good performance for fairly large data sets, but its performance for small data sets, both speed and memory wise, is not very good.

This is a very good implementation, I really like it, but:

1) Due to its poor performance with small data sets, I would not regard this as a well balanced, general purpose queue
2) I have concerns regarding memory leaks in this queue. I don't see anywhere in the code where a just-released position in the internal slice gets set to nil
3) This queue suffers from the same problem as every other pure slice based queue: copying of existing elements when the queue needs to grow

christianrpetrin commented 5 years ago

@networkimprov kylelemons/iq is a very unique design which seems to me to have been built to prove a point: that channels do not make good unbounded queues.

I couldn't agree more with this. Channels are meant for goroutine synchronization, not to serve as a sequential data store the way a traditional queue can, which, by the way, is a core use case for queues (if you can't process fast enough, store). That's why I was actually adamant about including buffered channels in the tests: they are not meant to be used as queues. If people really want to use them as queues, then a separate proposal with a separate discussion should follow. This is not a "dynamic channel" proposal. Please keep that in mind.

Regarding the claim "you should always use bounded queues", I understood (by reading the code) that the author meant channels. My point was that the statement is not clear and bounded enough, so people reading it might think no unbounded queue, regardless of its design, implementation, performance, etc., is good. I can't agree with that.

Impl7 is not safe for concurrent use. As map and all the stdlib containers are not safe for concurrent use, the new queue should follow the same pattern. Not all users will need a safe-for-concurrent-use queue, so IMO a safe-for-concurrent-use version is a topic for another proposal/discussion, just like we have a separate sync.Map for a safe-for-concurrent-use map implementation.

I do talk a bit about the subject on the design document though.

christianrpetrin commented 5 years ago

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

That's sort of the plan. Take a look at this response. I can't apply flow control all the way to the clients. That would make the hosting system slower, affecting its clients (the human beings). Non-critical infrastructure such as logging should not affect the hosting system significantly. This is a problem that has no solution other than to either queue or discard the logs, and discarding should be the very last resort, used only when queueing is no longer possible. FWIW, CloudLogger already treats important logs very differently from regular logs, so even if one of the backend servers dies and a whole bunch of logs gets lost, that's by design. All the important logs are not long-term cached and will not be lost (in large numbers, at least).

Also, I get your concerns around using too much memory, but think of this queue as a data structure that is also highly efficient at dealing with large amounts of data, gigabytes of data. It has to be unbounded because each device has a certain amount of memory available at any given time. It should be up to the engineer using the queue to decide how much memory he/she wants to use. It's the job of the language designers to provide tools that allow engineers to handle their needs with the maximum possible efficiency, not to dictate how much memory their programs should use.

If we do have and can provide to all Go engineers a data structure flexible enough that they know they will get really good performance no matter their needs, I think that data structure brings real value. Go engineers have been building their own queues for a long time now, often with [pretty bad results](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md). That's also my point here. Go engineers are already doing "the bad" thing of using a lot of memory. The proposed solution just allows them to keep doing that in a much more efficient way.

Again, I'm not proposing a super crazy, specialized data structure here. Coming from Java and C#, I was super surprised Go doesn't offer a traditional queue implementation. Having a traditional queue implementation, especially a super flexible and fast one, also helps bring all those Java, C#, etc devs to the Go world.

christianrpetrin commented 5 years ago

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

@bcmills please consider below.

1) The queue proposed here is dramatically faster than every other queue tested. If you know of another queue implementation you feel might be a better design/implementation than the one proposed here, please let me know and I'll be extremely happy to validate it. If we find a faster, balanced queue implementation, I'm more than happy to root for that implementation to make it into the stdlib. As a matter of fact, I'd definitely use it in all my projects
2) @networkimprov mentioned generics are a long way off. Should we wait and hold off improvements until we finally decide what we want to do with generics? How long have we been talking about generics? How much longer will it take? IMO we should keep an agile/SCRUM-like approach to Go development, where small, incremental improvements are constantly built and released. That's one of the reasons Go is beating every other mainstream language out there. We should allow and foster organic growth and constant small improvements to the language. High level discussions do have their place and should exist, but they can take place in parallel and should not hold off development for very long
3) I'm very happy to implement changes to the proposed queue to make it better. All suggestions and criticism are most welcome. Especially the people who are thumbing down my proposal: I'm not seeing many comments on why the proposal is not good. Please let us know why you disagree, and reply to my comments, so we can have a constructive conversation

cznic commented 5 years ago

Slices can be used as queues, but often they're not. Sometimes a program collects items and sorts them, for example. A bounded slice is not helpful in this case. If the sorting requires more memory than is available, there's no other option, to a first approximation, than to crash.

Maps serve a different purpose, but the argument about bound/unbound is the same.

There are many other general data structures with different big-O characteristics of their particular operations. One usually carefully selects the proper one and the argument about bound/unbound is still the same.

Channels typically serve as a queue between concurrently executing producers and consumers. They have a limit that blocks producers when reached and provides useful back pressure. Making a channel unlimited loses that property.

A non-channel queue (FIFO) offers a rather limited set of operations: Put and Get (or whatever the names were chosen). If it is to be used in the scenario where a slice, map or other generic data structure can be used, then there's no real need to have or use it.

The remaining scenarios for using a FIFO are IMO mostly - if not only - where the producers and consumers execute concurrently. If a FIFO is used effectively as a channel there are two important questions to answer:

Also note that the slices and maps that exist today in the language are type generic. The proposed unbound queue seems to use interface{} typed items. That adds memory overhead and hurts cache locality. I haven't really studied the code, but in a proper benchmark, []interface{} cannot beat []int in either of those two aspects. Similarly for []interface{} vs a buffered chan int.

bcmills commented 5 years ago

If you know of another queue implementation you feel might be a better design/implementation than the one proposed here, please let me know and I'll be extremely happy to validate it.

See https://golang.org/cl/94137 (which I ought to dust off and submit one of these days).

christianrpetrin commented 5 years ago

Also, to help validate my point about the whole bunch of problematic Go queues out there, I was able to confirm the cookiejar queue actually has a memory leak. I just ran some bench tests locally with large item counts.

go test -benchmem -count 1 -benchtime 60s -timeout 60m -bench=".*Cookiejar" -run=^$
goos: darwin
goarch: amd64
pkg: github.com/christianrpetrin/queue-tests
BenchmarkCookiejar/1000000-4                2000      44099280 ns/op    24830036 B/op    1000489 allocs/op
BenchmarkCookiejar/10000000-4                200     464993059 ns/op    317408985 B/op  10004883 allocs/op
BenchmarkCookiejar/100000000-4                10    7453667009 ns/op    9649322720 B/op 100048829 allocs/op
^Csignal: interrupt
FAIL    github.com/christianrpetrin/queue-tests 1510.242s

The memory growth is not linear with the growth of work. Also, the last test (1 billion items) took so long (5+ minutes) that I had to abort it. Impl7 doesn't suffer from these problems. I would most certainly not recommend the cookiejar queue implementation until these issues are fixed.

ianlancetaylor commented 5 years ago

This is a worthy proposal, but, since generics are an active topic of discussion (https://go.googlesource.com/proposal/+/master/design/go2draft.md), I agree with @bcmills that we should not pursue this further until that discussion has settled. It will inevitably affect the API, and possibly the implementation, of any new container types in the standard library.

ianlancetaylor commented 5 years ago

Let me add that a queue implementation is just as good outside the standard library as in. There is no rush to adding queues to the standard library; people can use them effectively as an external package.

networkimprov commented 5 years ago

Especially the people who are thumbing down my proposal: I'm not seeing many comments on why the proposal is not good.

Go philosophy takes a "when in doubt, leave it out" stance, so some ppl thumb-down stuff they hardly/never use. For example, notice the thumbs on this proposal to drop complex number support: #19921. From what I've read, the maintainers don't make decisions based on thumb-votes.

christianrpetrin commented 5 years ago

@ianlancetaylor I agree the generics discussion should move forward. My only concern is how much longer it will take until a final decision is made and deployed. There's a clear need for a general purpose queue in the Go stdlib; otherwise we wouldn't see so many open source queue implementations out there. Can we get some sort of timeline on when the decision on generics will be made?

Also, I agree a queue could be deployed to any open source repo, but please read the background piece of my proposal to help you understand why I think there's a serious need for a high perf, issue free queue in the Go stdlib. To highlight the point:

Using external packages has to be done with great care. Packages in the stdlib, however, are supposed to be really well tested and performant, as they are validated by the whole Go community. A decision to use a stdlib package is an easy one, but using an open source, not well disseminated implementation is a much more complicated decision to make. The other big problem is: how do people find out which are the best, safe-to-use queue implementations? As pointed out by @egonelbre, this one looks like a really interesting queue, but it has hidden memory leaks. Most Go engineers don't have the time to validate and properly probe these queues. That's the biggest value we can provide to the whole community: a safe, performant and very easy to use queue.

networkimprov commented 5 years ago

@ianlancetaylor how about landing a queue in golang.org/x in the interim?

ianlancetaylor commented 5 years ago

There is no specific timeline for generics support.

The argument about what should be in the standard library is a large one that should not take place on this issue. I'll just mention the FAQ: https://golang.org/doc/faq#x_in_std .

cznic commented 5 years ago

There's a clear need for a general purpose queue in the Go stdlib. Otherwise we wouldn't see so many open source queue implementations out there.

There is an unknown, but rather big number of things that are not found in the stdlib which have several open source implementations. Per the above quoted logic construction, all of them should go to the stdlib. However, I think such conclusion is not true.

I still maintain the stance that

  • an unbound queue/FIFO is a bad idea in the first place
  • no new interface{} or []interface{} based container should go into the stdlib.

Reasons were given earlier.

This is to a certain extent similar to why the stdlib does not provide a backtracking regular expression engine implementation. Anyone is free to use one, but wrong solutions do not belong in the stdlib.

edit: typo

christianrpetrin commented 5 years ago

@bcmills I ported your implementation to here. Please let me know if you are fine with it. Otherwise I'll remove it.

I did comment out the goroutine synchronization code, otherwise comparing it to the proposed queue wouldn't be fair. I also needed to change the NextEvent() method a bit so it returns nil instead of blocking when there are no more items in the queue; this way I could get it to fit the tests better. Please let me know if this port looks good. I'll implement any changes you'd like and re-run the tests.

Results:

benchstat rawresults/bench-bcmills.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         2.90ns ± 1%    1.18ns ± 3%   -59.47%  (p=0.000 n=19+20)
/1-4         46.4ns ± 0%    68.3ns ± 1%   +47.27%  (p=0.000 n=18+20)
/10-4         897ns ± 1%     578ns ± 0%   -35.59%  (p=0.000 n=20+20)
/100-4       3.89µs ± 1%    3.07µs ± 0%   -21.00%  (p=0.000 n=19+18)
/1000-4      34.9µs ± 1%    25.8µs ± 1%   -26.15%  (p=0.000 n=19+20)
/10000-4      470µs ± 1%     260µs ± 1%   -44.75%  (p=0.000 n=18+18)
/100000-4    28.2ms ± 9%     3.1ms ± 3%   -89.13%  (p=0.000 n=19+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B          0.00B           ~     (all equal)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4        1.06kB ± 0%    0.60kB ± 0%   -43.61%  (p=0.000 n=20+20)
/100-4       4.86kB ± 0%    3.40kB ± 0%   -29.98%  (p=0.000 n=20+20)
/1000-4      73.5kB ± 0%    25.2kB ± 0%   -65.77%  (p=0.000 n=20+20)
/10000-4     1.03MB ± 0%    0.24MB ± 0%   -76.37%  (p=0.000 n=20+20)
/100000-4    19.3MB ± 0%     2.4MB ± 0%   -87.42%  (p=0.000 n=13+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           0.00           ~     (all equal)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          19.0 ± 0%      15.0 ± 0%   -21.05%  (p=0.000 n=20+20)
/100-4          113 ± 0%       107 ± 0%    -5.31%  (p=0.000 n=20+20)
/1000-4       1.02k ± 0%     1.02k ± 0%      ~     (all equal)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.26%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.51%  (p=0.000 n=20+20)

As with pretty much every other queue implementation tested, impl7 is very competitive or faster with small data sets and performs dramatically faster with large data sets, both speed and memory wise. Impl7 is ~9x faster for 100k items (3.1ms vs 28.2ms) and uses only ~1/8 of the memory (2.4MB vs 19.3MB). This is pretty significant.

Please consider the performance of the queue as well in your decision whether to include the queue in the stdlib or not.

ianlancetaylor commented 5 years ago

@networkimprov The argument about generics applies to golang.org/x just as much as to the standard library.

christianrpetrin commented 5 years ago

There's a clear need for a general purpose queue in the Go stdlib. Otherwise we wouldn't see so many open source queue implementations out there.

There is an unknown, but rather big number of things that are not found in the stdlib which have several open source implementations. Per the above quoted logic construction, all of them should go to the stdlib. However, I think such conclusion is not true.

I still maintain the stance, that

  • an unbound queue/FIFO is a bad idea in the first place
  • no new interface{} or []interface{} based container should go into the stdlib.

Reasons were given earlier.

This is to a certain extent similar to why the stdlib does not provide a backtracking regular expression engine implementation. Anyone is free to use one, but wrong solutions do not belong in the stdlib.

edit: typo

If you consider just the first part of my statement and not the last, your logic certainly makes sense, but you are taking my comments out of context. Please quote the entire phrase, not just the first part. In the quoted comment, I further state in the third paragraph why I believe the first sentence is true. Please don't leave that out of your quotes.

If you truly believe unbounded data structures are a bad idea, then we should propose removing slice, map, sync.Map and the list data structures from the stdlib.

bcmills commented 5 years ago

@christianrpetrin, note that your benchmarks are testing only one very specific mode of usage: one in which test.count items are pushed (possibly with ⅓ of the items popped in the process), then the remaining items are popped, then the container is thrown away.

That is certainly one way to use a queue, but it is by no means the only way. In other modes of use (for example, in time- or batch-based streams) a queue will be repeatedly emptied and refilled over time.

I think you will find that the performance characteristics of the various implementations change significantly with the mode of use.

christianrpetrin commented 5 years ago

@bcmills yep, I realized that could affect the results as well. That's why the tests that add 100 and 10k items to the queues add 2 items, remove 1; add another 2, remove 1; and so on. Please review the test code here.

I have been running bench tests full time for two weeks now. I have run all sorts of tests, including super long ones and ones that add and remove items randomly. I have yet to find a faster, balanced queue than the one proposed here in virtually any scenario.

cznic commented 5 years ago

If you consider just the first part of my statement and not the last, your logic certainly makes sense, but you are taking my comments out of context.

I've quoted in full what was claimed and the proof of the same.

If you truly believe unbounded data structures are a bad idea, then we should propose removing slice, map, sync.Map and the list data structures from the stdlib.

I've argued in https://github.com/golang/go/issues/27935#issuecomment-425975361 exactly the other way around. I wonder why you suggest otherwise.

@bcmills Yup. I don't trust a benchmark that says []interface{} can work faster than []int and/or use less memory. I don't know where the problem in the benchmark is, but except for some special cases it's an impossible result.

christianrpetrin commented 5 years ago

I'll leave the challenge here: if anyone can point out any significant issues either in the queue implementation or the tests, please let me know and I'll fix it right away.

Also if anyone thinks he/she has a better queue implementation, let me know and I'll benchmark against impl7 gladly.

But please, people: I quit my full-time job just to work on things that I absolutely love, and Go and open source software feature very high on my "things I love" list. I have spent two weeks full time working on this proposal. I see a lot of negative comments from people who apparently didn't even read the proposal entirely. Please take your time to read and analyze the whole proposal, not just one aspect of it. Again, my only interest here is to help, and I'm extremely happy to see where the holes in my research are, so please leave constructive comments.

cznic commented 5 years ago

Please note that this proposal is about extending the API of the stdlib with a particular piece of functionality. At this moment the code itself is not that important, even though having an initial implementation is nice when a proposal eventually gets adopted.

The much more important things are deciding whether the proposed functionality should or should not go in, and getting the API as right as possible. Then any suitable implementation can be adopted: yours, or possibly someone else's in the future if it's better.

But please, people: I quit my full-time job just to work on things that I absolutely love, and Go and open source software feature very high on my "things I love" list. I have spent two weeks full time working on this proposal. I see a lot of negative comments from people who apparently didn't even read the proposal entirely. Please take your time to read and analyze the whole proposal, not just one aspect of it. Again, my only interest here is to help, and I'm extremely happy to see where the holes in my research are, so please leave constructive comments.

You may have gotten more involved than necessary, but that's not anyone else's fault. Please try to remove yourself from the equation. If someone thinks something is technically correct or incorrect, they should not care how much time someone spent preparing the proposal and/or writing the code. That's not a technically valid criterion for adopting or rejecting a proposal.

Also questions were raised in this thread that still seem rather open.

christianrpetrin commented 5 years ago

@cznic agree with everything :-)

My rant was about people complaining about the proposal with arguments already clearly answered in the proposal.

The only open question I'm aware of is if we should wait for generics or not.

There are some comments about more testing. I will wholeheartedly implement all the suggested tests, but they should not block the proposal from being accepted or rejected.

Are you aware of any other open questions that are blockers?

networkimprov commented 5 years ago

@ianlancetaylor given a "worthy proposal," can you describe the rationale for restraining work in golang.org/x while generics is under construction? Because you'd maintain the pre-generic version indefinitely? If so, is that worth it in some cases?

@christianrpetrin, it might help to find a way to count usage of queue packages on Github. Regarding repeated comments that reject your core assumptions, silence can be a powerful argument :-)

I know the feeling of setting other work aside for weeks to draft a Go proposal. (More on my profile log.)

ianlancetaylor commented 5 years ago

@networkimprov Given that generics is being actively discussed, it is my opinion that we should wait for that to be finalized before the Go team takes on any responsibility for any additional container packages of any type. (I understand that @christianrpetrin is volunteering to maintain this code, but if it's in the standard library or golang.org/x then it's the Go team's responsibility.)

Merovius commented 5 years ago

My first thought when seeing the benchmarks was that you should probably benchmark against a non-generic version of Impl1 (i.e. using int rather than interface{}). On the surface it's not an apples-to-apples comparison, but given that Impl1 is trivial to implement correctly and Impl7 isn't, it still gives a valuable data point: whether the performance cost of the generic version is worth avoiding the boilerplate of a trivial hand-rolled one.

FWIW, I quickly whipped up a non-generic one and it hits the ball out of the park over Impl7, with at most a 2x memory overhead:

BenchmarkImpl1/0-4      200000000            7.86 ns/op        0 B/op          0 allocs/op
BenchmarkImpl1/1-4      20000000            57.2 ns/op        16 B/op          1 allocs/op
BenchmarkImpl1/10-4      2000000           744 ns/op         568 B/op         14 allocs/op
BenchmarkImpl1/100-4              300000          4808 ns/op        4360 B/op        108 allocs/op
BenchmarkImpl1/1000-4              30000         39089 ns/op       40744 B/op       1010 allocs/op
BenchmarkImpl1/10000-4              3000        449350 ns/op      746344 B/op      10022 allocs/op
BenchmarkImpl1/100000-4              100      17871414 ns/op    10047344 B/op     100029 allocs/op
BenchmarkImpl1x/0-4             2000000000           1.26 ns/op        0 B/op          0 allocs/op
BenchmarkImpl1x/1-4             50000000            29.5 ns/op         8 B/op          1 allocs/op
BenchmarkImpl1x/10-4             5000000           307 ns/op         248 B/op          5 allocs/op
BenchmarkImpl1x/100-4            1000000          1440 ns/op        1688 B/op          9 allocs/op
BenchmarkImpl1x/1000-4            200000         10972 ns/op       16376 B/op         11 allocs/op
BenchmarkImpl1x/10000-4            10000        152085 ns/op      351896 B/op         23 allocs/op
BenchmarkImpl1x/100000-4            1000       1452986 ns/op     4654334 B/op         30 allocs/op
BenchmarkImpl7/0-4              2000000000           1.27 ns/op        0 B/op          0 allocs/op
BenchmarkImpl7/1-4              20000000            95.5 ns/op        48 B/op          2 allocs/op
BenchmarkImpl7/10-4              2000000           845 ns/op         600 B/op         15 allocs/op
BenchmarkImpl7/100-4              300000          4609 ns/op        3400 B/op        107 allocs/op
BenchmarkImpl7/1000-4              50000         37287 ns/op       25160 B/op       1021 allocs/op
BenchmarkImpl7/10000-4              3000        416314 ns/op      242760 B/op      10161 allocs/op
BenchmarkImpl7/100000-4              300       4243316 ns/op     2427085 B/op     101569 allocs/op

(BenchmarkImpl1x is the specialized one.) Note in particular that the proposal's claim that copying slices is prohibitively expensive for very large data sets doesn't hold up to scrutiny when weighed against the far greater allocation costs that the generic implementation requires. Also note that the O(n) allocations a generic implementation needs for n pushes slow down the whole program because of increased GC pressure, whereas the non-generic version doesn't really affect the program as a whole, as the number of pointers/allocations isn't significant.

Generics might improve all of that (it's hard to say without knowing how they will be implemented), but at least for now, TBH, I'd still recommend that anyone stick with a simple slice over the proposal. ISTM that the central argument of the proposal, that a performant, correct queue implementation is hard to write, just isn't really that strong. Yes, the trivial implementation uses more memory. But not that much more, and it is significantly faster.

christianrpetrin commented 5 years ago

@Merovius thanks for spending some time analyzing the proposed solution.

Now if you are going to compare something against something else, it's absolutely critical to compare apples to apples. Otherwise you'll get biased results.

So I quickly changed impl1 and impl7 to use int instead of interface{} (locally only). Just the internal data type was changed, nothing else. I also added a few more large data set tests so we can get a better picture of how much it actually costs to copy data around for large data sets.

Ran the tests locally with below commands.

go test -benchmem -count 10 -benchtime 3s -timeout 60m -bench=".*Impl1X" -run=^$ > bench-impl1x.txt
go test -benchmem -count 10 -benchtime 3s -timeout 60m -bench=".*Impl7X" -run=^$ > bench-impl7x.txt

Results

benchstat bench-impl1x.txt bench-impl7x.txt
name           old time/op    new time/op    delta
/0-4             6.87ns ± 1%    1.16ns ± 0%   -83.11%  (p=0.000 n=10+7)
/1-4             31.6ns ± 2%    56.4ns ± 5%   +78.56%  (p=0.000 n=10+10)
/10-4             322ns ± 1%     391ns ± 4%   +21.49%  (p=0.000 n=9+9)
/100-4           2.28µs ± 2%    2.37µs ± 0%    +4.20%  (p=0.000 n=10+8)
/1000-4          19.3µs ± 2%    21.0µs ± 0%    +8.81%  (p=0.000 n=9+9)
/10000-4          213µs ± 3%     214µs ± 1%      ~     (p=0.579 n=10+10)
/100000-4        2.13ms ± 1%    2.09ms ± 0%    -1.75%  (p=0.000 n=10+8)
/1000000-4       20.7ms ± 2%    23.5ms ± 4%   +13.16%  (p=0.000 n=10+10)
/10000000-4       205ms ± 2%     224ms ± 4%    +9.13%  (p=0.000 n=10+10)
/100000000-4      2.12s ± 2%     2.28s ± 4%    +7.48%  (p=0.000 n=9+10)
/1000000000-4      145s ±35%       45s ±20%   -69.08%  (p=0.000 n=10+10)

name           old alloc/op   new alloc/op   delta
/0-4              0.00B          0.00B           ~     (all equal)
/1-4              8.00B ± 0%    40.00B ± 0%  +400.00%  (p=0.000 n=10+10)
/10-4              320B ± 0%      352B ± 0%   +10.00%  (p=0.000 n=10+10)
/100-4           2.48kB ± 0%    2.13kB ± 0%   -14.19%  (p=0.000 n=10+10)
/1000-4          24.4kB ± 0%    16.7kB ± 0%   -31.39%  (p=0.000 n=10+10)
/10000-4          432kB ± 0%     163kB ± 0%   -62.34%  (p=0.000 n=10+10)
/100000-4        5.45MB ± 0%    1.63MB ± 0%   -70.19%  (p=0.000 n=10+10)
/1000000-4       53.2MB ± 0%    16.3MB ± 0%   -69.45%  (p=0.000 n=10+10)
/10000000-4       504MB ± 0%     163MB ± 0%   -67.73%  (p=0.000 n=10+10)
/100000000-4     5.74GB ± 0%    1.63GB ± 0%   -71.67%  (p=0.000 n=10+10)
/1000000000-4    54.0GB ± 0%    16.3GB ± 0%   -69.89%  (p=0.000 n=9+10)

name           old allocs/op  new allocs/op  delta
/0-4               0.00           0.00           ~     (all equal)
/1-4               1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=10+10)
/10-4              14.0 ± 0%      15.0 ± 0%    +7.14%  (p=0.000 n=10+10)
/100-4              108 ± 0%       107 ± 0%    -0.93%  (p=0.000 n=10+10)
/1000-4           1.01k ± 0%     1.02k ± 0%    +1.09%  (p=0.000 n=10+10)
/10000-4          10.0k ± 0%     10.2k ± 0%    +1.39%  (p=0.000 n=10+10)
/100000-4          100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=10+10)
/1000000-4        1.00M ± 0%     1.02M ± 0%    +1.56%  (p=0.000 n=10+10)
/10000000-4       10.0M ± 0%     10.2M ± 0%    +1.56%  (p=0.000 n=10+10)
/100000000-4       100M ± 0%      102M ± 0%    +1.56%  (p=0.000 n=10+10)
/1000000000-4     1.00G ± 0%     1.02G ± 0%    +1.56%  (p=0.000 n=9+10)

Now, comparing apples to apples, we can see the results are actually (I'm a bit surprised) very similar for all tests except the last one. Pay attention to that one: that's the 1 billion items test. The slice-based implementation takes ~3x as long (145s vs 45s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items). I strongly encourage you to clone the repo and run the tests with large data sets as well to help validate the results. All that's needed is to add more items to the "tests" variable here.

Having said that, I really like your findings here. I think these test results provide an excellent case study for the design of generics.

But still, impl7 performs really well for small and mid-sized data sets, and crushes, for lack of a more dramatic word, pure slice-based implementations for large data sets. And this will be true regardless of the internal data type, because of the queue's unique design, not its implementation.

I mean, this is just pure logic. Impl7 never copies data around. Slice-based implementations will copy, and if the data set is large, it doesn't matter how fast the copying algorithm is: the copying has to be done and will take time. Also, managing and expanding an already gigantic slice in an overloaded system is much slower than managing many small slices.

To help clarify how impl7 can be much faster than any slice-based approach, here's what happens when a queue holding 1 billion items needs to expand to accommodate more:

Slice based implementation: a new, larger backing array is allocated and all 1 billion items are copied into it.

Impl7: a single new, small slice is allocated and linked at the tail; none of the existing items are touched.

And I still stand by my statement that building high performance queues is not an easy job. Building a well balanced queue that performs well in all scenarios requires coming up with design ideas and testing and probing different implementations. As proved again by the above bench results, pure slice-based implementations don't fit the bill.

The uniqueness of the design is in the observation that slices go to great lengths to provide predictable, indexed positions. A hash table, for instance, absolutely needs this property, but a queue does not. So impl7 completely gives up this property and focuses on what really matters: add to the end, retrieve from the front. No copying or repositioning is needed for that. So while a slice goes to great lengths to provide that functionality, the whole work of allocating new arrays and copying data around is wasted work; none of it is necessary, and it costs dearly for large data sets.

Also, it's important to highlight that in most cases the queue will likely be used with reference types (most likely user-defined structs), not primitive data types such as int. With reference types, impl7 is much faster than impl1 in all scenarios, not just the high-load ones. So your test with int is a very interesting one, but not one likely to be common in the real world.

And regarding your recommendation that people keep building their own slice-based implementations: you absolutely need to point out that this recommendation only shows good results if the queue's data type is a primitive one. So please be careful with your statements. They have to be well bounded if you don't want to mislead people.

christianrpetrin commented 5 years ago

And a bit more on how "easy" it is to build a well engineered queue: I'm absolutely sure the author of this queue, which, by the way, looks very good and has a very interesting design, is a highly competent engineer. However, his queue has some pretty serious memory leaks. Now imagine not-so-experienced engineers trying to build high performance queues. Memory leaks and terrible performance in some scenarios are likely to follow.

Merovius commented 5 years ago

Just the internal data type was changed, nothing else.

Just for the record: that biases your results towards the more complicated code. I wouldn't wrap an []int in a generic, interface{}-based API when using it as a queue; it's useless complexity. But, of course, a more complicated codebase makes more sense to wrap, because you have to amortize the implementation cost via reuse.

As I said, the point of my comparison was to pitch a complicated generic implementation against a trivial non-generic one, as the argument was specifically that it's hard to write a performant correct queue. And while comparing apples to apples makes sense if you want to bake apple pie, if all you really want is a tasty desert, there's no harm in considering a wider choice of fruit :)

I also added a few more large data set tests so we can get a better picture of how much it actually cost to copy data around for large data sets.

Out of interest, can you provide some actual numbers from your practical application? It doesn't make sense to compare queues with billions of items if actual queues never get that large in practice. It also, again, biases your tests towards a more complicated solution, as complicated solutions tend to do better asymptotically but have larger constants.

Now, of course, biasing a test towards larger numbers isn't inherently less valid than biasing towards smaller numbers. Which is why the question above about what numbers are realistic is so important. We should start from what actual, realistic requirements are and then see what tests make sense to run.

The slice based takes ~3x as much (45s vs 145s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Note that this is exclusively because of the generic API. A non-generic slice solution would allocate between 8 and 16GB. And to reiterate, my point was specifically that I don't think the benefits of a generic solution justify its cost (which is exactly this memory and allocation overhead).

I mean, this is just pure logic. Impl7 never copies data around. Slice based will copy, and if the data set is large, it doesn't matter how fast the copying algorithm is.

It actually isn't that simple. I'm not contesting that your implementation has, ideally, better asymptotic cumulative behavior. I'm not as sure as you are that it does, but I don't think that's worth discussing. The copies can be amortized, and the generally slower implementation can eat up the wins.

FTR, on my laptop the tipping point with non-generic implementations is around 10M elements (I can't test with significantly more though, because it's a laptop after all :) ). 10M elements per process is a pretty big queue.

As proved again by above bench results, pure slice based implementations doesn't fit the bill.

We obviously disagree on what has and has not been proven.

However, his queue has some pretty serious memory leaks.

FTR, that's not a memory leak. Those are generally impossible in Go code that doesn't use unsafe. And in any realistic system (where both pushes and pops happen), the memory will eventually (realistically, pretty soon) be released. Note that your own arguments for needing the queue to be unbounded already create the circumstances in which no value put in the queue is kept around for very long.

But also, that queue is another example of an, IMO, unnecessarily complicated queue. I am, still, pitching all of these against a simple, concretely typed slice, using append for push and q = q[1:] for pop. No extra API. No interface{}. The simplest possible implementation of a slice-based queue. You don't need to be a terribly experienced engineer to implement that bug-free, and you don't need to implement it generically, because there are no implementation costs to amortize.

If that performs well enough for pretty much any realistic workload, that is IMO reason enough to leave FIFO-queues out of the standard library.

christianrpetrin commented 5 years ago

@Merovius, I like your sense of humor! :-)

Now, with real-world use cases, meaning using the queue with a reference type, impl7 beats every other queue by far in terms of performance and memory. Your targeted example doesn't reflect reality. But still, even in a narrow case, impl7 is very competitive performance-wise and still uses only 1/3 of the memory of the slice-based implementation.

christianrpetrin commented 5 years ago

Out of interest, can you provide some actual numbers from your practical application? It doesn't make sense to compare queues with billions of items, if actual queues never get that large in practice. Except that it again biases your tests towards a more complicated solution, as complicated solutions tend to do better asymptotically, but have larger constants.

Unfortunately CloudLogger is not ready yet, but if you had read the proposal, you'd see why I believe a truly unbounded queue, an efficient one capable of handling large amounts of data, is highly desirable. Please make sure to read and understand the entire proposal. There's a lot more to the proposal and the queue than meets the eye. You absolutely need to read it thoroughly if you really want to come up with proper counterarguments on the queue.

Really, you find this code complex?

egonelbre commented 5 years ago

And a bit more on the "it's easy" to build a well engineered queue, I'm absolutely sure the author of this queue, which by the way, looks very good and has a very interesting design, is a highly competent engineer. However, his queue has some pretty serious memory leaks.

I don't quite understand why you consider that a memory leak. The cause in that case is repeatedly filling and emptying the queue. When you compare repeated fillings of the queue (1000x per queue), you see these numbers:

BenchmarkImpl7/0-8           1000000         1297 ns/op            0 B/op          0 allocs/op
BenchmarkCookiejar/0-8        100000        18722 ns/op        65568 B/op          2 allocs/op
BenchmarkImpl7/1-8             10000       105496 ns/op        48000 B/op       2000 allocs/op
BenchmarkCookiejar/1-8         50000        33237 ns/op        65568 B/op          2 allocs/op
BenchmarkImpl7/10-8             2000       892179 ns/op       600001 B/op      15000 allocs/op
BenchmarkCookiejar/10-8         5000       373064 ns/op       203152 B/op       9004 allocs/op
BenchmarkImpl7/100-8             300      4675674 ns/op      3400008 B/op     107000 allocs/op
BenchmarkCookiejar/100-8         500      3531698 ns/op       923153 B/op      99004 allocs/op
BenchmarkImpl7/1000-8             30     39630023 ns/op     25160058 B/op    1021000 allocs/op
BenchmarkCookiejar/1000-8         50     34455852 ns/op      8123226 B/op     999004 allocs/op
BenchmarkImpl7/10000-8             3    410618766 ns/op    242760405 B/op   10161000 allocs/op
BenchmarkCookiejar/10000-8         3    346521066 ns/op     80188901 B/op    9999006 allocs/op
BenchmarkImpl7/100000-8            1   4582102500 ns/op   2427089840 B/op  101569003 allocs/op
BenchmarkCookiejar/100000-8        1   4849534700 ns/op    801706848 B/op   99999052 allocs/op

Although, I suspect it's usually not used with interface{} but rather specialized for a particular type. As you can see, it's 2x faster in certain scenarios. I don't know which data sizes and values he optimized for, though.

egonelbre commented 5 years ago

As for a faster one for larger numbers of items, here's a quick and dirty version:

BenchmarkImpl7/0-8     5000000000         1.29 ns/op       0 B/op       0 allocs/op
BenchmarkEgon/0-8      5000000000         1.93 ns/op       0 B/op       0 allocs/op
BenchmarkImpl7/1-8       50000000       108 ns/op         48 B/op       2 allocs/op
BenchmarkEgon/1-8         1000000      6626 ns/op      18432 B/op       1 allocs/op
BenchmarkImpl7/10-8       5000000       892 ns/op        600 B/op      15 allocs/op
BenchmarkEgon/10-8        1000000      6882 ns/op      18504 B/op      10 allocs/op
BenchmarkImpl7/100-8      1000000      4708 ns/op       3400 B/op     107 allocs/op
BenchmarkEgon/100-8        500000      8936 ns/op      19224 B/op     100 allocs/op
---
BenchmarkImpl7/1000-8      100000     39952 ns/op      25160 B/op    1021 allocs/op
BenchmarkEgon/1000-8       200000     30922 ns/op      26424 B/op    1000 allocs/op
BenchmarkImpl7/10000-8      10000    409220 ns/op     242760 B/op   10161 allocs/op
BenchmarkEgon/10000-8       10000    330812 ns/op     264312 B/op   10009 allocs/op
BenchmarkImpl7/100000-8      1000   4358965 ns/op    2427087 B/op  101569 allocs/op
BenchmarkEgon/100000-8       1000   3937140 ns/op    2606335 B/op  100097 allocs/op

For counts <1000 it's slower because there it's mostly measuring construction time rather than actual Push/Pop time. For repeated fills (1000x) you should see these results:

BenchmarkImpl7/0-8       3000000          1291 ns/op            0 B/op           0 allocs/op
BenchmarkEgon/0-8        2000000          1921 ns/op            0 B/op           0 allocs/op
BenchmarkImpl7/1-8         50000        108040 ns/op        48000 B/op        2000 allocs/op
BenchmarkEgon/1-8         300000         14591 ns/op        18432 B/op           1 allocs/op
BenchmarkImpl7/10-8         5000        881587 ns/op       600001 B/op       15000 allocs/op
BenchmarkEgon/10-8         20000        278252 ns/op       256320 B/op        9010 allocs/op
BenchmarkImpl7/100-8        1000       4685733 ns/op      3400007 B/op      107000 allocs/op
BenchmarkEgon/100-8         2000       2983277 ns/op      2598344 B/op       99098 allocs/op
BenchmarkImpl7/1000-8        100      40053001 ns/op     25160064 B/op     1021000 allocs/op
BenchmarkEgon/1000-8         200      28794169 ns/op     26000129 B/op      999977 allocs/op
BenchmarkImpl7/10000-8        10     407284450 ns/op    242760470 B/op    10161000 allocs/op
BenchmarkEgon/10000-8         20     299871275 ns/op    259999664 B/op    10008766 allocs/op
BenchmarkImpl7/100000-8        1    4402977600 ns/op   2427088160 B/op   101569000 allocs/op
BenchmarkEgon/100000-8         2    2822037900 ns/op   2600012728 B/op   100096660 allocs/op

I don't think my implementation is particularly good; I suspect there are a bunch of ways to improve it, e.g. reusing free blocks for stable queue sizes. Of course, this would slow down the simpler cases.

Note: I changed the test so that it calls Pop until it returns false, instead of checking Len every time. The quick & dirty version has a slower Len to make Push and Pop faster.

Merovius commented 5 years ago

but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data is highly desirable.

I have read the proposal. I disagree with your reasoning. But that's not what I asked. I asked "what are realistic N that you care about?". i.e. assuming an unbounded queue is sensible - how large will it roughly be in practice? As we've established, the answer to that question determines what the best implementation is - an implementation that is better for large N will be worse for small N. So "as big as possible" just isn't a useful answer, because you end up worsening the performance for actual use-cases.

Really, you find this code complex?

Compared to append/q[1:]? Yeah.

bcmills commented 5 years ago

the results are actually (I'm a bit surprised) very similar for all tests, except the last one. Pay attention to that one. That's the 1bi items test. The slice based takes ~3x as much (45s vs 145s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Again, I think you are focusing on too narrow a set of benchmarks. You have a solution that performs well for very large queues that are filled and then drained in one long burst, under the assumption that peak memory usage for very large queues is what matters.

It is not at all obvious to me that very large queues filled and drained once are the dominant use-case for queues in general, or that the memory usage of the queue itself is significant (compared to the memory transitively pointed to by the queue's contents). I would want to see evidence to support both of those assumptions before adding such a specifically-optimized implementation to the standard library.

(In contrast, compare sync.Map: it is also specifically optimized for certain modes of use, but those modes of use appear repeatedly within the standard library. If we didn't have standard-library call sites that needed it, sync.Map would be more appropriate as a standalone package.)

christianrpetrin commented 5 years ago

but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data is highly desirable.

I have read the proposal. I disagree with your reasoning. But that's not what I asked. I asked "what are realistic N that you care about?". i.e. assuming an unbounded queue is sensible - how large will it roughly be in practice? As we've established, the answer to that question determines what the best implementation is - an implementation that is better for large N will be worse for small N. So "as big as possible" just isn't a useful answer, because you end up worsening the performance for actual use-cases.

Really, you find this code complex?

Compared to append/q[1:]? Yeah.

1) Quoting the proposal: "As there's no telling how much data will have to be queued as it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to be used." This means the memory used is linear in the number of requests (i.e. active users) the hosting application is handling. Of course, even if techniques such as this one are used to distribute the memory usage among many devices, under a DDoS attack, for instance, the cost incurred to auto-scale to a large number of nodes in order to hold an enormous amount of data (terabytes? petabytes?) is prohibitive. So what CloudLogger will do is state this problem very clearly and recommend that users specify a hard limit on how much memory each CloudLogger backend server should hold, as well as how many extra nodes to auto-scale to. Once those user-defined, system- and scenario-specific limits are set, CloudLogger will diligently cache the rate-limited logs up to the specified limits. After that it starts to discard the logs.

2) I think the point you are missing here is that impl7 will perform really well no matter how much memory it has to use because, again, of its design, not its implementation. So in a way, impl7 is future-proof. If, a few years from now, we have computers with exabytes of memory and some "crazy" engineers decide to build systems that use that much memory, impl7 will provide the efficiency those engineers need to handle that much data.

3) I think what you are trying to say is that this code is slightly more complicated than this one. But in no way, shape or form would I call this code complex. It reads to me as incredibly simple code that does an amazing job for what it is.