Use priority queue in ctl::TaskQueue

gavv commented 4 years ago

Problem

ctl::TaskQueue is a thread-safe task queue, allowing to schedule task for immediate or delayed execution.

Tasks for immediate execution are stored in a lock-free FIFO. Tasks for delayed execution are stored in a sorted linked list. Thus, while we have O(1) pop, the insert operation into a sorted list is O(N), which does not scale to big numbers of tasks.

The typical data structure used in such cases is a priority queue implemented using some heap variation.

Solution

Select a priority queue algorithm meeting our requirements.
Add an implementation to roc_core and cover it with unit tests.
Use it instead of core::List in ctl::TaskQueue.
Add benchmarks for schedule_at and reschedule_at and compare results for old and new implementations.

Requirements

Requirements to the priority queue implementation:

It should be intrusive, i.e. it should not perform allocations by itself, but instead should use data embedded into nodes. The same approach is used for core::List, core::MpscQueue, and core::SharedPtr. The limitation of such approach is that a node can be added only to one container, but this is okay for us.
It should have O(1) top/pop.
Ideally, it should have amortized O(1) insert, but O(log n) is also acceptable.
It should have O(1) membership check (check whether the element belongs to the queue). This can be implemented by just storing a pointer to the containing queue in each node, like we do for core::List.
Whenever possible, the implementation should be compact and readable.
It should NOT be thread-safe. Access to it is already serialized by ctl::TaskQueue.

Notes

Other TaskQueue optimizations:

362
380

Ewaolx commented 4 years ago

Hi, it seems like all the variants of priority queue has O(log n) delete operation(pop) and some of the variants has O(1) insert operation. So, does this requires O(1) delete operation?

gavv commented 4 years ago

@Ewaolx Indeed, thanks for noticing. According to this thread, when choosing between various heaps and trees, we'll have to pay O(log N) either for insert() or for pop(), which makes sense.

In our case, ctl::TaskQueue is used to schedule background low-priority work from real-time threads. Hence, if we should choose between faster insert() or faster pop(), insert() is preferred, since it's usually called when a task is scheduled from real-time thread, and pop() is usually called from TaskQueue background thread.

Ewaolx commented 4 years ago

Sounds good. In this case, we may have something like Fibonnaci heap which provides us insert in O(1), and O(log N) for pop operation. Also, I see currently sorted linked list is being used for core:List in ctl::TaskQueue, so that would be replaced by heap and consequently we will have heap insert and delete instead of linked list's, right?

gavv commented 4 years ago

In this case, we may have something like Fibonnaci heap which provides us insert in O(1), and O(log N) for pop operation.

Looks like a good option! Are you going to work on this?

Except theoretical big-O complexity, we should also ensure that the actual task insertion performance wont hurt in the good case, where there are not much pending tasks. So we'll need a small benchmark for TaskQueue schedule.

Quick look at the Fibonacci heap algorithm suggest that it should be okay.

Also, I see currently sorted linked list is being used for core:List in ctl::TaskQueue, so that would be replaced by heap and consequently we will have heap insert and delete instead of linked list's, right?

Yes. IIRC, we'll need to replace three methods: fetch_sleepingtask, insert_sleepingtask, remove_sleepingtask.

Ewaolx commented 4 years ago

Okay, got it. Thanks for the update.

Yes, I can start working on this. I see roc_ctl is under develop branch, and not on master. So are the guidelines same for building and running for develop branch?

gavv commented 4 years ago

Great! Develop branch is quite ahead of master and the guidelines were changed. They're not published on the web site (it's built from master), but you can find them in git:

MBkkt commented 4 years ago

something like Fibonnaci heap

I don't recommend using this algorithm, it's good in theory, but in practice there is a huge constant. I advise you to try a regular binary heap, and only if the performance turns out to be insufficient, you should look for optimizations

gavv commented 4 years ago

@MBkkt thanks for the notice!

I haven't work much with heaps by myself, but quick googling and even Wikipedia article on Fibonacci heap confirms your point.

I feel that Binary heap isn't very well for us because it has O(log n) insertion, while in our specific case insertion speed is the most critical, since tasks are inserted from latency-sensitive threads.

Wikipedia and this page also suggest Pairing heap as a better (faster in practice) alternative to Fibonacci heap. Actually even theoretically it seems better for our case because Pairing heap insertion seems to be O(1) even in the worst case, while Fibonacci heap insertion is only amortized O(1). The implementation of the Paring heap also looks simpler, arguably.

@Ewaolx what do you think? How far did you get on the Fibonacci heap implementation?

MBkkt commented 4 years ago

I understand the concern, but what is the order of the number n? Are you sure the insert will be the bottleneck? In general, given that you want an intrusive queue, probably a binary heap is really not suitable.

Ewaolx commented 4 years ago

@gavv Sorry for late reply, but as you mentioned I also think if we want strictly O(1) amortized for insertion then binary heap would not work. Also, pairing heap sounds like a good option if we are concerned about the Fibonnaci's practical performance. We can go ahead with pairing heap in this case.

Let me know what do you think. Thanks.

gavv commented 4 years ago

I understand the concern, but what is the order of the number n? Are you sure the insert will be the bottleneck? In general, given that you want an intrusive queue, probably a binary heap is really not suitable.

Usually the number of tasks should be low. However, the key point here is that it can grow uncontrollably during load peaks, and at the same time we don't want latency to grow uncontrollably.

We don't optimize every single piece of code, but try to follow design principles which would keep us from ending up with code that we're unable to optimize at all.

One of such principles chosen in this project is that delays in one thread (no matter of their reason) should not cause delays in other threads. Such design decouples threads in terms of latency and delays, so when it comes to performance problems, you can review threads in isolation, which is much much simpler.

TaskQueue is used for communication between pipeline and control threads. If it's not O(1), then delays in control thread will lead to accumulating tasks, thus causing longer insertion times, thus probably causing delays in pipeline threads. And probably not, who knows. You'll have to profile it each time you're debugging glitches.

So it's very attractive just to use O(1) data structure and be 100% sure that TaskQueue insertion always takes fixed time and control thread is guaranteed not to affect pipeline thread. Then, if you hear glitches, you can just profile pipeline thread in isolation.

As for the specific numbers, as mentioned in the issue, it would be nice to add benchmarks and measure insertion times on different queue sizes.

gavv commented 4 years ago

@gavv Sorry for late reply, but as you mentioned I also think if we want strictly O(1) amortized for insertion then binary heap would not work. Also, pairing heap sounds like a good option if we are concerned about the Fibonnaci's practical performance. We can go ahead with pairing heap in this case.

Let me know what do you think. Thanks.

Yeah, let's try a pairing heap then.

ashishsiyag commented 4 years ago

Hey @Ewaolx are you working on pairing heap implementation ?

Hassan-A commented 4 years ago

Hello @gavv , this seems inactive. Mind if I try as well?

gavv commented 4 years ago

Mind if I try as well?

Hi, sure.

divyavit commented 4 years ago

Hi, I was interested in working on it and wanted to give it a try.. has it already been done ?

Hassan-A commented 4 years ago

Hi @divyavit , still in progress.

roc-streaming / roc-toolkit