ut-parla / parla-experimental


Performance analysis: thread wait states. #77

Open nicelhc13 opened 1 year ago

nicelhc13 commented 1 year ago

I tried to understand performance better. As the simplest case, I started from independent tasks. Most of this analysis is from NVTX results: [attached NVTX timeline screenshot]

This run is 1000 tasks of 16 ms each, with data movement (I will analyze the non-data-movement case later; since I already had a baseline for this setting, I just reused it). The first interesting thing is that all tasks' average times were around 16 ms, so we are avoiding the GIL effectively. In the past, these times were highly variable. The second interesting thing is that the worker threads' wait() took the majority of the time. It is hard to measure how much this wait() actually affected the total time since the waits overlap, but the other parts are negligible, so it is worth a look.

One possible idea is to define and implement a per-worker ready task queue. Our current mechanism always enqueues a task to a free worker thread. Instead, how about this (a minimal sketch follows the list)?

1) All worker threads are busy.
2) The scheduler assigns the task to a busy thread.
   2-1) The busy threads are maintained internally by the scheduler, like the free worker thread queue. The first item in that queue should be the target thread, since it was assigned work earlier than the others and is therefore the most likely to finish its current task first.
   2-2) When the worker thread finishes its old task, it checks this queue before it goes into the wait() state.
3) The scheduler does not pop the tasks assigned to busy threads; it holds them first.
4) If any free thread becomes available, the scheduler pops such a task and assigns it to that thread.
5) If the busy thread finishes before another thread becomes available, it pops the task from the scheduler's queue and launches it without entering the wait state.
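A minimal sketch of this idea, using hypothetical `Scheduler`/worker names rather than Parla's actual classes: tasks tentatively placed on a busy worker stay in a `held` map, so whichever side gets there first (a newly freed worker, or the busy worker finishing its current task) claims them.

```python
import collections
import threading

class Scheduler:
    """Hypothetical sketch, not Parla's real scheduler."""

    def __init__(self):
        self.lock = threading.Lock()
        self.free_workers = collections.deque()
        self.busy_workers = collections.deque()  # ordered by assignment time (step 2-1)
        self.held = {}                           # task -> busy worker it was tentatively given to

    def enqueue(self, task):
        with self.lock:
            if self.free_workers:
                # Current behavior: hand the task to a free worker and notify it.
                self.free_workers.popleft().launch(task)
            else:
                # Steps 2 / 2-1 / 3: tentatively place it on the least recently
                # assigned busy worker, but keep it "held" by the scheduler.
                worker = self.busy_workers[0]
                worker.ready.append(task)
                self.held[task] = worker

    def on_worker_free(self, worker):
        with self.lock:
            if self.held:
                # Step 4: a worker freed up first, so steal a held task back.
                task, owner = self.held.popitem()
                owner.ready.remove(task)
                worker.launch(task)
            else:
                self.free_workers.append(worker)

    def claim(self, task):
        # Step 5: a busy worker finished and found `task` in its own ready
        # queue; True means no other worker stole it, so it can launch the
        # task directly without going through wait().
        with self.lock:
            return self.held.pop(task, None) is not None
```

The worker side would then check its `ready` queue and call `claim()` before dropping into `wait()`; the concurrency question is exactly the race between `claim()` and `on_worker_free()`, which a single scheduler lock resolves in this sketch.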

This part requires more thought about concurrency between the scheduler, busy worker threads, and future free threads. But I believe there is a way to do it, and I will think about it further if @dialecticDolt agrees with this high-level idea.

I am not sure how much this will affect performance, but wait() is genuinely expensive.

wlruys commented 1 year ago

The correct time to measure isn't wait() but the "notify worker" to "worker wake-up" delay. Measuring wait() itself doesn't say much because it encapsulates all of the scheduler's behavior (a worker is supposed to be waiting until it is notified, and some workers are never notified because they are unused, so they have very long wait times).
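For example, that delay could be measured with a pair of timestamps around the condition variable, roughly like this (hypothetical names, not Parla's internals):

```python
import threading
import time

class Worker:
    """Sketch of measuring the notify -> wake-up delay, not Parla's worker."""

    def __init__(self):
        self.cv = threading.Condition()
        self.task = None
        self.notify_ts = None     # set by the scheduler just before notify()
        self.wakeup_delays = []   # notify -> wake-up latencies, in seconds

    def assign(self, task):
        # Scheduler side: stamp the notify time, then wake the worker.
        with self.cv:
            self.task = task
            self.notify_ts = time.perf_counter()
            self.cv.notify()

    def run_one(self):
        with self.cv:
            while self.task is None:
                self.cv.wait()    # total wait() time also counts idle workers
            # Only this gap reflects the notify -> wake-up latency.
            self.wakeup_delays.append(time.perf_counter() - self.notify_ts)
            task, self.task = self.task, None
        task()                    # run the task outside the lock
```

Comparing the distribution of `wakeup_delays` against the 16 ms task time would separate wake-up latency from the expected idle waiting of unused workers.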

wlruys commented 1 year ago

But the high-level idea of having a worker run its next task before it drops the GIL might help decrease this delay? It's hard to say because I would increase the amount of time other workers are waiting to start.

nicelhc13 commented 1 year ago

Yes, wait()'s effect is hard to quantify. It is not only the GIL but also the wait() function itself: it could cause a context switch, and that might be critical if a task is very small.

> It's hard to say because I would increase the amount of time other workers are waiting to start.

Hmm, I don't get this idea (maybe we discussed it before, but I forgot). It would be great if we could discuss it.

wlruys commented 1 year ago

Typo!

Meant: It's hard to say because it would increase the amount of time other workers are waiting to start.

Sorry!

nicelhc13 commented 1 year ago

Let me copy and paste the Slack message so I don't forget about this. I think it depends. If my idea is correct, we don't lose the benefits of the current free-thread-pool-only mechanism. My idea is just to put tasks on busy worker threads when no free thread exists. If any thread becomes free, then, like the current mechanism, we steal that task and assign it to the free thread immediately. The priority of this work depends on its overhead. The reason I raised it is that our runtime already did well on synthetic workloads.