kohler / click

The Click modular router: fast modular packet processing and analysis
Other
747 stars 321 forks source link

Race condition with multithread task scheduling #328

Closed bcronje closed 7 years ago

bcronje commented 7 years ago

I've picked up an issue where the following task invariant does not always hold true:

If _is_scheduled && !_is_strong_unscheduled && not on quiescent thread, then on_scheduled_list() || on_pending_list().

The issue can be reliably replicated when configuring userlevel Click with ./configure --disable-linuxmodule --enable-user-multithread --enable-schedule-debugging and running the following configuration:

mt.click

q::ThreadSafeQueue(2);
is::InfiniteSource(\<0800>)
-> rr::RoundRobinSwitch;

rr[0] -> nq1::QuickNoteQueue(1) -> uq1::Unqueue -> q;
rr[1] -> nq2::QuickNoteQueue(1) -> uq2::Unqueue -> q;

q -> uq3::Unqueue -> c1::Counter -> Discard;

StaticThreadSched(is 0,uq1 1, uq2 2, uq3 2);

Script (printn "c1 Bit Rate ",
        printn c1.bit_rate,
        printn " Count ",
        print c1.count,
        wait 2,
        loop);

Running the above configuration with: click mt.click -j3 -p 13000 invariably ends up with no packets being pushed to q, even though there are packets in nq1 and nq2, while uq1 task and uq2 task indefinitely ends up with state _is_scheduled but !on_scheduled_list() && !on_pending_list():

bcronje@ubuntu-dev:~/Projects/conf$ click mt.click -j3 -p 13000 c1 Bit Rate 6448500 Count 577459 Bytes 1154918 c1 Bit Rate 1059000 Count 745463 Bytes 1490926 c1 Bit Rate 1058500 Count 831510 Bytes 1663020 snip c1 Bit Rate 1018500 Count 2081564 Bytes 4163128 c1 Bit Rate 822000 Count 2170619 Bytes 4341238 c1 Bit Rate 0 Count 2203499 Bytes 4406998 c1 Bit Rate 0 Count 2203499 Bytes 4406998

Task states as per read handlers:

telnet 127.0.0.1 13000 snip read uq1.scheduled 200 Read handler 'uq1.scheduled' OK DATA 22 true / not on list / read uq2.scheduled 200 Read handler 'uq2.scheduled' OK DATA 22 true / not on list / read nq1.length 200 Read handler 'nq1.length' OK DATA 1 1 read nq2.length 200 Read handler 'nq2.length' OK DATA 1 1

If I write uq1.scheduled true and write uq2.scheduled true then packet processing resumes again.

At this point I think what happens is that when task A on thread X fires, and is not fast reschuled in the task hook, task A is removed from the scheduled list in RouterThread::run_tasks(), while thread Y is calling task A reschedule() but before thread X has completed the removal from the scheduled list. Which causes thread Y to not add the task to the scheduled list, after which thread X completes the removal of the task from the scheduled list:

From task::reschedule():

  if (thread->current_thread_is_running()
      && _owner->router()->_running >= Router::RUNNING_BACKGROUND) {
      if (!on_scheduled_list()) // <---- At this point task is still on the scheduled list, but about to be removed from RouterThread::run_tasks()
          fast_schedule();
  }

Do you guys agree with my conclusion? Or do you think the issues is somewhere else? Any idea where/how best to address the issue?

bcronje commented 7 years ago

Ok some minor progress on this, it seems to be a regression bug. @ahenning pointed out to me that he was unable to replicate the issue on an older version of Click he is running. After some testing it appears that commit f40363b introduced the bug, as I am unable to replicate it before this commit.

Incidentally it also appears commit f40363b has some performance regression, at least when running the above config:

image

kohler commented 7 years ago

Looking at this…

bcronje commented 7 years ago

Thank you Eddie. Are you able to reproduce on your side?

kohler commented 7 years ago

Yes, I am, and I have what I believe is a fix. Looking at potential performance regression.

kohler commented 7 years ago

On an AWS test machine, the commit before f40363b got ~82000 pps on this config, whereas with d8b39ad we get ~120000 pps. I think this is fixed. :( Thanks for reporting the problem.

kohler commented 7 years ago

Ugh, I think 1d05ca31574ae3b8a21c28c8ea6ec98704f0d553 addressed this problem, I forget why it didn't get merged

bcronje commented 7 years ago

@kohler thank you so much for looking into this. I tested and your fix works perfectly. Race condition is gone and performance back on par, great stuff!