kohler / click

The Click modular router: fast modular packet processing and analysis
Other
740 stars 321 forks source link

Fix issue removing click kernel module. #407

Open neoben opened 6 years ago

neoben commented 6 years ago

The click router threads refuse to die because no one is setting the _stop_flag. There is an infinite loop in RouterThread::driver() function. Call request_stop killing router in master.cc to fix the problem.

tbarbette commented 6 years ago

Shouldn't it be already dying at that point? What sequence of actions lead your router to stay alive?

neoben commented 6 years ago

The loop inside RouterThread::driver() function breaks if __stopflag is set to true:

#if !BSD_NETISRSCHED
        // check to see if driver is stopped
        if (_stop_flag && _master->verify_stop(this))
            break;
#endif

If I'm not wrong, there is just a function setting this flag which is _requeststop() in master.hh:

inline void
Master::request_stop()
{
    for (RouterThread **t = _threads; t != _threads + _nthreads; ++t)
        (*t)->_stop_flag = true;
    // ensure that at least one thread is awake to handle the stop event
    wake_somebody();
}

Without calling _requeststop from _Master::killrouter(Router *router) , when I remove the click kernel module (using rmmod) I receive this kind of error:

[  386.600000] click: current router threads refuse to die!
[  386.610000] click: Following threads still active, expect a crash:
[  386.620000] click:   router thread pid 1916

So if __stopflag is false, RouterThread::driver() is an infinite loop and I suppose this is why _kill_routerthreads() from linuxmodule/sched.cc fails after 5 seconds and several attempts:

static int
kill_router_threads()
{
    delete placeholder_router;
    if (click_router)
    click_router->set_runcount(Router::STOP_RUNCOUNT);

    // wait up to 5 seconds for routers to exit
    unsigned long out_jiffies = jiffies + 5 * HZ;
    int num_threads;
    do {
    MDEBUG("click_sched: waiting for threads to die");
    SOFT_SPIN_LOCK(&click_thread_lock);
    num_threads = click_thread_tasks->size();
    SPIN_UNLOCK(&click_thread_lock);
    if (num_threads > 0)
        schedule();
    } while (num_threads > 0 && jiffies < out_jiffies);

    if (num_threads > 0) {
    printk(KERN_ALERT "click: current router threads refuse to die!\n");
    return -1;
    } else
    return 0;
}

I think this issue is related just to kernel space and it works fine user space. Let me know if this helps or if I'm missing something. Thanks!

tbarbette commented 6 years ago

I'm less familiar with kernel click. However request_stop should be called from adjust_runcount/set_runcount. When you unload the module, click_cleanup_sched is called, which calls kill_router_threads. kill_router_threads set the runcount to STOP_RUNCOUNT using set_runcount, that in turn should set the _stop_flag to true, no?

neoben commented 6 years ago

Yes, but _setruncount with _STOPRUNCOUNT is called just if _clickrouter is defined and it can be undefined (actually this is my case) so there is no function calling _setruncount:

    if (click_router)
    click_router->set_runcount(Router::STOP_RUNCOUNT);
tbarbette commented 6 years ago

I acknowledge the problem, I had it myself. However setting always stop_flag to true seems dangerous. If run_count is still not 0 because some references are still kept, the router should not finish to die.

Maybe you can detail your sequence of event? How did click_router got undefined without run_count being 0? That's the real issue, I think.

neoben commented 6 years ago

I will investigate more about _clickrouter getting undefined.

neoben commented 6 years ago

Anyway I discovered that the change I made was part of the trunk code before and it was removed at some point: https://github.com/kohler/click/commit/85f1315aaecac47eee0a208fbfb2996f0dc7b564#diff-fd6f8a65a460329f21fa8327a96ce179