gnuradio / gnuradio

GNU Radio – the Free and Open Software Radio Ecosystem
https://gnuradio.org
GNU General Public License v3.0

Calling block::set_processor_affinity() will cause program to crash after the block was stopped. #7528

Open EliteWeapon opened 2 weeks ago

EliteWeapon commented 2 weeks ago

What happened?

Calling set_processor_affinity() on a block that has been stopped crashes the program.

System Information

OS: Linux Distro GR Installation Method: Source

GNU Radio Version

3.10 (maint-3.10)

Specific Version

3.10.11.0

Steps to Reproduce the Problem

  1. Make a top block with one or more sub-blocks, then start it.
  2. Stop the top block in order to modify some sub-blocks.
  3. Call set_processor_affinity on the top block. This step crashes the program.

Relevant log output

No response

marcusmueller commented 2 weeks ago

Can't reproduce on main with

#!/usr/bin/env python3
from gnuradio import gr
from gnuradio import blocks
from time import sleep

tb = gr.top_block()

src = blocks.null_source(1)
sink = blocks.null_sink(1)

tb.connect(src, sink)

tb.start()

sleep(0.5)

tb.stop()

tb.set_processor_affinity([0, ])

so I'm now building 3.10.11.0 from source on Fedora 40 x86_64 to test there.

@EliteWeapon any specific info on how it crashes, maybe even a debugger backtrace? Does the above also crash for you?

marcusmueller commented 2 weeks ago

Can't reproduce on v3.10.11.0 either. @EliteWeapon, I'm afraid that puts the onus on you to give us a minimal example that reproduces the issue, along with more information about your platform.

EliteWeapon commented 2 weeks ago

I checked the source code:

  1. When a top block is started, the members 'threaded' and 'thread' of each sub-block's detail are set to valid values in the function tpb_thread_body().

    
    tpb_thread_body::tpb_thread_body(block_sptr block,
                                 gr::thread::barrier_sptr start_sync,
                                 int max_noutput_items)
    : d_exec(block, max_noutput_items)
    {
    #if defined(_MSC_VER) || defined(__MINGW32__)
    #include <windows.h>
    thread::set_thread_name(GetCurrentThread(),
                            block->name() + std::to_string(block->unique_id()));
    #else
    thread::set_thread_name(pthread_self(),
                            block->name() + std::to_string(block->unique_id()));
    #endif
    
    block_detail* d = block->detail().get();
    block_executor::state s;
    pmt::pmt_t msg;
    
    d->threaded = true;
    d->thread = gr::thread::get_current_thread_id();
    
    ...
    
    start_sync->wait();
    while (1) {
        boost::this_thread::interruption_point();
        ...
     }
    }
  2. Then we stop the top block. As a result, the sub-block's underlying thread is destroyed.

    void top_block_impl::stop()
    {
        ...
        if (d_scheduler)
            d_scheduler->stop();
        ...
    }

    void scheduler_tpb::stop() { d_threads.interrupt_all(); }

  3. But the sub-block detail's members 'threaded' and 'thread' are left unchanged and are now invalid.
  4. Next, we call block::set_processor_affinity() on the sub-block, which in turn calls block_detail::set_processor_affinity():

    void block_detail::set_processor_affinity(const std::vector<int>& mask)
    {
        if (threaded) {
            try {
                gr::thread::thread_bind_to_processor(thread, mask);
            } catch (std::runtime_error& e) {
                d_logger->error("set_processor_affinity: invalid mask.");
            }
        }
    }

Because the variable 'thread' is invalid, gr::thread::thread_bind_to_processor() crashes the program.
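
The lifecycle in steps 1 to 4 can be modeled without GNU Radio. The following is only a minimal sketch, not GNU Radio code: `mock_detail`, `worker_body`, and `would_bind` are hypothetical stand-ins for `block_detail`, `tpb_thread_body`, and the `if (threaded)` check in `block_detail::set_processor_affinity()`, and a plain function call stands in for the worker thread.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for the two block_detail members discussed above.
struct mock_detail {
    bool threaded = false;
    std::thread::id thread{}; // a default-constructed id refers to no thread
};

// Models tpb_thread_body: publishes the thread info on entry and, like the
// code quoted above, never clears it when the body returns.
void worker_body(mock_detail& d)
{
    d.threaded = true;
    d.thread = std::this_thread::get_id();
    assert(d.threaded);
    // ... the work loop would run here until the scheduler interrupts it ...
}

// Models the check in block_detail::set_processor_affinity(): returns true
// if the code would go on to bind d.thread to a CPU set.
bool would_bind(const mock_detail& d) { return d.threaded; }
```

After `worker_body(d)` has returned, `would_bind(d)` still returns true although the body is finished, which is the stale state described in step 3.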

I found a way to fix this bug:
1) Add an RAII guard for the block detail's thread information.
2) The guard's destructor runs when the block's underlying thread exits, which ensures the thread information is reset at that point.

    // guard for thread information of block.detail
    class block_thread_info_guard
    {
    public:
        block_thread_info_guard(const block_sptr& block)
        {
            d_block = block;

            if (d_block) {
                block_detail* d = d_block->detail().get();
                if (d) {
                    d->threaded = true;
                    d->thread = gr::thread::get_current_thread_id();
                }
            }
        }

        ~block_thread_info_guard()
        {
            if (d_block) {
                block_detail* d = d_block->detail().get();
                if (d) {
                    d->threaded = false;
                    d->thread = gr::thread::INVALID_THREAD_ID;
                }
            }
        }

    protected:
        block_sptr d_block;
    };

    tpb_thread_body::tpb_thread_body(block_sptr block,
                                     gr::thread::barrier_sptr start_sync,
                                     int max_noutput_items)
        : d_exec(block, max_noutput_items)
    {
    #if defined(_MSC_VER) || defined(__MINGW32__)
    #include <windows.h>
        thread::set_thread_name(GetCurrentThread(),
                                block->name() + std::to_string(block->unique_id()));
    #else
        thread::set_thread_name(pthread_self(),
                                block->name() + std::to_string(block->unique_id()));
    #endif

        block_detail* d = block->detail().get();
        block_executor::state s;
        pmt::pmt_t msg;

        // Use RAII to set and clear block's thread information
        block_thread_info_guard guard(block);

        ...

        start_sync->wait();
        while (1) {
            boost::this_thread::interruption_point();
            ...
        }
    }
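
The guard idea can be tried in isolation. Below is a minimal standalone sketch under stated assumptions, not the proposed patch itself: `mock_detail`, `thread_info_guard`, `interrupted`, and `run_body_until_interrupted` are hypothetical names, and the `interrupted` exception stands in for what `boost::this_thread::interruption_point()` throws when the thread is interrupted. It shows that the destructor resets the state even when the body is left through an exception.

```cpp
#include <cassert>
#include <exception>
#include <thread>

// Hypothetical stand-in for the two block_detail members.
struct mock_detail {
    bool threaded = false;
    std::thread::id thread{};
};

// Sets the thread info on construction and clears it on destruction.
class thread_info_guard
{
public:
    explicit thread_info_guard(mock_detail& d) : d_detail(d)
    {
        d_detail.threaded = true;
        d_detail.thread = std::this_thread::get_id();
    }
    ~thread_info_guard()
    {
        d_detail.threaded = false;
        d_detail.thread = std::thread::id{};
    }
    thread_info_guard(const thread_info_guard&) = delete;
    thread_info_guard& operator=(const thread_info_guard&) = delete;

private:
    mock_detail& d_detail;
};

// Hypothetical stand-in for the interruption exception.
struct interrupted : std::exception {};

// Models a thread body that is left through an exception. Returns whether
// the info was marked valid while the body was running.
bool run_body_until_interrupted(mock_detail& d)
{
    bool was_threaded_inside = false;
    try {
        thread_info_guard guard(d);
        assert(d.threaded);
        was_threaded_inside = d.threaded;
        throw interrupted{}; // stands in for an interruption point firing
    } catch (const interrupted&) {
        // the guard's destructor has already run during stack unwinding
    }
    return was_threaded_inside;
}
```

Calling `run_body_until_interrupted(d)` on a fresh `mock_detail d` returns true, and afterwards `d.threaded` is false again, so a later affinity call would skip the binding step.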

marcusmueller commented 2 weeks ago

Thanks for the extensive look and proposed solution. Interestingly I still can't make the thing crash, but your reasoning does seem valid.

marcusmueller commented 2 weeks ago

marking as medium (because impact is limited in number of users, but still seems to be serious enough)

marcusmueller commented 2 weeks ago

I really wonder why this crashes on your machine but not on mine; my guess is that on my machine, the thread lingers on? Or I have a different libc? Can you please give me some insight into which platform specifically you're on?
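
Since the libc in use may matter here, a small helper along these lines can report it. This is only a sketch: `libc_description` is a hypothetical name, and the glibc branch relies on `gnu_get_libc_version()` from `<gnu/libc-version.h>`; with any other C library it only says that glibc is not in use.

```cpp
#include <cassert>
#include <climits> // on glibc, any libc header defines __GLIBC__
#include <string>
#ifdef __GLIBC__
#include <gnu/libc-version.h>
#endif

// Hypothetical helper: returns a short description of the C library in use.
std::string libc_description()
{
#ifdef __GLIBC__
    // Version of the glibc that the program is running against.
    const char* version = gnu_get_libc_version();
    assert(version != nullptr);
    return std::string("glibc ") + version;
#else
    return "not glibc";
#endif
}
```

Printing the result of `libc_description()` on both machines would show whether the two setups differ in this respect.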

EliteWeapon commented 1 week ago

@marcusmueller

I wrote the following test program in C++:

#include <gnuradio/types.h>
#include <gnuradio/top_block.h>
#include <gnuradio/blocks/null_source.h>
#include <gnuradio/blocks/null_sink.h>
#include <thread>
#include <chrono>

int main() {

    auto tp_blk = gr::make_top_block("my_top_block");

    auto src_blk = gr::blocks::null_source::make(1);
    auto snk_blk = gr::blocks::null_sink::make(1);

    tp_blk->connect(src_blk, 0, snk_blk, 0);

    tp_blk->start();

    std::this_thread::sleep_for(std::chrono::seconds(1));

    tp_blk->stop();
    tp_blk->wait();

    tp_blk->set_processor_affinity(gr_vector_int{0, 1});

    return 0;
}

The output is:

block_detail :error: set_processor_affinity: invalid mask.
block_detail :error: set_processor_affinity: invalid mask.

The logs above appeared after ' tp_blk->set_processor_affinity(gr_vector_int{0, 1}) ' was executed.

Next, I checked the source code where the messages are logged:

void block_detail::set_processor_affinity(const std::vector<int>& mask)
{
    if (threaded) {
        try {
            gr::thread::thread_bind_to_processor(thread, mask);
        } catch (std::runtime_error& e) {
            d_logger->error("set_processor_affinity: invalid mask.");
        }
    }
}

Because the member 'thread' of block_detail is invalid after the block was stopped, an exception was thrown and then caught.

In this demo, the program didn't crash but logged some errors.
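
The fixed "invalid mask" text can be misleading here, since the problem in this run is the stale thread handle rather than the mask. Below is a minimal sketch of the same try/catch shape that forwards the underlying error text instead. `bind_thread` and `describe_bind` are hypothetical names, and passing `ESRCH` is an assumption made for illustration, not an error code taken from the run above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a binding call: turns a nonzero pthread-style
// return code into an exception that carries the code's description.
void bind_thread(int rc)
{
    if (rc != 0)
        throw std::runtime_error(std::string("thread bind failed: ") + std::strerror(rc));
}

// Same shape as the catch block quoted above, but the message includes
// e.what() instead of a fixed "invalid mask" string.
std::string describe_bind(int rc)
{
    try {
        bind_thread(rc);
        return "ok";
    } catch (const std::runtime_error& e) {
        return std::string("set_processor_affinity: ") + e.what();
    }
}
```

With this shape, `describe_bind(0)` returns "ok", while `describe_bind(ESRCH)` returns a message that names the actual failure.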

EliteWeapon commented 1 week ago

But in another test, I used gqrx-2.17.5 + gr-osmosdr-0.2.6 to reproduce the error.

I modified the receiver::stop() function in receiver.cpp as follows:

void receiver::stop()
{
    if (d_running)
    {
        tb->stop();
        tb->wait(); // If the graph is needed to run again, wait() must be called after stop
        d_running = false;

        /********* next line was newly added  *********/
        tb->set_processor_affinity(gr_vector_int{0, 1});
    }
}

Then I used a file source block to make the program work.

When I pushed the button to start and then stop the DSP processing, the program crashed:

jwp@bh-jwp:~/project_work/gqrx-2.17.5/build/src$ gdb ./gqrx
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gqrx...
(gdb) r
Starting program: /home/jwp/project_work/gqrx-2.17.5/build/src/gqrx 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Warning: Ignoring XDG_SESSION_TYPE=wayland on Gnome. Use QT_QPA_PLATFORM=wayland to run on Wayland anyway.
[New Thread 0x7ffff0fc4640 (LWP 88113)]
[New Thread 0x7fffeb49c640 (LWP 88114)]
[New Thread 0x7fffeac7c640 (LWP 88115)]
[New Thread 0x7fffea442640 (LWP 88116)]
[New Thread 0x7fffe9c41640 (LWP 88117)]
[New Thread 0x7fffe8faf640 (LWP 88118)]
[New Thread 0x7fffcffff640 (LWP 88120)]
[Thread 0x7fffcffff640 (LWP 88120) exited]
gr-osmosdr 0.2.0.0 (0.2.0) gnuradio 3.10.11.0
built-in source types: file rtl rtl_tcp rfspace airspyhf redpitaya 
rx_nb_cc :info: set_min_output_buffer on block 10 to 32768
Resampling audio 96000 -> 48000
[New Thread 0x7fffcffff640 (LWP 88121)]
BandPlanFile is /home/jwp/.config/gqrx/bandplan.csv
BookmarksFile is /home/jwp/.config/gqrx/bookmarks.csv
[New Thread 0x7fffd0bff640 (LWP 88122)]
[Thread 0x7fffd0bff640 (LWP 88122) exited]
[New Thread 0x7fffd0bff640 (LWP 88123)]
[Thread 0x7fffd0bff640 (LWP 88123) exited]
[New Thread 0x7fffd0bff640 (LWP 88124)]
[Thread 0x7fffeac7c640 (LWP 88115) exited]
gr-osmosdr 0.2.0.0 (0.2.0) gnuradio 3.10.11.0
built-in source types: file rtl rtl_tcp rfspace airspyhf redpitaya 
[Thread 0x7fffcffff640 (LWP 88121) exited]
[New Thread 0x7fffcffff640 (LWP 88125)]
[New Thread 0x7fffeac7c640 (LWP 88126)]
[New Thread 0x7fffcd7ff640 (LWP 88127)]
[New Thread 0x7fffccae4640 (LWP 88128)]
[New Thread 0x7fffa8fff640 (LWP 88129)]
[New Thread 0x7fff9bfff640 (LWP 88130)]
[New Thread 0x7fff9b7fe640 (LWP 88131)]
[New Thread 0x7fff9affd640 (LWP 88132)]
[New Thread 0x7fff9a7fc640 (LWP 88133)]
[New Thread 0x7fff99ffb640 (LWP 88134)]
[New Thread 0x7fff997fa640 (LWP 88135)]
[New Thread 0x7fff98ff9640 (LWP 88136)]
[New Thread 0x7fff77fff640 (LWP 88137)]
[New Thread 0x7fff777fe640 (LWP 88138)]
[New Thread 0x7fff76ffd640 (LWP 88139)]
[New Thread 0x7fff767fc640 (LWP 88140)]
[New Thread 0x7fff75ffb640 (LWP 88141)]
[New Thread 0x7fff757fa640 (LWP 88142)]
[New Thread 0x7fff74ff9640 (LWP 88143)]
[New Thread 0x7fff57fff640 (LWP 88144)]
[New Thread 0x7fff577fe640 (LWP 88145)]
[New Thread 0x7fff56ffd640 (LWP 88146)]
[New Thread 0x7fff567fc640 (LWP 88147)]
[Thread 0x7fff567fc640 (LWP 88147) exited]
[Thread 0x7fff56ffd640 (LWP 88146) exited]
[Thread 0x7fff577fe640 (LWP 88145) exited]
[Thread 0x7fff57fff640 (LWP 88144) exited]
[Thread 0x7fff74ff9640 (LWP 88143) exited]
[Thread 0x7fff757fa640 (LWP 88142) exited]
[Thread 0x7fff75ffb640 (LWP 88141) exited]
[Thread 0x7fff767fc640 (LWP 88140) exited]
[Thread 0x7fff76ffd640 (LWP 88139) exited]
[Thread 0x7fff777fe640 (LWP 88138) exited]
[Thread 0x7fff77fff640 (LWP 88137) exited]
[Thread 0x7fff98ff9640 (LWP 88136) exited]
[Thread 0x7fff997fa640 (LWP 88135) exited]
[Thread 0x7fff99ffb640 (LWP 88134) exited]
[Thread 0x7fff9a7fc640 (LWP 88133) exited]
[Thread 0x7fff9affd640 (LWP 88132) exited]
[Thread 0x7fff9b7fe640 (LWP 88131) exited]
[Thread 0x7fff9bfff640 (LWP 88130) exited]
[Thread 0x7fffa8fff640 (LWP 88129) exited]
[Thread 0x7fffccae4640 (LWP 88128) exited]

Thread 1 "gqrx" received signal SIGSEGV, Segmentation fault.
__pthread_setaffinity_new (th=140735785453120, cpusetsize=128, cpuset=0x7fffffffca40) at ./nptl/pthread_setaffinity.c:32
32  ./nptl/pthread_setaffinity.c: No such file or directory.
(gdb) bt
#0  __pthread_setaffinity_new (th=140735785453120, cpusetsize=128, cpuset=0x7fffffffca40) at ./nptl/pthread_setaffinity.c:32
#1  0x00007ffff5bf749a in gr::thread::thread_bind_to_processor(unsigned long, std::vector<int, std::allocator<int> > const&)
    (thread=140735785453120, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/thread/thread.cc:247
#2  0x00007ffff5b4c74a in gr::block_detail::set_processor_affinity(std::vector<int, std::allocator<int> > const&)
    (this=0x555555b7a7c0, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/block_detail.cc:225
#3  0x00007ffff5b3e412 in gr::block::set_processor_affinity(std::vector<int, std::allocator<int> > const&)
     (this=0x555555d86000, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/block.cc:299
#4  0x00007ffff5ba0b82 in gr::hier_block2_detail::set_processor_affinity(std::vector<int, std::allocator<int> > const&)
    (this=0x555555b2be90, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/hier_block2_detail.cc:894
#5  0x00007ffff5b955ee in gr::hier_block2::set_processor_affinity(std::vector<int, std::allocator<int> > const&)
    (this=0x555555a9e2f0, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/hier_block2.cc:133
#6  0x00007ffff5ba0b82 in gr::hier_block2_detail::set_processor_affinity(std::vector<int, std::allocator<int> > const&)
    (this=0x555555d06980, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/hier_block2_detail.cc:894
#7  0x00007ffff5b955ee in gr::hier_block2::set_processor_affinity(std::vector<int, std::allocator<int> > const&)
    (this=0x555555b89510, mask=std::vector of length 2, capacity 2 = {...}) at /home/jwp/project_work/gnuradio-3.10.11.0/gnuradio-runtime/lib/hier_block2.cc:133
#8  0x000055555569549e in receiver::stop() (this=0x555555ea1200) at /home/jwp/project_work/gqrx-2.17.5/src/applications/gqrx/receiver.cpp:176
#9  0x0000555555684f9d in MainWindow::on_actionDSP_triggered(bool) (this=0x7fffffffdd50, checked=false)
    at /home/jwp/project_work/gqrx-2.17.5/src/applications/gqrx/mainwindow.cpp:1970
#10 0x000055555565fd89 in MainWindow::qt_static_metacall(QObject*, QMetaObject::Call, int, void**)
    (_o=0x7fffffffdd50, _c=QMetaObject::InvokeMetaMethod, _id=62, _a=0x7fffffffd010)
    at /home/jwp/project_work/gqrx-2.17.5/build/src/gqrx_autogen/JZP7SBZKMR/moc_mainwindow.cpp:505
#11 0x00005555556600e0 in MainWindow::qt_metacall(QMetaObject::Call, int, void**) (this=0x7fffffffdd50, _c=QMetaObject::InvokeMetaMethod, _id=62, _a=0x7fffffffd010)
    at /home/jwp/project_work/gqrx-2.17.5/build/src/gqrx_autogen/JZP7SBZKMR/moc_mainwindow.cpp:570
#12 0x00007ffff66f14e5 in  () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#13 0x00007ffff7365be6 in QAction::triggered(bool) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#14 0x00007ffff73688fc in QAction::activate(QAction::ActionEvent) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#15 0x00007ffff746408a in  () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#16 0x00007ffff74641e7 in QAbstractButton::mouseReleaseEvent(QMouseEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#17 0x00007ffff7560d9e in QToolButton::mouseReleaseEvent(QMouseEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#18 0x00007ffff73af4ee in QWidget::event(QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
--Type <RET> for more, q to quit, c to continue without paging--
#19 0x00007ffff736c713 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#20 0x00007ffff7374364 in QApplication::notify(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#21 0x00007ffff66b9e3a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#22 0x00007ffff7372e47 in QApplicationPrivate::sendMouseEvent(QWidget*, QMouseEvent*, QWidget*, QWidget*, QWidget**, QPointer<QWidget>&, bool, bool) ()
    at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#23 0x00007ffff73c8d40 in  () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#24 0x00007ffff73cbfd5 in  () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#25 0x00007ffff736c713 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Widgets.so.5
#26 0x00007ffff66b9e3a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#27 0x00007ffff6b41307 in QGuiApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent*) () at /lib/x86_64-linux-gnu/libQt5Gui.so.5
#28 0x00007ffff6b16a2c in QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib/x86_64-linux-gnu/libQt5Gui.so.5
#29 0x00007ffff1aebd6e in  () at /lib/x86_64-linux-gnu/libQt5XcbQpa.so.5
#30 0x00007ffff4451d3b in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#31 0x00007ffff44a72b8 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#32 0x00007ffff444f3e3 in g_main_context_iteration () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#33 0x00007ffff67130b8 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#34 0x00007ffff66b875b in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#35 0x00007ffff66c0cf4 in QCoreApplication::exec() () at /lib/x86_64-linux-gnu/libQt5Core.so.5
#36 0x00005555556724f7 in main(int, char**) (argc=1, argv=0x7fffffffe0e8) at /home/jwp/project_work/gqrx-2.17.5/src/applications/gqrx/main.cpp:161
(gdb) 

I don't know why the exception was caught in the test above but not in this one.

Either way, the fault is there: calling 'set_processor_affinity()' after the block has been stopped is not handled correctly.