oneapi-src / oneTBB

oneAPI Threading Building Blocks (oneTBB)
https://oneapi-src.github.io/oneTBB/
Apache License 2.0

Non-TBB thread that allocated task for TBB is used by TBB to run its task queue during thread cleanup. #66

Open LBensman opened 6 years ago

LBensman commented 6 years ago

Synopsis

Observed a thread external to TBB - i.e. a thread that was not created by the TBB thread pool - being used to execute the TBB task queue. Furthermore, the thread was executing the TBB queue while it was being torn down and cleaned up by the operating system, specifically during deallocation of its TSD (Thread Specific Data) resources, from pthread's __nptl_deallocate_tsd function.

Issue

When a non-TBB thread allocates a task, TBB attaches its scheduler to the thread by registering it in a TSD slot, along with a deallocator function (e.g. via task's ::allocate() -> governor::local_scheduler_weak() -> theTLS.get() -> governor::init_scheduler_weak() -> generic_scheduler::create_master() -> governor::sign_on() -> governor::assume_scheduler() -> theTLS.set()). Because it is not a TBB pool thread, the thread completes its execution per user code and exits its start routine. pthreads then takes over and begins tearing the thread down and deallocating its resources: TLS, then TSD, and so on. As part of TSD deallocation, pthreads runs the deallocator functions registered with TSD, which per the registration above starts execution of cleanup_master(); that eventually has the thread (or, more appropriately, the semi-dead thread) executing TBB tasks, with TBB essentially appropriating a "used up" thread into its pool.

Below is a recovered stack trace showing the half-dead thread entering user code to execute an enqueued task. Note that this is a partial stack, with the bottom-most frames up to a frame in user code, but no top frames:

[014] 0x47e73c           my_prog              0x47e730           000000000C mp::TbbThreadPool::ThreadPoolTask::execute()
[015] 0x7feaf460ba06     libtbb.so.2          0x7feaf460b4c0     0000000546 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)
[016] 0x7feaf4608557     libtbb.so.2          0x7feaf4608270     00000002E7 tbb::internal::generic_scheduler::cleanup_master(bool)
[017] 0x7feaf338d409     libpthread.so.0      0x7feaf338d380     0000000089 __nptl_deallocate_tsd.part.5
[018] 0x7feaf338e36d     libpthread.so.0      0x7feaf338e1e0     000000018D start_thread
[019] 0x7feaf30c557f     libc.so.6            0x7feaf30c5579     0000000006 clone

The bad side effect

This bug has caused all kinds of funky behavior, but the most notable instance is that it caused the user code to try to join a thread on itself. Consider the following pseudocode, which roughly describes the code from which the aforementioned stack trace was collected:

void eventA()
{
   _devices.add(new Device());
}

void eventB(Device* device)
{
   _devices.remove(device);
   delete device;
}

Device::Device()
{
   _worker = new std::thread(&Device::workerMain, this);
}

Device::~Device()
{
    if (_worker) _worker->join();
}

void Device::workerMain()
{
   init();
   tbb::spawn(task_to_announce_device_online);
   doSomeBusyStuff();
   tbb::spawn(task_to_announce_device_offline);  // <--- which generates eventB!
}

The code would occasionally crash due to an exception when calling _worker->join(), and the reason for the exception turned out to be that the id of the joining thread was the same as the id of the thread being joined, which shouldn't happen for the above code. However, on occasion, depending on task scheduling and concurrency outcomes, when the worker thread exited its workerMain it went into __nptl_deallocate_tsd, which then called cleanup_master(), which started pumping tasks. If this thread then picked up task_to_announce_device_offline (which calls eventB, which calls ~Device, which calls join()), we had the case where the worker thread, having completed its start routine, was trying to join the worker thread - that is, itself - a violation of std::thread's preconditions, and of the pthread library's as well.

TBB shouldn't pollute threads it doesn't own; more importantly, if it needs to attach itself to the TSD of non-pool threads, it needs to be very careful not to use them as pool threads, especially when those threads are in the dying stage of their lifecycle.

alexey-katranov commented 6 years ago

The task scheduler does not guarantee that potentially parallel tasks will be executed in parallel. This means an algorithm cannot rely on the assumption that a task will be executed by some concurrent thread. We call this behavior "serial semantics", i.e. an application that uses TBB algorithms can be safely executed with one thread. This behavior simplifies debugging and provides composability opportunities.

One more idea is that (historically) TBB is a compute-oriented library that considers the user thread a master thread participating in parallel work together with the worker threads on equal terms. Therefore, by default, the thread pool contains one thread fewer than the hardware concurrency, to avoid oversubscription.

When a user thread is about to be destroyed, it should clean up its associated resources (e.g. internal data structures). If no worker thread has joined the internal data structures yet, the user thread cannot assume that some worker thread will eventually join and perform the cleanup. If the user thread observes remaining tasks, that is an invalid situation, and our last effort is to execute them to prevent resource leaks and possible deadlocks (someone may expect these tasks to be executed). This last effort is almost safe if "serial semantics" is kept.

The example that joins the thread that spawned a task breaks the "serial semantics" idea and requires so-called "mandatory concurrency" behavior. It can be achieved with TBB, but via other interfaces. As a straightforward replacement, you can consider the tbb::task::enqueue interface, which implements fire-and-forget semantics. If you create and destroy external threads quite often, you may want to consider tbb::task_arena or the flow graph interfaces, so as not to reinitialize the internal data structures associated with external threads.

LBensman commented 6 years ago

It generally makes sense, with a couple of exceptions.

I'm not sure exactly what you mean by "if no one worker thread has not joined the internal data structures yet." Since at least one thread had to exist and initialize the thread pool itself, that thread can be made the owning thread of the pool, and when that thread, and only that thread, is ready to be done, it then cleans up the thread pool. This perfectly handles the single-core case where only one thread owns the pool; and when more than one thread participates in the pool (i.e. where TBB created additional threads because the hardware concurrency is greater than one), the pool-owning thread can wait for the others to finish their jobs.

With respect to "invalid situation and our last effort is to execute them to prevent resource leaks and possible deadlocks (someone can expect these tasks to be executed)... is almost safe," I somewhat disagree with that notion, IMHO. It's a well-established pattern that the client of a library has certain obligations and is expected to code in a way that avoids leaks, finishing before pending operations complete, and so on. In fact, many APIs are purposefully stringent and avoid being "nice," to help ensure that the result of a computation is the expected one (e.g. std::thread::join could silently no-op if the thread is not joinable, but instead it vehemently protests by throwing an exception, since the condition likely indicates a flaw in the code's logic). Hence, I would expect the same from TBB - if threads are gone and work is pending, the client screwed up and the computation is likely already wrong.

However, I think it gets worse when trying to run tasks during thread cleanup. The TBB docs state that it can be used with other threading packages, and it is entirely possible that some tasks call into third-party APIs that launch their own threads, rely on TLS statics, etc. Running those tasks on a half-dead thread can cause silently faulty computations and thus lead to very hard-to-trace issues.

nofuturre commented 4 months ago

@LBensman is this issue still relevant?

LBensman commented 4 months ago

@nofuturre, heh, that's a bit of a 6yo throwback.

It's a bit hard to answer. Following Alexey's answer, specifically the

... it is an invalid situation and our last effort is to execute them to prevent resource leaks and possible deadlocks (someone can expect these tasks to be executed). This last effort is almost safe if ...

... part, we decided to ban TBB in our code and simply turned it off in favor of our own simple thread pool implementation (even if lock-based). We haven't looked back in the last 6 years. So from this perspective, it's not relevant -- we just don't use it (even though our build setup still supports it and we could switch back to it).

But I think it's still quite relevant from the perspective that there's a design issue that can lead to the behavior described and filed. IMHO, it is fundamentally wrong for the library to attempt to continue running tasks during the thread cleanup stage, when the thread is considered "dead and completed," as a "last effort to execute." It's clearly trying to cover up the user's mistake, and it makes the issue even harder to troubleshoot - while at the same time not guaranteeing that it won't hit the same flaw in the user's logic and complete its cleanup phase with tasks still remaining in the queue. Mature, well-designed APIs check for such conditions and explicitly fail, indicating to the user that there's a flaw somewhere in their use of the library's API and thus signaling the bug early (hence the example of the thread join call throwing, because a thread should never join itself, even though I'm sure the STL could internally be written to "handle" such a case; the STL is explicit in signaling to the user that their code has a bug).

We lost a lot of time trying to figure out the odd and unexpected behavior. IIRC, I spent several weeks chasing it, and wound up creating my own entire abstraction over the std threading facilities so I could intercept, log, and trace all threading calls to figure out what was going on. It was expensive troubleshooting time, and because of it we lost faith in the library and stopped using it. We never tried the enqueue() route that supposedly would not have exhibited the same behavior; even if enqueue() is purported to work that way, the overall design of the library was in question, and we weren't willing to risk it further given the pain we had already incurred (expecting other surprises).

Has it been fixed in the last 6 years? I don't know. If it has, I'd be willing to try it again, as lock-free threading is more efficient. But the important point, back to your question, is that from the perspective of sane API behavior, unless it has been fixed, this is still an issue.

pavelkumbrasev commented 4 months ago

Hi @LBensman, I think with the revamp to oneTBB this issue was fixed. At least I did not see any code that would force a user thread to join the scheduling loop upon its destruction. https://github.com/oneapi-src/oneTBB/blob/bad4c42be4623b12ac9ff72138e98775555f9153/src/tbb/governor.cpp#L237 Currently this function is called as part of the thread clean-up routine and it is non-blocking. Could you please check whether you can still observe the issue on your side?

LBensman commented 2 days ago

Could you please check if you can still observe the issue on your side?

Unfortunately I cannot anymore, as I'm no longer with the company 😞 .