open-telemetry / opentelemetry-cpp

The OpenTelemetry C++ Client
https://opentelemetry.io/
Apache License 2.0
890 stars 427 forks source link

TlsRandomNumberGenerator causes memory corruption when the process is forked. #2408

Open ghalliday opened 1 year ago

ghalliday commented 1 year ago

I manage an open source project https://github.com/hpcc-systems/HPCC-Platform which we adding support to open telemetry using this library.

One of the components launches large numbers (100+) of child processes at the same time. Since adding the open telemetry library this is now the system to core when those child processes are started. Here is an example of a stack trace from gdb:

#0  0x00007fc40090eb5d in read () from /usr/lib64/libc.so.6
#1  0x00007fc40089ad54 in __GI__IO_file_underflow () from /usr/lib64/libc.so.6
#2  0x00007fc400899808 in __GI__IO_file_xsgetn () from /usr/lib64/libc.so.6
#3  0x00007fc40088e1ef in fread () from /usr/lib64/libc.so.6
#4  0x00007fc4011baa50 in std::random_device::_M_getval() () from /usr/lib64/libstdc++.so.6
#5  0x00007fc40183995b in opentelemetry::v1::sdk::common::(anonymous namespace)::TlsRandomNumberGenerator::Seed() ()
   from /opt/HPCCSystems/lib/libopentelemetry_common.so
#6  0x00007fc4008e4c4e in fork () from /usr/lib64/libc.so.6
#7  0x00007fc4036aeb69 in CLinuxPipeProcess::run (this=0x7fc352cb63a0)
    at /hpcc-dev/HPCC-Platform/system/jlib/jthread.cpp:2072

I believe the problem is caused by the following code:

  TlsRandomNumberGenerator() noexcept
  {
    Seed();
    platform::AtFork(nullptr, nullptr, OnFork);
  }

where an onFork handler was added inside the random number generator made in 2018.

There are only a very restricted set of functions that are valid to be called at that point within the child process (https://man7.org/linux/man-pages/man3/pthread_atfork.3.html): they must be async-signal-safe, and not use any heap functions because that can cause memory corruption. This onFork call does not obey those restrictions. It also performs unnecessary work whenever you create a child process (I suspect it now takes a long time to start the child processes because of the need to wait for sufficient entropy from the random number generator).

I believe the fix is to delete that call to onFork, and re-examine why that change was made. In particular, why would a process ever call the open telemetry functions inside the child of a fork()?

ghalliday commented 1 year ago

Here is my suggested replacement code (untested):

namespace
{
class TlsRandomNumberGenerator
{
public:
  TlsRandomNumberGenerator() noexcept
  {
    Seed();
  }

  FastRandomNumberGenerator & engine() noexcept { return engine_; }

private:
  FastRandomNumberGenerator engine_;

  void Seed() noexcept
  {
    std::random_device random_device;
    std::seed_seq seed_seq{random_device(), random_device(), random_device(), random_device()};
    engine_.seed(seed_seq);
  }
};

}  // namespace

FastRandomNumberGenerator &Random::GetRandomNumberGenerator() noexcept
{
  static thread_local TlsRandomNumberGenerator random_number_generator{};
  return random_number_generator.engine();
}
github-actions[bot] commented 10 months ago

This issue was marked as stale due to lack of activity.

rpastrana commented 10 months ago

How do we remove the stale label off this jira?