Query: DNNL behavior when application is multithreaded. OMP_NUM_THREADS have an performance impact

avinashcpandey commented 4 years ago

I am running some Tensorflow experiments with and without DNNL using OMP_NUM_THREADS=1 and more. If I set OMP_NUM_THREADS it will have effect on Tensorflow as well as DNNL...both have their own parallel implementation.

I want to run Tensorflow with single thread(OMP_NUM_THREADS =1) and DNNL multithreaded. How do I do that? I want to see how DNNL function performs when Tensorflow is single thread and DNNL is multithreaded

When I set OMP_NUM_THREADS it will effect both. Similar to MKL_NUM_THREADS do we have anything in DNNL to control this?

Environment

Build TF with DNNL Exporting OMP_NUM_THREADS

Actual behavior

Doing good

Expected behavior

Doing good

rsdubtso commented 4 years ago

Tensorflow built with DNNL does not really change its behavior based on OMP_NUM_THREADS, at least in the recent versions:

Here's code from threadpool_device.cc:45 contributed in 2018.

ThreadPoolDevice::ThreadPoolDevice(const SessionOptions& options,
                                   const string& name, Bytes memory_limit,
                                   const DeviceLocality& locality,
                                   Allocator* allocator)
    : LocalDevice(options, Device::BuildDeviceAttributes(
                               name, DEVICE_CPU, memory_limit, locality)),
      allocator_(allocator),
      scoped_allocator_mgr_(new ScopedAllocatorMgr(name)) {
#ifdef INTEL_MKL
#ifdef _OPENMP
  const char* user_omp_threads = getenv("OMP_NUM_THREADS");
  if (user_omp_threads == nullptr) {
    // OMP_NUM_THREADS controls MKL's intra-op parallelization
    // Default to available physical cores
    const int mkl_intra_op = port::NumSchedulableCPUs();
    const int ht = port::NumHyperthreadsPerCore();
    omp_set_num_threads((mkl_intra_op + ht - 1) / ht);
  } else {
    uint64 user_val = 0;
    if (strings::safe_strtou64(user_omp_threads, &user_val)) {
      // Superflous but triggers OpenMP loading
      omp_set_num_threads(user_val);
    }
  }
#endif  // _OPENMP
#endif  // INTEL_MKL
}

Some Python scripts do set intra/inter-op values based on OMP_NUM_THREADS, but this is easily fixable and has nothing to do with DNNL.

avinashcpandey commented 4 years ago

Thanks for the prompt reply. Other thing I am looking for is if I am running Tensorflow with single thread(using this OMP_NUM_THREADS =1) then to run DNNL multithreaded what I need to do? I want to override OMP_NUM_THREADS behaviour for DNNL though something....like we use MKL_NUM_THREADS for BLAS routine with mkl library.

vpirogov commented 4 years ago

@avinashcpandey, I'm not sure I understand the question. Tensorflow built with DNNL has a few knobs that manage threading behavior:

Standard TF controls (inter_op_parallelism_threads and intra_op_parallelism_threads) that affect size of TF's threadpool used by Eigen-based operations and the number of execution threads.
OpenMP controls that affect DNNL behavior (OMP_NUM_THREADS, KMP_AFFINITY). These do not impact what happens to Eigen threadpool.

So I would say for the experiment you want to run you need something like

inter_op_parallelism_threads=1, intra_op_parallelism_threads=1 OMP_NUM_THREADS=N
inter_op_parallelism_threads=1, intra_op_parallelism_threads=N OMP_NUM_THREADS=1

DNNL does not provide a way to control number of threads it uses besides the mechanisms provided by threading runtime.

avinashcpandey commented 4 years ago

Thanks Vadim! I got your point.

oneapi-src / oneDNN