manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License
163 stars 50 forks source link

Change OpenMP scheduling to from `dynamic` to `static` #264

Open manodeep opened 2 years ago

manodeep commented 2 years ago

General information

Issue description

OpenMP install with gcc/9.2.0 produces incorrect results when invoked from python. This was originally reported in #197, and then rediscovered in #258

Expected behaviour

The results with OpenMP should be correct when invoked through python

Actual behaviour

The pair counts are nthreads times larger when using the python wrappers. The results are correct when directly using the C static library.

What have you tried so far?

Compiled without OpenMP makes this problem go away

Changing the OpenMP scheduling to static makes the problem less severe - ie, the pair counts are off by a little bit rather than nthreads times. As an added benefit, the code also appears to run marginally faster

Minimal failing example

After installing with the changes in #258, the following fails:

python -m Corrfunc.tests 
manodeep commented 2 years ago

@lgarrison I was thinking of implementing this change asap - what do you think?

lgarrison commented 2 years ago

Not sure... I'm worried that static scheduling will hurt the performance for more clustered cases. I'm also not sure "hiding" the problem with static scheduling is a good idea, even as a stop-gap measure, because we don't really understand all the ways this problem can manifest. I think keeping the problem highly visible and obviously wrong, as it is now, might be preferable.

manodeep commented 2 years ago

Yeah that's the choice - would it be better to be disastrously wrong or just slightly wrong? Disastrously wrong has a greater chance of people noticing but for those that may not notice, the results will be wrong

I will close this then and solve/detect the race condition as we are progressing on #258

lgarrison commented 2 years ago

Speaking of detecting the race condition, maybe it's worth trying the GCC 9.2 thread sanitizer on the Python extension? The memory/address sanitizers were always hard to run because of all the false positives from Python, but perhaps the thread sanitizer will behave better?

manodeep commented 2 years ago

Ohh good idea - I will try that later today

lgarrison commented 2 years ago

Another idea: maybe worth trying OMP_NESTED=true and OMP_NESTED=false? I don't think we're (intentionally) using any nested parallelism, but if somehow we are, it could mess with the thread IDs...

An even simpler version of this train of thought: can we print the thread IDs and check that they are unique?

manodeep commented 2 years ago

I tried omp_set_nested(0) but that did not change anything. Good idea with printing the thread ids - let me try that and I will report back

manodeep commented 2 years ago

Nothing stands out with the printed threads. However, setting OMP_DISPLAY_ENV=verbose does show two OPENMP environments from python but only 1 from C

output from C test -> make tests_bin that passes
======================================

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '4'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END

output from python test -> python -m Corrfunc.tests that fails
=============================================

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '4'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] KMP_ABORT_DELAY='0'
  [host] KMP_ADAPTIVE_LOCK_PROPS='1,1024'
  [host] KMP_ALIGN_ALLOC='64'
  [host] KMP_ALL_THREADPRIVATE='144'
  [host] KMP_ATOMIC_MODE='2'
  [host] KMP_BLOCKTIME='200'
  [host] KMP_CPUINFO_FILE: value is not defined
  [host] KMP_DETERMINISTIC_REDUCTION='FALSE'
  [host] KMP_DEVICE_THREAD_LIMIT='2147483647'
  [host] KMP_DISP_HAND_THREAD='FALSE'
  [host] KMP_DISP_NUM_BUFFERS='7'
  [host] KMP_DUPLICATE_LIB_OK='FALSE'
  [host] KMP_ENABLE_TASK_THROTTLING='TRUE'
  [host] KMP_FORCE_REDUCTION: value is not defined
  [host] KMP_FOREIGN_THREADS_THREADPRIVATE='TRUE'
  [host] KMP_FORKJOIN_BARRIER='2,2'
  [host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_FORKJOIN_FRAMES='TRUE'
  [host] KMP_FORKJOIN_FRAMES_MODE='3'
  [host] KMP_GTID_MODE='3'
  [host] KMP_HANDLE_SIGNALS='FALSE'
  [host] KMP_HOT_TEAMS_MAX_LEVEL='1'
  [host] KMP_HOT_TEAMS_MODE='0'
  [host] KMP_INIT_AT_FORK='TRUE'
  [host] KMP_ITT_PREPARE_DELAY='0'
  [host] KMP_LIBRARY='throughput'
  [host] KMP_LOCK_KIND='queuing'
  [host] KMP_MALLOC_POOL_INCR='1M'
  [host] KMP_MWAIT_HINTS='0'
  [host] KMP_NUM_LOCKS_IN_BLOCK='1'
  [host] KMP_PLAIN_BARRIER='2,2'
  [host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_REDUCTION_BARRIER='1,1'
  [host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_SCHEDULE='static,balanced;guided,iterative'
  [host] KMP_SETTINGS='FALSE'
  [host] KMP_SPIN_BACKOFF_PARAMS='4096,100'
  [host] KMP_STACKOFFSET='64'
  [host] KMP_STACKPAD='0'
  [host] KMP_STACKSIZE='8M'
  [host] KMP_STORAGE_MAP='FALSE'
  [host] KMP_TASKING='2'
  [host] KMP_TASKLOOP_MIN_TASKS='0'
  [host] KMP_TASK_STEALING_CONSTRAINT='1'
  [host] KMP_TEAMS_THREAD_LIMIT='36'
  [host] KMP_TOPOLOGY_METHOD='all'
  [host] KMP_USER_LEVEL_MWAIT='FALSE'
  [host] KMP_USE_YIELD='1'
  [host] KMP_VERSION='FALSE'
  [host] KMP_WARNINGS='TRUE'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='VERBOSE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS: value is not defined
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='8M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
  [host] KMP_AFFINITY='noverbose,warnings,respect,granularity=core,none'
OPENMP DISPLAY ENVIRONMENT END

Another weird thing that I noted is that even though the make tests_bin and the python -m Corrfunc.tests have the same number of bins, the total number of cell-pairs created is different -- 359833 cell pairs when called from python, and 360051 from C (through make). This could be why the static scheduling only fails mildly, rather than spectacularly as with dynamic scheduling.

I have also discovered that there is a race condition with numdone that results in numdone > num_cell_pairs. This race condition does not change even if I change the numdone update from atomic to critical.

JEEZ

manodeep commented 2 years ago

Another weird thing that I noted is that even though the make tests_bin and the python -m Corrfunc.tests have the same number of bins, the total number of cell-pairs created is different -- 359833 cell pairs when called from python, and 360051 from C (through make). This could be why the static scheduling only fails mildly, rather than spectacularly as with dynamic scheduling.

Solved this one - the C tests do not enable_min_sep_opt whereas the python tests do. When I set options.enable_min_sep_opt = 1 within test_periodic_lin.c, then I get the 359833 cell-pairs.

manodeep commented 1 year ago

I did some further reading (can't find the reference - the name looked familiar - so someone in the open-source python ecosystem) and it is possible that the OpenMP issues are occurring because OpenMP is linked while the static library is generated (i.e., libcountpairs.a etc) and those static libraries are then again linked into the dynamic C extension shared library. Might be worthwhile to create a dynamic library for the Corrfunc pair-counters in the first place (or use a dynamic library when linking to the C extensions).

manodeep commented 10 months ago

Creating shared libraries (instead of static libraries) did not resolve the issue of multiple OpenMP runtime libraries (libgomp from compiling with GCC and libiomp5 from mkl within numpy). However, setting "export MKL_THREADING_LAYER=GNU" resolved the problem entirely.

Source - https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md (This isn't the article I was referring to in the previous comment)

manodeep commented 10 months ago

Setting that environment variable with os.environ does not work. Though I am not sure I understand why that is, since the documentation says all child processes should inherit the new environment variable. I added a small snippet of code with DD to test whether the OMP runtime environment was working correctly:

#if defined(_OPENMP)
#include <omp.h>

int test_omp_duplicate_runtime_lib_DOUBLE(const int64_t N);

int test_omp_duplicate_runtime_lib_DOUBLE(const int64_t N)
{
    int64_t correct_sum=0.0;
    for(int64_t i=0;i<N;i++) {
        correct_sum += i;
    }

    int64_t omp_sum=0.0;
#pragma omp parallel for schedule(dynamic) reduction(+:omp_sum)
    for(int64_t i=0;i<N;i++) {
        omp_sum += i;
    }
    if(omp_sum != correct_sum) {
        fprintf(stderr,"Basic OMP reduction is causing errors. Expected sum = %"PRId64" sum with OpenMP + reduction = %"PRId64"\n",
                correct_sum, omp_sum);
        return EXIT_FAILURE;
    }
    fprintf(stderr,"Correct result returned\n");
    return EXIT_SUCCESS;
}
#endif

For the case with duplicate OMP libraries, this test code errors with schedule(dynamic) but seems to work fine with schedule(static).

Separately, when setting MKL_THREADING_LAYER=GNU outside within the terminal before invoking Corrfunc (python call_correlation_functions.py) still seems to produce slightly incorrect (but not wildly incorrect) results for the final 2-3 bins.