manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

running Corrfunc with nthreads>1 on cluster and some strange results #197

Open zxzhai opened 4 years ago

zxzhai commented 4 years ago

General information

Hi, I installed Corrfunc on a cluster and ran some simple tests with DDsmu_mocks. When I specified nthreads>1, the resulting pair counts were nthreads times the single-thread result, and the runtime was also nthreads times longer. This is very strange: it looks as if each thread runs consecutively and processes the full set of points itself, instead of the work being split between threads.

I also tested the same code on my laptop and on another cluster, and there was no problem there: the results for different nthreads are identical and the runtime is (roughly) nthreads times faster. This implies the problem only exists on this particular cluster, but I don't understand what about the configuration of this cluster could affect the code.

May I ask whether any of the developers have had a similar experience, and if so, whether you have any suggestions?

Thanks!

Minimal failing example

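A minimal sketch of the kind of test described above (the points are random placeholders rather than my actual catalogue, and the DDsmu_mocks argument names may need adjusting for your Corrfunc version):

import numpy as np
from Corrfunc.mocks.DDsmu_mocks import DDsmu_mocks

# Placeholder survey-like points, just to exercise the pair counter.
rng = np.random.default_rng(42)
N = 100_000
ra = rng.uniform(0.0, 360.0, N)                 # degrees
dec = rng.uniform(-30.0, 30.0, N)               # degrees
cz = rng.uniform(10_000.0, 30_000.0, N)         # km/s
sbins = np.linspace(0.1, 30.0, 15)              # s bin edges in Mpc/h

totals = {}
for nthreads in (1, 4):
    res = DDsmu_mocks(autocorr=1, cosmology=1, nthreads=nthreads,
                      mu_max=1.0, nmu_bins=20, binfile=sbins,
                      RA1=ra, DEC1=dec, CZ1=cz)
    totals[nthreads] = res['npairs'].sum()
    print(nthreads, totals[nthreads])

# On the problematic cluster the nthreads=4 total comes out roughly 4x the
# nthreads=1 total; on my laptop and the other cluster the two totals agree.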
lgarrison commented 4 years ago

Thanks for the report! That's pretty strange; I certainly haven't seen Corrfunc do that before. Does the problem occur only for DDsmu_mocks, or for other estimators as well?

Could you export OMP_DISPLAY_ENV=TRUE as an environment variable in your shell and then run your script? That should print some diagnostic information about the OpenMP setup.
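If it is more convenient, setting the variable from inside the script probably works too, as long as it happens before anything loads an OpenMP runtime (a sketch, not something I've verified on your cluster):

import os
# Must be set before any import that loads an OpenMP runtime (Corrfunc itself,
# and possibly numpy/scipy builds that link one), otherwise it has no effect.
os.environ["OMP_DISPLAY_ENV"] = "TRUE"

import Corrfunc.mocks  # the OpenMP banner(s) should print when the runtime initializes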

manodeep commented 4 years ago

Thanks @zxzhai for the report. Could you please follow what @lgarrison suggested above? It seems that OpenMP might need to be explicitly enabled at runtime.

When you installed Corrfunc with pip on that cluster, were all the required modules loaded explicitly? Otherwise, the Corrfunc install might have proceeded with the compiler supplied with the OS, and that compiler might not come with OpenMP support.

zxzhai commented 4 years ago

Hi @lgarrison and @manodeep , thanks for the suggestions!

I did two tests, and here is what I got:

I tested DDrppi_mocks from Corrfunc.mocks.DDrppi_mocks, and it showed the same problem on that particular cluster.

When I did export OMP_DISPLAY_ENV=TRUE and reran the code, it gave me the following information:

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '24'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP='201611'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS: value is not defined
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END

There seems to be some inconsistency: the two _OPENMP values are different (201511 vs 201611). I did the same thing on the other computer that has no problem, and there the two _OPENMP values are the same (both 201511). So I suspect this might be the reason. I will check whether I can fix it.

lgarrison commented 4 years ago

I agree it looks like two different OpenMP libraries are getting loaded at runtime, possibly GNU and Intel (libgomp.so and libiomp5.so), or maybe just two different versions of the same library (i.e. one extension has the RPATH set and one doesn't, so they find different versions). I've also seen this with Anaconda Python, because the Anaconda python executable has the RPATH set to the Anaconda lib directory, which often contains libgomp.so or libiomp5.so, or both! But the Corrfunc compiler doesn't know about that, so it may compile against a different OpenMP library than Anaconda's, and then OpenMP is resolved to Anaconda's at runtime. I think this is supposed to be okay, but it clearly can be harmful if certain OpenMP features are promised at compile time that can't be resolved at runtime. Whether that sort of confusion can also cause multiple OpenMP runtimes to get loaded, I'm not sure.

Static linking of OpenMP by other extensions or executables is another way multiple runtimes could get initialized. But I'm not sure if that would cause duplicate pair counts... although I don't understand how multiple dynamic runtimes would cause that either!
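One quick way to see which OpenMP runtimes actually end up mapped into the Python process is to read /proc/self/maps after the extensions are loaded (a Linux-only sketch; the substring matching is deliberately loose):

import Corrfunc.mocks  # importing the mocks subpackage should load the compiled extension

libs = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        parts = line.split()
        # Lines with 6 fields are file-backed mappings; the last field is the library path.
        if len(parts) == 6 and any(k in parts[-1] for k in ("libgomp", "libiomp", "libomp")):
            libs.add(parts[-1])

for lib in sorted(libs):
    print(lib)
# More than one distinct path here means multiple OpenMP runtimes are mapped
# into the same process.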

I think I'd recommend what @manodeep suggested: try uninstalling Corrfunc, making sure all your compiler modules are loaded, and then reinstall. If pip doesn't work, try building from source where you can specify the correct compiler manually. And maybe try both inside and outside of Anaconda Python (if you're using that).

zxzhai commented 4 years ago

Thanks for the suggestions. I think I've solved this problem, though I don't completely understand why. What I learned is: install Corrfunc from source, don't use pip to install it.

My .bashrc points to another gcc installation for a different code (not Corrfunc). I had to switch off all of that related setup, which means the OpenMP library is now just the system default. After that I reinstalled Corrfunc from source, and the problem seems to be solved. One place to check is the "CC :=" line in the common.mk file, in case other people run into the same problem in the future.

The part I don't understand is that even now that the problem is solved (the output no longer depends on nthreads and the speed scaling is fine), export OMP_DISPLAY_ENV=TRUE still prints the same two inconsistent sections. So it looks like the different OpenMP versions, or different versions of the same library, do not affect the result (at least in this scenario); the earlier error was caused by something else that is still unknown, perhaps related to how Python uses OpenMP and at which step the library gets loaded.

lgarrison commented 4 years ago

Thanks for reporting back! This is all really good to know. It's something of a relief that the multiple OpenMP versions aren't clashing, because I don't know how they would have been executing the same parallel region. I think the "inconsistent" OpenMP libraries could easily be coming from different Python packages that were compiled with different RPATH or static linking, which should all be safe.

I'm happy to help if you'd like to dig into the other compiler to try to figure out how it caused this behavior, but otherwise feel free to close the issue.

samotracio commented 4 years ago

Hi, just sharing that I am facing similar issues with a very different Fortran code and OpenMP. For some reason, the entire parallel DO loop is executed by each thread when nthreads>1, leading to a compute time equal to nthreads x single_thread_time. This only happens on a virtual cluster when OMP_SCHEDULE is set to something other than the default value ("static"), or when a scheduling other than static is specified directly on the DO loop. If OMP_SCHEDULE=auto or static, it works fine.

For reference, I also have exactly the same "conflicting" _OPENMP preprocessor versions. I also noted that the problem only happens on a virtual server running on a 2-CPU machine; on other single-CPU machines the issue does not appear under any circumstance. On my laptop everything runs fine, but on the virtual server the problem appears. For me, static scheduling seems to work, but I will keep investigating and get back here if something useful surfaces.

manodeep commented 4 years ago

@zxzhai There is no real difference between a pip install and a git clone + install. Under the hood, the compiler is checked and set as appropriate for the underlying OS (gcc for Linux, clang for OSX).

@samotracio Your investigation seems quite relevant. The Corrfunc scheduling is always specified as dynamic, and from @zxzhai's report, the runtime OpenMP seems to be configured for static scheduling.

Christopher-Bradshaw commented 4 years ago

I'm also having problems with parallelization, perhaps related, perhaps totally different. In my case, setting nthreads appears to have no effect on runtime, and checking CPU usage via htop, I never appear to be using more than one thread.

I've added logging to make sure that numthreads is properly getting passed through and that _OPENMP is defined (e.g. here https://github.com/manodeep/Corrfunc/blob/d8a795859dc9a66b112c539f82da650fa0b8e586/mocks/DDrppi_mocks/countpairs_rp_pi_mocks_impl.c.src#L260). A toy script using OpenMP works, so I'm not sure it is a toolchain problem.

My system:

I'll keep looking, but any suggestions would be appreciated!

lgarrison commented 4 years ago

These problems smack of core affinity issues; i.e. the process affinity mask is set to only execute on one core. This can arise when using OMP_PROC_BIND, which does not appear to be the case here. Another way is when executing on a cluster using Slurm, LSF, or another queue system that launches your job or shell with a specific affinity mask (processes inherit their parent's affinity mask). Sometimes there are "resource binding" flags in the job allocation request that can affect affinity. @Christopher-Bradshaw, are you running on a cluster?

Even if not, a "rogue" Python package could be setting the affinity. Numpy does this, but only when using OMP_PROC_BIND, I think.

Regardless, I would try tracking the core affinity, starting at the C level inside Corrfunc to check if the affinity is actually restricted. Here is a sample program I have used in the past for this purpose:

// Compile with OpenMP enabled, e.g.: gcc -fopenmp check_affinity.c -o check_affinity
#define _GNU_SOURCE   // required for sched_getaffinity() and the CPU_* macros on glibc
#include <omp.h>
#include <stdio.h>
#include <sched.h>
#include <assert.h>
#include <stdlib.h>

int main(void){
    // First report the CPU affinity bitmask.
    cpu_set_t mask;
    int cpusetsize = sizeof(cpu_set_t);

    assert(sched_getaffinity(0, cpusetsize, &mask) == 0);

    int naff = CPU_COUNT_S(cpusetsize, &mask);

    printf("Core affinities (%d total): ", naff);
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if(CPU_ISSET(i, &mask))
            printf("%d ", i);
    }
    printf("\n");

    int maxthreads = omp_get_max_threads();
    int nprocs = omp_get_num_procs();

    printf("omp_get_max_threads(): %d\n", maxthreads);
    printf("omp_get_num_procs(): %d\n", nprocs);

    return 0;
}

omp_get_max_threads() and omp_get_num_procs() are useful because the latter will be less than the former if the affinity was restricted at program startup.

If the affinity is actually restricted, then try going one level higher, into Python:

import psutil
print('Python affinity #:', len(psutil.Process().cpu_affinity()))

Place that print statement in a few strategic spots in the code, e.g. at startup, after imports, before running Corrfunc, and after running Corrfunc. See if anything changes.
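For instance, a sketch of how those checks might be laid out in a driver script (the pair-counting call itself is just a placeholder for whatever you normally run):

import psutil

def report_affinity(label):
    # cpu_affinity() is implemented on Linux and Windows, but not on macOS
    print(label, 'affinity #:', len(psutil.Process().cpu_affinity()))

report_affinity('at startup')

import numpy as np
import Corrfunc.mocks               # loads the compiled extensions (and their OpenMP runtime)
report_affinity('after imports')

# ... build the ra, dec, cz arrays and the bins here ...
report_affinity('before Corrfunc')
# results = DDsmu_mocks(...)        # your actual pair-counting call
report_affinity('after Corrfunc')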

lgarrison commented 4 years ago

I forgot to mention that to check the affinity of the shell, in Bash one can use:

taskset -c -p $$

where $$ gives you the PID of the shell in Bash.

Christopher-Bradshaw commented 4 years ago

Thanks a lot for the suggestions, I'll give them a try now. I am not running on a cluster, just my local desktop.

lgarrison commented 4 years ago

Another possibility, totally unrelated to OpenMP: Corrfunc threads over cell pairs, so if the data are so strongly clustered that a single cell pair dominates the runtime (e.g. the autocorrelation of a single, very massive cell), then you will see a burst of multi-threaded activity at the beginning followed by a long period with only one thread running. You can alleviate this somewhat by specifying a larger max_cells_per_dim, and possibly the bin_refine_factors too. You can use verbose to see what grid size Corrfunc is using.
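For example, something along these lines (a hypothetical fragment that reuses placeholder ra, dec, cz and sbins arrays; the keyword names follow the Python docstrings, so check help(DDsmu_mocks) for your installed version):

from Corrfunc.mocks.DDsmu_mocks import DDsmu_mocks

# Push Corrfunc toward a finer grid so that no single cell pair dominates.
results = DDsmu_mocks(autocorr=1, cosmology=1, nthreads=4,
                      mu_max=1.0, nmu_bins=20, binfile=sbins,
                      RA1=ra, DEC1=dec, CZ1=cz,
                      max_cells_per_dim=200,     # allow a finer grid than the default
                      xbin_refine_factor=3,      # refine the gridding along each axis
                      ybin_refine_factor=3,
                      zbin_refine_factor=2,
                      verbose=True)              # prints the grid size Corrfunc chooses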

manodeep commented 4 years ago

I just encountered this issue while running on an interactive node via the Slurm queue. The solution was that I had to set --cpus-per-task to the maximum number of threads I was planning to use. Once I specified that, the taskset command showed the correct CPU affinities. For instance, with my job submitted with --cpus-per-task 4, I get the following:

    [~ @john1] taskset -c -p $$
    pid 116541's current affinity list: 16,18,20,30

Before I added --cpus-per-task, I was submitting with --ntasks 4, and the taskset command always showed one entry, and I could not get Corrfunc to run on multiple threads. (In hindsight, that makes sense -- I was requesting four tasks, each with a single CPU assigned to it.)

Not the fix the original issue needed, but it might solve one class of OpenMP problems.