manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io

script hangs when using unbuffered output #269

Closed · valerio-marra closed 2 years ago

valerio-marra commented 2 years ago


Issue description

I'm running Corrfunc on a simulation snapshot: I'm computing the angular correlation function in thin shells using DDtheta_mocks. I first installed Corrfunc via pip, and then, in order to increase performance, from source:

$ git clone https://github.com/manodeep/Corrfunc/
$ make
$ make install
$ python -m pip install . --user
$ make tests

However, while the pip install would parallelize over multiple threads, the source build now runs mostly on one thread. I'm submitting my SLURM job via:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=480000
#SBATCH --exclusive
[…]
export OMP_NUM_THREADS=48
srun -n 1 python -u $PY_CODE > $LOGS

Expected behavior

To run on 48 threads at ~100%.

Actual behavior

To mostly run on 1 thread. I checked with htop.

What have you tried so far?

To re-install it from source.

Minimal failing example

I'm attaching the log file corrfunc-logs.txt, which includes the output of make tests.

lgarrison commented 2 years ago

Hi @valerio-marra, looking at your log, it seems to build okay (using OpenMP), so maybe it's an affinity issue. Does the behavior change when you specify #SBATCH -c 48? I would have thought that --exclusive would have taken care of that, but you never know... Maybe also try passing -c to srun: srun -c 48 -n 1 python -u $PY_CODE [...].
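For reference, a minimal sketch of the submission script with both suggestions applied (assuming 48 cores per node, to match OMP_NUM_THREADS=48 above; adjust -c if your nodes differ):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -c 48
#SBATCH --mem=480000
#SBATCH --exclusive
export OMP_NUM_THREADS=48
srun -c 48 -n 1 python -u $PY_CODE > $LOGS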

Can you also double-check that the OMP_PROC_BIND and OMP_PLACES bash environment variables are unset? Setting OMP_DISPLAY_ENV=TRUE will print their values when the application starts and can help debug.
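For example, something like this before launching Python (plain shell, nothing Corrfunc-specific):

$ unset OMP_PROC_BIND OMP_PLACES   # make sure no affinity policy is inherited from the environment
$ export OMP_DISPLAY_ENV=TRUE      # OpenMP prints its settings when the program starts
$ srun -n 1 python -u $PY_CODE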

Does the parallelism work locally, and fail through Slurm? Or does it always run single-threaded?

If the parallelism worked with pip installation but not from source, and you're invoking it with the exact same Slurm script, then it might be an OpenMP library issue, e.g. python is linked against one OpenMP and Corrfunc another. Since you're using Anaconda, you can try to build Corrfunc with Anaconda's compilers instead:

$ conda install gcc_linux-64
$ cd Corrfunc/
$ make distclean
$ CC=x86_64-conda_cos6-linux-gnu-gcc make  # or better yet edit CC in common.mk
$ pip install -e ./

The name x86_64-conda_cos6-linux-gnu-gcc might be different on your platform. I think installing the conda compiler package is actually supposed to alias gcc to the conda compiler; you can check.
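A quick way to check which compiler you are actually getting (plain shell commands):

$ which gcc       # should point inside your conda environment if the alias is active
$ gcc --version
$ x86_64-conda_cos6-linux-gnu-gcc --version   # the explicit conda compiler, if that name exists on your platform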

valerio-marra commented 2 years ago

Hi, thanks! I tried what you suggested and nothing worked. Actually, I had assumed the code was running on a single thread (and killed it because it was taking too long), but when I call DDtheta_mocks it just keeps running without producing any output. I reduced the number of particles to 10**4 and it still produces nothing (the same run takes a few seconds on my laptop).

When I first installed Corrfunc via pip, it was working. I tried to re-install it via pip, but it does not work anymore (again, DDtheta_mocks just keeps running without producing any output). I think the problem is that the old system gcc is being picked up instead of the conda one (which I installed as you suggested).

lgarrison commented 2 years ago

It might be running, but just very very slowly because 48 threads are fighting for one core. If you run it with DDtheta_mocks(..., nthreads=1), does it complete? Adding verbose=True ought to give a progress bar.
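Something along these lines (a minimal sketch; the random points and bin edges are made up purely to illustrate the call, your actual RA/DEC arrays and binning will differ):

import numpy as np
from Corrfunc.mocks.DDtheta_mocks import DDtheta_mocks

# Made-up test data: 10**4 points scattered over a patch of sky (degrees)
npts = 10_000
ra = np.random.uniform(0.0, 360.0, npts)
dec = np.random.uniform(-10.0, 10.0, npts)

# Angular bin edges in degrees
theta_bins = np.linspace(0.1, 10.0, 11)

# Single-threaded autocorrelation with progress output enabled
results = DDtheta_mocks(autocorr=1, nthreads=1, binfile=theta_bins,
                        RA1=ra, DEC1=dec, verbose=True)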

I also just realized I got the syntax wrong for the make command, it should be:

$ make CC=x86_64-conda_cos6-linux-gnu-gcc

You may have realized this already if you saw that Corrfunc was still building with gcc instead of with that long compiler name.

manodeep commented 2 years ago

Thanks @lgarrison

For a one-line solution (maybe only in modern enough pip versions?), you can use the --install-option parameter:

python -m pip install --install-option="CC=x86_64-conda_cos6-linux-gnu-gcc" -e . --verbose

valerio-marra commented 2 years ago

Hi @lgarrison, regarding the compilation, it was indeed using the system gcc, but I edited CC in common.mk. I'm attaching the compilation logs; they give this warning:

../common.mk:371: DISABLING AVX-512 SUPPORT DUE TO GNU ASSEMBLER BUG. UPGRADE TO BINUTILS >=2.32 TO FIX THIS.

How can I update binutils?

Regarding verbose=True, I've been using it and it works on my laptop, but when I run the script via Slurm it does not show anything. Again, it seems that DDtheta_mocks just keeps running without doing anything.

Regarding OMP_DISPLAY_ENV=TRUE, I'm attaching the logs:

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '48'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END

Regarding running with nthreads=1, same as before: DDtheta_mocks just keeps running without doing anything.

lgarrison commented 2 years ago

I think I'm running out of ideas, other than to try yet more compilers and/or Python stacks. If your cluster has other compilers (e.g. clang, icc, other versions of gcc) available via modules (module load clang...), that would probably be the easiest way to try. Same with different Python environments (e.g. try a clean conda environment, or a non-conda environment if you have module load python).

Another thought: if you have a cluster where the submission nodes might have a different architecture than the compute nodes, make sure you build on the compute nodes.
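For example (a hedged sketch; the exact srun options for an interactive session depend on your cluster's configuration):

$ srun -N 1 -n 1 --exclusive --pty bash -i   # interactive shell on a compute node
$ cd Corrfunc/
$ make distclean
$ make
$ pip install -e ./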

If you want to confirm that the issue is OpenMP related, you can disable OpenMP support by commenting out OPT += -DUSE_OMP in common.mk.
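That is, in common.mk, something like the following (a sketch of the edit; the exact line in the file may differ), followed by make distclean && make:

# OPT += -DUSE_OMP   # commented out: build Corrfunc without OpenMP for testing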

I wouldn't worry about the binutils bug for now, it's secondary to the code running at all.

@manodeep do you have any ideas?

valerio-marra commented 2 years ago

Thanks, @lgarrison, I'll try that (I already tried using a clean conda environment).

Could it be that the uninstalled pip version is still being called? Otherwise, why would make tests succeed?

One more thing: you said "double-check that the OMP_PROC_BIND and OMP_PLACES bash environment variables are unset", but it seems that OMP_PROC_BIND = 'FALSE'. Is this a problem?

lgarrison commented 2 years ago

Oh, that's true, I didn't read the logs carefully enough! I just assumed the C tests were passing, but it looks like the Python tests are passing too. Maybe the issue is exactly what you suggested, and you're installing in one environment and running in another. Make sure to repeat pip uninstall Corrfunc until no installations remain. Don't run it from inside the Corrfunc source directory. Then reinstall in a fresh environment, and make sure that environment is loaded when you run your Python script. Use print(Corrfunc.__file__) to see which installation is being used. (This is all just general advice for managing Python packages; nothing here is specific to Corrfunc.)
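Concretely, something along these lines (generic package-hygiene steps; /path/to/Corrfunc is a placeholder for your source checkout):

$ pip uninstall Corrfunc          # repeat until pip reports that Corrfunc is not installed
$ cd ~                            # run from outside the Corrfunc source tree
$ pip install /path/to/Corrfunc   # or pip install Corrfunc for the PyPI version
$ python -c "import Corrfunc; print(Corrfunc.__file__)"   # confirm which installation gets imported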

OMP_PROC_BIND = 'FALSE' is fine, that's the same as unset.

valerio-marra commented 2 years ago

@lgarrison, @manodeep I found the problem: if I set verbose=False then it works! I was always using verbose=True. Is this a bug or a compilation issue?

lgarrison commented 2 years ago

Wow, that's pretty unusual! Glad it's working. I see you were running with Python in unbuffered mode with python -u $CODE, does verbose=True work if you remove the -u?

I will note we've seen one other instance of verbose causing problems here: #224, but it still seems to be a rare problem.

valerio-marra commented 2 years ago

@lgarrison if I remove -u it does work, although it updates the log file only infrequently and, actually, it does not print the information that verbose=True usually prints; it is as if I had set verbose=False. Does verbose=True work only in interactive mode?

Should I fix the binutils bug to increase performance? I'll run my code on hundreds of snapshots.

lgarrison commented 2 years ago

Okay, I think I might understand the root cause here. It's probably this issue: https://github.com/minrk/wurlitzer/issues/20

Specifically, we're filling up some buffer (or perhaps even blocking while trying to do an unbuffered write), but the code that drains the buffer (in Wurlitzer) is at the Python level. And that code can't run because we don't release the GIL when we call into Corrfunc.
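To illustrate the mechanism (a hedged sketch using wurlitzer's public pipes() context manager; some_c_extension_call is a hypothetical stand-in for the call into Corrfunc's C code):

from wurlitzer import pipes

# pipes() swaps the process-level stdout/stderr for OS pipes that are drained
# by Python-level code. If the wrapped call writes a lot from C while the GIL
# is never released, the draining code cannot run, the pipe buffer fills up,
# and the C-level write blocks -- which looks exactly like the hang above.
with pipes() as (stdout, stderr):
    some_c_extension_call()   # hypothetical C-extension call that prints heavily

print(stdout.read())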

I'll need to think about the right way to fix this. Releasing the GIL is probably something we ought to be doing anyway, although it will need to be tested. In addition, it's possible that we're not doing the output redirection in the simplest/most robust way.

On binutils/AVX-512, if you want the extra performance (usually a factor of < 2x), your best bet is if you can find another compiler stack to use, like clang, icc, or a more modern gcc. If one is not readily available, you can try to install one from scratch, although at that point it might not be worth your time! If you're feeling brave here are instructions that worked at least once: https://github.com/manodeep/Corrfunc/pull/196#issuecomment-837241751
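To check which binutils version your current toolchain uses, the GNU assembler and linker report it directly:

$ as --version   # prints the GNU Binutils version
$ ld --version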

lgarrison commented 2 years ago

Oops, actually we are releasing the GIL. In which case I'm not exactly sure what's happening. Will investigate...

lgarrison commented 2 years ago

Hi @valerio-marra, can you please check if PR #270 fixes your issue? Just test your same code on the fix-std-redir branch.

valerio-marra commented 2 years ago

Hi @lgarrison, it works! Now the verbose output is printed to the Slurm job's standard error.

Regarding binutils/AVX-512, is it necessary to create a new environment? Also, shouldn't there be a make before pip in #196 (comment)?

lgarrison commented 2 years ago

It probably has the best chance of success in a new environment (and it has the least chance of disrupting any of your other work that uses an existing environment).

Pip runs make behind the scenes, so an explicit make is not necessary.

lgarrison commented 2 years ago

And thanks very much for confirming the fix! @manodeep, when you have a chance could you please review #270? Then we can close this issue.