manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

Cannot run tests on NERSC interactive node #245

Closed Andrei-EPFL closed 3 years ago

Andrei-EPFL commented 3 years ago

General information

Issue description

a) I can run the tests on the login node of Cori and they finish successfully. However, if I request an interactive node and run them there (python tests.py), they fail. I checked where they fail and found that the error below appears when the catalogs are read. What confuses me even more is that from the Python console I can run this:

from Corrfunc.io import read_catalog
x, y, z = read_catalog()
print(x)

line by line and it works.

b) The same error also occurs in one of my own codes (again, only on the interactive node). There, I generate a new galaxy catalog at each step and then use Corrfunc to compute its 2PCF. For the first galaxy catalog, Corrfunc returns the 2PCF, but at the second step the code cannot even obtain the galaxy catalog because it hits the error below.

I want to mention that I ask for a haswell interactive node.

Actual behavior

Error in `python': break adjusted to free malloc space: 0x0000010000000000 Aborted

Thank you very much! Regards, Andrei

lgarrison commented 3 years ago

Hi Andrei, can you try the steps outlined in #244 to fix this issue? Specifically:

$ pip uninstall corrfunc
$ module unload gsl
$ module swap PrgEnv-intel PrgEnv-gnu
$ module load gsl
$ pip install corrfunc --no-binary :all:  # a recompile must be triggered to fix the error

As best as we can tell, this is a NERSC environment issue, and that's the workaround we have right now.

Andrei-EPFL commented 3 years ago

Hi Lehman,

Thanks for the fast reply! I should have mentioned this earlier: when I installed from source, I switched from PrgEnv-intel to PrgEnv-gnu (and I also changed common.mk: cc=/opt/cray/pe/craype/2.6.2/bin/cc).

As for the steps you suggested: I uninstalled corrfunc and then reinstalled it as you said with:

$ pip install corrfunc --no-binary :all:

It still does not work.

Thanks!

lgarrison commented 3 years ago

Thanks, that's good to know. Can you check whether any Intel GSL is left in your environment? Running printenv | grep gsl should show it.
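The same check can be done from inside Python, which may be convenient when the failing process is a Python session on the interactive node. This is only a minimal sketch: the helper name `env_vars_mentioning` is hypothetical and not part of Corrfunc, and unlike plain `grep` it matches case-insensitively.

```python
import os


def env_vars_mentioning(needle, environ=None):
    """Return {name: value} for variables whose name or value contains `needle`.

    Matching is case-insensitive. `environ` defaults to the environment of
    the current process; passing a dict makes the helper easy to test.
    """
    if environ is None:
        environ = os.environ
    needle = needle.lower()
    return {
        name: value
        for name, value in environ.items()
        if needle in name.lower() or needle in value.lower()
    }


if __name__ == "__main__":
    # Print every variable that mentions GSL, one per line.
    for name, value in sorted(env_vars_mentioning("gsl").items()):
        print(f"{name}={value}")
```

Any path in the output that points into an Intel build of GSL would suggest that the old module is still being picked up.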

If not, can you try to run the reproducer in #244? Basically, download the C code and the setup.py, build the extension as per Manodeep's comment, and try to import it from Python. This triggers the error for me with base NERSC Python and Intel GSL, but goes away with GNU GSL. And it contains no Corrfunc code, which is why I suspect it's an environment issue!

manodeep commented 3 years ago

I should have said this before too: When I installed from the source I changed from PrgEnv-intel to PrgEnv-gnu. (and I have also changed the common.mk: cc=/opt/cray/pe/craype/2.6.2/bin/cc)

A quick tip that hopefully saves you some keystrokes: you can use a custom compiler by running make CC=/your/compiler

Andrei-EPFL commented 3 years ago

Thanks for the suggestions! There is no Intel GSL in my environment when I check with printenv | grep gsl.

When I tried to compile the code it said that it cannot find -lgsl so I added "-L/global/common/sw/cray/cnl7/haswell/gsl/2.5/gcc/8.2.0/sr445ay/lib". I also compiled the code with the flag for intel gsl.

Here is what I observed:

On the login node, irrespective of how I compile the code (Intel or GNU), I can import hello (and also build) without any issue. However, on the interactive node, importing "hello" triggers the error with both gcc and icc (using their respective flags and the matching PrgEnv). Importing "build" does not trigger the error, apparently because it does not use the GSL flags.

I understand now that it isn't a Corrfunc issue; however, I do not understand why it manifests differently on the login node and on the compute node.

And I still do not know how to fix the problem.

Thanks!

lgarrison commented 3 years ago

Thanks for the report, sorry it isn't working. I'm reopening the issue with NERSC; hopefully they can get to the bottom of it. I've added you on the NERSC issue tracker (I think; let me know if you can't see it!).

lgarrison commented 3 years ago

One workaround that has been working for me is to roll your own Python stack (see https://docs.nersc.gov/development/languages/python/nersc-python/#option-4-install-your-own-python), but I realize this requires a fair amount of legwork! Hopefully the core issue can be resolved soon.

Andrei-EPFL commented 3 years ago

Thanks! I am in the issue tracker.

Andrei-EPFL commented 3 years ago

What bugs me is that until the beginning of March this year I could run Corrfunc without any issue. But I wanted to reinstall my environments and do some "cleaning" in my home directory, and then the "malloc" error occurred.

manodeep commented 3 years ago

@Andrei-EPFL Can you use conda on NERSC? If so, you could install gsl through conda-forge with conda install -c conda-forge gsl. Maybe that will solve the issue.

I am also surprised that gsl-config is not found in the PATH when gsl is loaded. Does NERSC not provide the gsl-config utility that comes standard with gsl?

lgarrison commented 3 years ago

@Andrei-EPFL Can you try to module unload craype-hugepages2M as suggested by NERSC? This fixed the issue in the minimal reproducer for me; you may not even have to recompile.
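To confirm from Python whether the module is still loaded in the current session, one option is to inspect the list of loaded modules. The sketch below assumes that the module system on the machine exports a colon-separated `LOADEDMODULES` variable (Environment Modules and Lmod both do); the helper names are hypothetical and not part of Corrfunc.

```python
import os


def loaded_modules(environ=None):
    """Return the loaded modules as a list of strings.

    Assumption: the module system exports LOADEDMODULES as a
    colon-separated string such as "craype-hugepages2M:gsl/2.5".
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("LOADEDMODULES", "")
    return [entry for entry in raw.split(":") if entry]


def module_is_loaded(name, environ=None):
    """True if `name` is loaded, with or without a trailing /version."""
    return any(
        entry == name or entry.startswith(name + "/")
        for entry in loaded_modules(environ)
    )


if __name__ == "__main__":
    if module_is_loaded("craype-hugepages2M"):
        print("craype-hugepages2M is still loaded in this session")
    else:
        print("craype-hugepages2M is not loaded")
```

Since module state is per shell session, the check would need to run in the same session (for example, on the interactive node) in which the failing Python process is started.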

Andrei-EPFL commented 3 years ago

Hi Lehman, Fortunately, it worked!!!! Thank you very much!

Andrei-EPFL commented 3 years ago

@manodeep I did not check whether installing gsl on conda works (I do use conda, btw). But I can run gsl-config after the gsl module is loaded on NERSC.

lgarrison commented 3 years ago

@Andrei-EPFL Great! I think hugepages was the real problem; GSL just brought it in because it was linked against hugepages, but unloading the module removes the environment variables needed to enable it.

@manodeep NERSC says this is a known issue with Cray hugepages and they don't have a solution, other than to unload the module. Unfortunately, this appears to be a default module on NERSC, so most users can be expected to encounter this error. I'm not sure what the best way to communicate the fix is... do we add detection of NERSC + hugepages to the Python wrapper, and print a warning if it's detected? We could hide such a function in utils.py. Or we could say that's too site-specific to include in Corrfunc, but I think I'm in favor of erring on the side of a better user experience here.

manodeep commented 3 years ago

@Andrei-EPFL Glad that your issue is sorted! Thanks @lgarrison :)

There is not intel gsl when I check with printenv | grep gsl.

When I tried to compile the code it said that it cannot find -lgsl so I added "-L/global/common/sw/cray/cnl7/haswell/gsl/2.5/gcc/8.2.0/sr445ay/lib". I also compiled the code with the flag for intel gsl.

The link-time GSL flags are populated by gsl-config; I was assuming that this link-time error occurred because gsl-config was not found. If that's not the case, then I am not sure what scenario caused the link-time error, or whether this (potential) link-time issue merits any further debugging.
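A quick way to see what gsl-config would report in a given environment is to query it directly. This is a standalone sketch, not how Corrfunc's build invokes gsl-config (that is not shown in this thread); the function name `gsl_config_flags` and the injectable `which` parameter are hypothetical conveniences.

```python
import shlex
import shutil
import subprocess


def gsl_config_flags(option="--libs", which=shutil.which):
    """Return the flags printed by `gsl-config <option>` as a list of strings.

    Returns None if gsl-config is not found on PATH, which would explain
    a build that ends up without any -L/-lgsl link flags.
    """
    exe = which("gsl-config")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, option], capture_output=True, text=True, check=True
    )
    return shlex.split(result.stdout)


if __name__ == "__main__":
    flags = gsl_config_flags("--libs")
    if flags is None:
        print("gsl-config not found on PATH; GSL link flags cannot be queried")
    else:
        print("GSL link flags:", " ".join(flags))
```

If the reported `-L` directory differs from the one that had to be added by hand, the build and the shell are probably seeing different GSL installations.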

manodeep commented 3 years ago

@lgarrison I am conflicted, because to me this issue (at the moment) seems to be restricted to a specific computing resource rather than a widely available tool that might be used by any astronomer. Does this bug occur only within Python, and not when using the pair-counters from the command line?

lgarrison commented 3 years ago

I totally agree this issue is probably just restricted to NERSC. I do, however, think NERSC is a popular platform for Corrfunc these days, especially with DESI ramping up. I can try to advertise this workaround within DESI, but on the other hand, adding detection to Corrfunc could be as innocuous as:

if 'NERSC_HOST' in os.environ and os.getenv('HUGETLB_DEFAULT_PAGE_SIZE'):
    warnings.warn('Warning: Cray hugepages has a bug that may crash Corrfunc, you might need to disable it, probably with `module unload craype-hugepages2M`')

The command-line counters seem unaffected. It's probably an issue with Python extensions and hugepages (and the fact that hugepages is being dlopen-d after process initialization by the Python import, if I had to guess).

manodeep commented 3 years ago

Would the idea be to create a custom function check_runtime_environment and add this warning (and any future ones) there, and then have all the Python API calculators call this check_ function? That seems reasonable. The only concern I can think of is: what happens if in the future we run into a system bug that affects the command line as well? In that case, adding this check to the C API would be more future-proof. Should we also add an install-time warning for this?

What do you think about this tweaked error message?

if 'NERSC_HOST' in os.environ and os.getenv('HUGETLB_DEFAULT_PAGE_SIZE'):
    warnings.warn('Warning: Cray hugepages has a bug that may crash Corrfunc. You might be able to fix such a crash with `module unload craype-hugepages2M` (see https://github.com/manodeep/Corrfunc/issues/245 for details)')

lgarrison commented 3 years ago

Yep, I was thinking exactly of something like a check_runtime_environment() function. But I think it would have to live at the Python level; with this bug, just the act of importing any C extension will trigger the error, before we get a chance to call such a C function.

And yes, I think it's good to link the issue in the warning! I'll go ahead and do a PR if we're in agreement.
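A minimal sketch of what such a Python-level check could look like, built from the condition and message proposed in this thread. It is not the code merged for this issue; the `environ` parameter and the boolean return value are additions for illustration and testing.

```python
import os
import warnings


def check_runtime_environment(environ=None):
    """Warn if the environment looks like NERSC with Cray hugepages enabled.

    Returns True if a warning was issued, False otherwise (the return
    value is an addition for this sketch, not something from the thread).
    """
    if environ is None:
        environ = os.environ
    if "NERSC_HOST" in environ and environ.get("HUGETLB_DEFAULT_PAGE_SIZE"):
        warnings.warn(
            "Cray hugepages has a bug that may crash Corrfunc. You might be "
            "able to fix such a crash with `module unload craype-hugepages2M` "
            "(see https://github.com/manodeep/Corrfunc/issues/245 for details)"
        )
        return True
    return False


# The check is pure Python, so it can run before any compiled extension
# is imported, which is the point at which the crash would occur.
check_runtime_environment()
```

Keeping the check free of any C-extension import is the design constraint here: if the warning lived behind the extension, the process would abort before the message could be shown.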

manodeep commented 3 years ago

Yes we are in agreement about adding the warning - you convinced me with your user experience + DESI argument :)

So the bug doesn't trigger on from Corrfunc.theory import DD but does show up on from Corrfunc._countpairs import countpairs as DD_extn?

lgarrison commented 3 years ago

Yes, that's right!
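One way to probe which import triggers the abort without losing the current session is to attempt each import in a child interpreter and look at its exit status. This is a hedged sketch; the helper name `import_survives` is hypothetical, and a False result can also mean that the module is simply not installed.

```python
import subprocess
import sys


def import_survives(module_name, timeout=60):
    """Import `module_name` in a child interpreter; return True on clean exit.

    An abort inside the child (such as the malloc error reported in this
    issue) shows up as a nonzero return code instead of killing the
    calling process. An ordinary ImportError also gives a nonzero code.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Module names taken from the discussion above.
    for name in ("Corrfunc.theory", "Corrfunc._countpairs"):
        print(name, "->", "ok" if import_survives(name) else "failed")
```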

On Wed, Apr 14, 2021 at 4:29 PM Manodeep Sinha @.***> wrote:

Yes we are in agreement about adding the warning - you convinced me with your user experience + DESI argument :)

So the bug doesn't trigger on from Corrfunc.theory import DD but does show up on from Corrfunc._countpairs import countpairs as DD_extn?


-- Lehman Garrison Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

lgarrison commented 3 years ago

Closing this with the warning message implemented in #246.