manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License
163 stars 50 forks source link

weird behavior with verbose=True #224

Open kstoreyf opened 4 years ago

kstoreyf commented 4 years ago

General information

Issue description

I have a strange issue running Corrfunc on my cluster, that I can't reproduce on another cluster. I admit it's quite weird but can't figure it out. When I submit a Corrfunc job to my cluster, it sometimes crashes partway and hangs indefinitely. Specifically, it hangs when verbose=True, and when I loop over a high enough number of mocks. The same code works fine on an interactive node.

To be clear: verbose=True, Nmocks=100: hangs verbose=False, Nmocks=100: works verbose=True, Nmocks=10: works

This seems like some sort of memory issue, given the Nmocks trend. I also had a similar issue when messing with the code and building from source, and I found the issue to be some printlines I had added in C that I think were causing a memory leak.

Expected behavior

The script completes successfully.

Actual behavior

The job crashes hangs indefinitely.

What have you tried so far?

Running on an interactive node, which works; running on a different cluster's queue system, which works. Also building the code from source.

Minimal failing example

import Corrfunc

Nmocks = 100
verbose = True #fails when true! works when False.

N = 300
L = 750
r_edges = np.arange(40.0, 210.0, 10.0)
nthreads = 1

for n in range(Nmocks):

        # Generate cubic mock
        x = np.random.rand(N)*float(L)
        y = np.random.rand(N)*float(L)
        z = np.random.rand(N)*float(L)

        res = Corrfunc.theory.xi(L, nthreads, r_edges, x, y, z, verbose=verbose)
manodeep commented 4 years ago

Thanks for the report. Could this be related to #197 ?

lgarrison commented 4 years ago

This is pretty weird! Could you share the job submission script that you're using to run this example? I'm wondering if some kind of stderr/stdout buffer could be filling up (not that I've ever seen that before).

Do you know how many iterations the 100 mock example gets through before it hangs? Does it always hang on the same iteration?

Also, this isn't on the Flatiron cluster by any chance, is it? If it were, I could run the example myself.

kstoreyf commented 4 years ago

i suspected something like a full buffer! yes this example actually hangs at 13 iterations, independent of Nmocks or Nbins.

it's on the sirocco cluster at nyu - actually thanks for pointing out #197 because @zxzhai might be running on the same cluster? zhongxu, would be great if you don't mind testing this on sirocco.

here's the job submission script:

#PBS -N cf_min
#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely 
#PBS -o output/$PBS_JOBNAME.out
#PBS -j oe

cd $PBS_O_WORKDIR
python cf_minimal.py
lgarrison commented 4 years ago

So if you put a Python print statement in the loop, it prints 13 times? Do you get 13 chunks of verbose Corrfunc output too? The Corrfunc output comes from the C level, so it would be good to check that print statements from both the C and Python level are making it through.

I'm wondering if the wurlitzer package we use to capture output from the C extension could be causing trouble somehow?

On Tue, Jun 30, 2020 at 1:26 PM Kate Storey-Fisher notifications@github.com wrote:

i suspected something like a full buffer! yes this example actually hangs at 13 iterations, independent of Nmocks or Nbins.

it's on the sirocco cluster at nyu - actually thanks for pointing out #197 https://github.com/manodeep/Corrfunc/issues/197 because @zxzhai https://github.com/zxzhai might be running on the same cluster? zhongxu, would be great if you don't mind testing this on sirocco.

here's the job submission script:

PBS -N cf_min#PBS -l walltime=40:00:00#PBS -l nodes=1:ppn=1#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely #PBS -o output/$PBS_JOBNAME.out#PBS -j oe

cd $PBS_O_WORKDIR python cf_minimal.py

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/manodeep/Corrfunc/issues/224#issuecomment-651935342, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLA7SY5473BLAWRRND4UJLRZIN5XANCNFSM4OL6VKCQ .

-- Lehman Garrison lgarrison@flatironinstitute.org Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf commented 4 years ago

so the other strange/frustrating thing is that when it hangs, i have to cancel the job, and then the output file is empty (or if i print things about the nodes, it has that, but no printlines from the code itself). i figured out the 13 by appending to file each iteration. maybe i could also redirect the output to a file or something?

when i run on interactive node, i see both python and c-level Corrfunc output.

oh when i was building from source earlier i actually did have some trouble with wurlitzer's syspipes....

lgarrison commented 4 years ago

If it's convenient for you to run Corrfunc from source, could you try a few wurlitzer-related things? Specifically, in utils.py, there's this snippet in sys_pipes(): https://github.com/manodeep/Corrfunc/blob/75b8ac98cc5cc76a5d8aa95715937194cec8ea15/Corrfunc/utils.py#L1044-L1055

Could you print the values of sys.stdout.isatty() and sys.stderr.isatty()?

Could you also try adding raise in the except clause? Maybe there's an exception being masked.

kstoreyf commented 4 years ago

could you clarify the syntax for adding raise? want to make sure i'm doing it correctly.

a) on the interactive node, both sys.stdout.isatty() and sys.stderr.isatty() are true. b) when running Nmocks=10 and verbose=True, both are false. c) when running Nmocks=20 (>13) and verbose=False, both are false. d) when running Nmocks=20 and verbose=True, can't see the output!

for case b), verbose output from the C code is printed - but all before the python output, even though it should be interspersed between the python for the realizations (as happens properly in case a)).

lgarrison commented 4 years ago

Okay, it sounds like the TTY detection is working properly. The interspersed output is likely just due to buffering and is not unusual. Adding flush=True as an argument to print() would probably fix this, but it's probably not related to the real problem.

For the exception, you can just do "raise" as the first line in the "except" block, and Python will figure out to re-raise the exception that was caught by the current block.

On Tue, Jun 30, 2020 at 4:08 PM Kate Storey-Fisher notifications@github.com wrote:

could you clarify the syntax for adding raise? want to make sure i'm doing it correctly.

a) on the interactive node, both sys.stdout.isatty() and sys.stderr.isatty() are true. b) when running Nmocks=10 and verbose=True, both are false. c) when running Nmocks=20 (>13) and verbose=False, both are false. d) when running Nmocks=20 and verbose=True, can't see the output!

for case b), verbose output from the C code is printed - but all before the python output, even though it should be interspersed between the python for the realizations (as happens properly in case a)).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/manodeep/Corrfunc/issues/224#issuecomment-652016286, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLA7S6VEMQE3EXUEJFEEYLRZJA5LANCNFSM4OL6VKCQ .

-- Lehman Garrison lgarrison@flatironinstitute.org Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf commented 4 years ago

ok thanks that's what i was doing, it's not changing anything when i add the "raise" line. in the cases i can see the output, it's not reaching the except block (can see from adding printlines).

and yes i've seen the interspersing before, probably not the issue here.

lgarrison commented 4 years ago

Okay, not seeing a problem yet.

The following is a little paranoid, but easy to check. Could you pip install psutil and place the following in the Wurlitzer context:

p = psutil.Process() print(p.num_fds())

Aside from that, there are a few places in the code where the verbose flag could, in theory, lead to an early exit, like this one: https://github.com/manodeep/Corrfunc/blob/master/theory/xi/countpairs_xi_impl.c.src#L115-L129

Or if we had a bad printf formatting, that could lead to out-of-bounds writes and undefined behavior.

There's also the progress bar stuff. I'm not familiar enough with that part of the code to know whether it's a risk.

On Tue, Jun 30, 2020 at 4:41 PM Kate Storey-Fisher notifications@github.com wrote:

ok thanks that's what i was doing, it's not changing anything when i add the "raise" line. in the cases i can see the output, it's not reaching the except block (can see from adding printlines).

and yes i've seen the interspersing before, probably not the issue here.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/manodeep/Corrfunc/issues/224#issuecomment-652031581, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLA7S7V3FFJRB6OO4UCYHTRZJEZHANCNFSM4OL6VKCQ .

-- Lehman Garrison lgarrison@flatironinstitute.org Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf commented 4 years ago

assuming you mean the wurlitzer context you quoted above. for the submitted job Nmocks=14 verbose=True, output of num_fds() is:

n=0: 14 n=1: 15 ... n=11: 25 n=12: 15 (suspiciously close to when it fails on 13) n=13: 16

if i more mocks with verbose=False, it keeps increasing then jumps back to 16 at n=31, and so on (with slightly diff mins and maxs).

i see this is the number of file descriptors opened, not sure what this means tho!

manodeep commented 4 years ago

Within the C code, everything is printed to stderr, except for the final result that might be written out from a command-line executable. That means nothing is printed out to stdout if you are using the python interface.

I am surprised that num_fds is not constant. Doesn't that mean there is resource leak occurring somewhere?

manodeep commented 4 years ago

The progressbar code should be okay (just looked through the code), but it has string handling - so who knows!

lgarrison commented 4 years ago

I'm not sure about the number of fds. I guess it is a leak, but this is way under the default max number of fds, which I think is 1024. But maybe this cluster has a lower limit? I will have to look up the query syntax...

On Tue, Jun 30, 2020 at 6:33 PM Manodeep Sinha notifications@github.com wrote:

The progressbar code should be okay (just looked through the code), but it has string handling - so who knows!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/manodeep/Corrfunc/issues/224#issuecomment-652081083, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLA7S23FW33MQJAPO2J2CDRZJR4HANCNFSM4OL6VKCQ .

-- Lehman Garrison lgarrison@flatironinstitute.org Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf commented 4 years ago

thanks both! let me know if you have other ideas i can test. for now i can just turn off verbose, but it took me a while to diagnose, so maybe having this here will help someone else!