manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

Running problems when using the DDsmu suite #204

Closed: zgli4github closed this issue 4 years ago

zgli4github commented 4 years ago

Actual behavior

Running `DDsmu' with the parameters


         -------------------------------------
         file1      = mock0001_corrfunc.dat
         format1    = a
         file2      = random20M_corrfunc.dat
         format2    = a
         sbinfile   = sbin.dat
         mu_max     = 1.0
         nmu_bins   = 100
         Nthreads   = 10
         weight_method = pair_product
         weights_file1 = mock0001_w.dat
         weights_format1 = a
         weights_file2 = random20M_w.dat
         weights_format2 = a
         -------------------------------------

ND1 = 919915 [xmin,ymin,zmin] = [-1874.833740,-1708.027466,-111.888260], [xmax,ymax,zmax] = [-61.387409,1622.854370,1739.705444]
ND2 = 20000000 [xmin,ymin,zmin] = [-1877.258423,-1728.227783,-116.989632], [xmax,ymax,zmax] = [-54.324726,1637.053955,1746.708740]
Running with points in [xmin,xmax] = -1877.258423,-54.324726 with periodic wrapping = 1822.933716
Running with points in [ymin,ymax] = -1728.227783,1637.053955 with periodic wrapping = 3365.281738
Running with points in [zmin,zmax] = -116.989632,1746.708740 with periodic wrapping = 1863.698364
In gridlink_float> Running with [nmesh_x, nmesh_y, nmesh_z] = 12,22,6. Time taken = 0.084 sec
In gridlink_float> Running with [nmesh_x, nmesh_y, nmesh_z] = 12,22,6. Time taken = 2.866 sec
Using AVX kernel
0%./run_DDsmu: line 1: 94599 Segmentation fault (core dumped) DDsmu mock0001_corrfunc.dat a random20M_corrfunc.dat a sbin.dat 1.0 100 10 pair_product mock0001_w.dat a random20M_w.dat a > Out.dat

What have you tried so far?

  1. I have tried DD with the same data set. DD works.
  2. I have tried DDsmu with "mock0001_corrfunc.dat" as both file1 and file2, i.e. correlating the data set with itself. It does not work and shows the same error as before.

Minimal failing example

DDsmu mock0001_corrfunc.dat a random20M_corrfunc.dat a sbin.dat 1.0 100 10 pair_product mock0001_w.dat a random20M_w.dat a > Out.dat

lgarrison commented 4 years ago

Thanks for the report! Have you been able to get DDsmu to work at all, e.g. without weights?

We might need your data files to reproduce the crash. Are they small enough/would you mind sharing them?

zgli4github commented 4 years ago

Sure, I am very glad to upload the data. I have tried running DDsmu without weights; it still does not work. Attached: mock0001_small.txt, mock0001_wsmall.txt, random20M_small.txt, random20M_wsmall.txt

zgli4github commented 4 years ago

I closed the issue by mistake. Sorry!

lgarrison commented 4 years ago

Thanks for the files. I can't reproduce the issue, unfortunately. I tried to invoke the code exactly as you did:

$ ~/corrfunc/bin/DDsmu mock0001_small.txt a random20M_small.txt a sbin.txt 1.0 100 10 pair_product mock0001_wsmall.txt a random20M_wsmall.txt a > out.txt
Running `/mnt/home/lgarrison/corrfunc/bin/DDsmu' with the parameters 

         -------------------------------------
         file1      = mock0001_small.txt 
         format1    = a 
         file2      = random20M_small.txt 
         format2    = a 
         sbinfile   = sbin.txt 
         mu_max     = 1.0 
         nmu_bins   = 100 
         Nthreads   = 10 
         weight_method = pair_product 
         weights_file1 = mock0001_wsmall.txt 
         weights_format1 = a 
         weights_file2 = random20M_wsmall.txt 
         weights_format2 = a 
         -------------------------------------
ND1 =        10000 [xmin,ymin,zmin] = [-1355.554688,-1135.312256,-75.033646], [xmax,ymax,zmax] = [-468.245331,1040.746094,926.566895]
ND2 =        10000 [xmin,ymin,zmin] = [-1856.438599,-1690.313354,-99.900909], [xmax,ymax,zmax] = [-80.544281,1541.329346,1714.528564]
Running with points in [xmin,xmax] = -1856.438599,-80.544281 with periodic wrapping = 1775.894287
Running with points in [ymin,ymax] = -1690.313354,1541.329346 with periodic wrapping = 3231.642578
Running with points in [zmin,zmax] = -99.900909,1714.528564 with periodic wrapping = 1814.429443
In gridlink_float> Running with [nmesh_x, nmesh_y, nmesh_z]  = 7,12,3.  Time taken =   0.002 sec
In gridlink_float> Running with [nmesh_x, nmesh_y, nmesh_z]  = 7,12,3.  Time taken =   0.011 sec
Using AVX kernel
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken =  0.023 secs
DDsmu> Done -  ND1=       10000 ND2=       10000. Time taken =   0.09 seconds. read-in time =   0.06 seconds pair-counting time =   0.04 sec

I didn't have your sbin.dat file, so I made up my own. Could you share that too?

Also, do the files you uploaded crash the code? I see they have different names than in your command line invocation.

zgli4github commented 4 years ago

Sure, and thanks for your test. Attached: sbin.txt

zgli4github commented 4 years ago

I don't think so. The files I uploaded are a mini version of my data, so I added "small" to their file names.

lgarrison commented 4 years ago

I think the problem is the large number of bins (3000). Obviously, Corrfunc should not crash, but in the meantime if you need to run the code, try reducing the number of bins to < 100.
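
For reference, Corrfunc's command-line tools read the bin file as one "rmin rmax" pair per line. Below is a minimal C sketch that writes such a file with a reduced number of logarithmically spaced bins; the file name and bin edges are illustrative, not taken from this thread.

    /* Illustrative sketch: write a Corrfunc-style bin file with nbins
     * logarithmically spaced bins between rmin and rmax, one
     * "rlow rhigh" pair per line. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int nbins = 20;                 /* well under the ~100 suggested above */
        const double rmin = 0.1, rmax = 50.0; /* hypothetical bin range */

        FILE *fp = fopen("sbin_small.txt", "w");
        if (fp == NULL) return 1;

        for (int i = 0; i < nbins; i++) {
            const double lo = rmin * pow(rmax / rmin, (double) i / nbins);
            const double hi = rmin * pow(rmax / rmin, (double) (i + 1) / nbins);
            fprintf(fp, "%lf %lf\n", lo, hi);
        }
        fclose(fp);
        return 0;
    }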

manodeep commented 4 years ago

@zgli4github Are you able to run this with valgrind? If so, could you please run the following command in a terminal:

$ valgrind -v --log-file=ddsmu_valgrind.log --leak-check=full --show-reachable=yes --track-origins=yes ./DDsmu mock0001_corrfunc.dat a random20M_corrfunc.dat a sbin.dat 1.0 100 10 pair_product mock0001_w.dat a random20M_w.dat a

and then please attach the generated log file (ddsmu_valgrind.log)?

@lgarrison Did DDsmu crash for you with the 3000 bins? I wonder if this is because of a stack-size limit that is getting exceeded. The histogram alone would require 3000 * 100 * 8 bytes ≈ 2.3 MiB.
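
To illustrate the suspicion (a sketch only, not Corrfunc's actual kernel code; npairs and the array layout are hypothetical), compare a histogram dimensioned by the runtime bin counts on the stack versus on the heap:

    /* Illustrative only. A histogram sized by the runtime bin counts can
     * overflow the default stack, especially once several bin-sized
     * arrays (or one copy per thread) are in play; a heap allocation of
     * the same size is limited only by available memory. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int nsbins = 3000, nmubins = 100;

        /* The dangerous pattern: a variable-length array puts
         * nsbins * nmubins * 8 bytes (~2.3 MiB here) on the stack:
         *     double npairs_stack[nsbins][nmubins];
         */

        /* The safe pattern: allocate the histogram on the heap. */
        double *npairs = calloc((size_t) nsbins * nmubins, sizeof(*npairs));
        if (npairs == NULL) {
            fprintf(stderr, "calloc failed\n");
            return EXIT_FAILURE;
        }
        /* ... accumulate pair counts into npairs[s * nmubins + mu] ... */
        free(npairs);
        return EXIT_SUCCESS;
    }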

lgarrison commented 4 years ago

Yes, it crashed with 3000 bins. It seems very likely it's a stack overflow; we should probably be heap-allocating anything scaled to the number of bins.


manodeep commented 4 years ago

@zgli4github I can also reproduce the crash on my OSX laptop. By trial and error, I see that the crash occurs for nsbins >= 2535. I can confirm that this is due to the default stack size being insufficient for such a large number of bins. On my OSX laptop, I can fix the crash by changing the default stack size with ulimit -s 65532. Since ulimit -s works in kB, that command sets the stack size to ~64 MB (mine was set to 8 MB, which you can check with ulimit -s). On a Linux environment, the relevant command is ulimit -s unlimited, which removes the stack-size limit entirely.

Please report back if that does not fix the crash.
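
For completeness, the same limit can be inspected and raised from C through the POSIX getrlimit/setrlimit interface, the programmatic equivalent of ulimit -s. This is a sketch assuming a POSIX system; note that a raised soft limit mainly benefits threads and child processes started afterwards.

    /* Sketch: query and raise the process stack limit on a POSIX system. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("soft stack limit: %ld bytes\n", (long) rl.rlim_cur);

        /* Raise the soft limit up to the hard limit; only a privileged
         * process may raise the hard limit itself. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }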

zgli4github commented 4 years ago

Thanks very much! It works when using a smaller number of bins. However, when I set ulimit -s unlimited in a Linux environment, I still fail to run DDsmu with 3000 bins.

zgli4github commented 4 years ago

OK, setting ulimit -s 65532 does work with a large number of bins.

manodeep commented 4 years ago

Fantastic! @zgli4github Are you okay with us closing the issue?

@lgarrison Should we consider switching to a different strategy for the bins? Either pre-allocating from the main API, or allocating within the kernel?

lgarrison commented 4 years ago

Yes, I think allocations that are scaled to the number of bins should be heap allocations, not stack allocations. I'm not sure where the best place to make the allocations is; we'll need to audit the code to see where we've used this pattern.
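
A minimal sketch of the pre-allocation option (the function and variable names are hypothetical, not Corrfunc's actual API): the caller heap-allocates the bin-sized histogram once and passes it into the kernel, so no kernel stack frame ever holds a bin-sized array.

    /* Hypothetical sketch of "pre-allocate from the main API";
     * count_pairs and count_pairs_kernel are illustrative names. */
    #include <stdint.h>
    #include <stdlib.h>

    /* The kernel receives a caller-owned buffer instead of declaring a
     * bin-sized array on its own stack. */
    static void count_pairs_kernel(int nsbins, int nmubins,
                                   uint64_t *npairs /* nsbins * nmubins */)
    {
        /* ... accumulate into npairs[s * nmubins + mu] ... */
        (void) nsbins; (void) nmubins; (void) npairs;
    }

    int count_pairs(int nsbins, int nmubins)
    {
        uint64_t *npairs = calloc((size_t) nsbins * nmubins, sizeof(*npairs));
        if (npairs == NULL) return EXIT_FAILURE;

        count_pairs_kernel(nsbins, nmubins, npairs);

        /* ... reduce over threads / write out results ... */
        free(npairs);
        return EXIT_SUCCESS;
    }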


manodeep commented 4 years ago

@zgli4github I am closing this issue for now. Please feel free to re-open if you have further comments.