pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

VELOCIraptor crashes due to thread creation failed error #64

Closed Fonotec closed 4 years ago

Fonotec commented 4 years ago

Hi, I run VELOCIraptor using /directionofvelocirpator/stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0036_exp -o halos_0036_exp -I 2

The code runs and finds ~43000 halos. It then starts computing the properties of the halos and crashes with this output:

        0 Sort particles and compute properties of 43640 objects
        libgomp: libgomp: libgomp: Thread creation failed: Resource temporarily unavailableThread creation failed: Resource temporarily unavailable
        Thread creation failed: Resource temporarily unavailable
        [1] 169409 segmentation fault

I didn't expect it to crash like this. Is this a problem on my side or a problem in VELOCIraptor?

The complete output is given here: VR_output.txt

The parameter file I used is: vrconfig_3dfof_subhalos_SO_hydro.txt

This was run on a Red Hat Enterprise Linux Server (7.6) using gcc/8.1, hdf5/1.10.3 and mpi/mpich-x86_64.

Let me know if you need any extra information.

MatthieuSchaller commented 4 years ago

@pelahi this looks similar to the issue you fixed last week.

@Fonotec is running the latest master and development branches, and both show the issue.

pelahi commented 4 years ago

It should have been fixed. @Fonotec, did you update the submodules? I wasn't too careful with a version update, but I will soon finish a branch that better handles arbitrary input fields and arbitrary calculations on those fields. That branch will bump the NBodylib version and the minimum NBodylib version required by VR.

MatthieuSchaller commented 4 years ago

@Fonotec can you give the VR git version and the git version of the NBodylib submodule?

Fonotec commented 4 years ago

I am running with git version: d825c94931af93131734fd53a70a5d6969c8b350 of VELOCIraptor and the submodule versions are: 9d8619dbe88153f6af3820644379c600b2f2ea66 NBodylib (9d8619d) and 655b3082c64d3fd9ada6c34097ef0a479299a40c tools (remotes/origin/include-snapshotoffset-25-g655b308)

Fonotec commented 4 years ago

I think these are also the most recent submodules.

pelahi commented 4 years ago

It does appear to be the case. Odd. Can you tell me which compilers and compilation options you used? Can you also run rm -rf * in your build directory, rm -rf NBodylib, and then git submodule init; git submodule update?

Fonotec commented 4 years ago

The compiler is gcc/8.1, and the compilation options are: cmake -DVR_USE_GAS=ON -DVR_USE_STAR=ON -DVR_USE_BH=ON .. After removing the build directory, removing NBodylib and reinitialising the submodules, I still get the same bug.

pelahi commented 4 years ago

Hi @Fonotec, are you running on cosma? Within swift, or separately? Which MPI are you running with? That is, how exactly are you running VR? I need that information to understand why my fix does not work in all instances.

Fonotec commented 4 years ago

This is not running on cosma but on a machine in Leiden. I am running VR standalone using /directiontostf/stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0036_exp -o halos_0036_exp -I 2, with mpich 3.0.4.

MatthieuSchaller commented 4 years ago

Are you running this over MPI? Thought it was just a local run. Also, did you compile VR with or without MPI?

Fonotec commented 4 years ago

I am not running over MPI. It was compiled with MPI because that option was on, but the run itself was local, with 1 MPI process and 80 OpenMP threads.

pelahi commented 4 years ago

Hi @Fonotec, can you try running VR with nested parallelism explicitly turned off, by setting export OMP_NESTED=FALSE before the run? My guess is that nested parallelism is on by default and is causing the issue. If so, a simple fix is to have VR turn it off in the code at startup.
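For reference, turning nested parallelism off from inside a program could look roughly like the sketch below. It is not VELOCIraptor's code: limit_openmp_to_one_level is a hypothetical helper name, and the sketch assumes an OpenMP 3.0 or later runtime (it does nothing when compiled without OpenMP). Capping the number of active levels at 1 is intended to have the same effect as disabling nesting, since any inner parallel region then runs with a team of one thread.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of VELOCIraptor or NBodylib).
// Caps OpenMP at one active parallel level, so a parallel region opened
// from inside another parallel region does not spawn extra threads.
// Returns the resulting cap, or 1 when built without OpenMP.
int limit_openmp_to_one_level() {
#ifdef _OPENMP
    omp_set_max_active_levels(1);
    return omp_get_max_active_levels();
#else
    return 1;
#endif
}

// Prints the cap; meant to be called once, before the first parallel region.
void report_openmp_levels() {
    std::printf("max active OpenMP levels: %d\n", limit_openmp_to_one_level());
}
```

Calling report_openmp_levels() at the very start of the program would apply the cap regardless of what the environment sets. In newer OpenMP versions the OMP_NESTED variable is deprecated in favour of OMP_MAX_ACTIVE_LEVELS, so setting the level cap may be the more durable option.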

Fonotec commented 4 years ago

Hi, I tried running it with nested parallelism explicitly turned off, but then I still get the same error.

pelahi commented 4 years ago

Hi @Fonotec, can you try something else for me, as I am unable to reproduce your error? You can alter NBodylib so that it does not try to build the tree and sort particles using nested parallelism. In NBodylib/src/KDTree/KDTree.h, around line 185, you'll find the constructors for the tree class. You can change the default value of one of the parameters, as indicated by the comment below.

        KDTree(Particle *p, Int_t numparts,
            Int_t bucket_size = 16, int TreeType=TPHYS, int KernType=KEPAN, int KernRes=1000,
            int SplittingCriterion=0, int Aniso=0, int ScaleSpace=0,
            Double_t *Period=NULL, Double_t **metric=NULL,
            bool iBuildInParallel = false, //CHANGED, default is true
            bool iKeepInputOrder = false
        );
        ///Creates tree from NBody::System
        KDTree(System &s,
            Int_t bucket_size = 16, int TreeType=TPHYS, int KernType=KEPAN, int KernRes=1000,
            int SplittingCriterion=0, int Aniso=0, int ScaleSpace=0, Double_t **metric=NULL,
            bool iBuildInParallel = true,
            bool iKeepInputOrder = false
        );

This should turn off the nested parallelism by default. If this works, then I know the check I have implemented to turn off nested parallelism isn't working. Still need to figure out why.
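A check of this kind is often written with omp_in_parallel(). The sketch below is only an illustration under that assumption, not the check actually implemented in NBodylib; should_build_in_parallel is a hypothetical name, and the requested flag stands in for something like iBuildInParallel.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical guard (illustration only, not NBodylib's implementation).
// A parallel build is allowed only when the caller asked for it AND the
// calling thread is not already inside an active parallel region.
// Without OpenMP there are no threads to use, so it always returns false.
bool should_build_in_parallel(bool requested) {
#ifdef _OPENMP
    return requested && !omp_in_parallel();
#else
    (void)requested;
    return false;
#endif
}
```

With a guard like this, a tree built from serial code can still use threads, while a tree built inside an existing parallel loop falls back to a serial build instead of opening a nested team.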

MatthieuSchaller commented 4 years ago

@Fonotec besides Pascal's suggestion above, could you run on cosma as well? That should not suffer from the threading problem. If you see your other issue (the original one) there as well, please start a separate bug report with the details of that problem.

Fonotec commented 4 years ago

Hi @pelahi, let me try what you suggested. On cosma the threading problem does not occur, only the original one; I will open a separate issue for that.

Fonotec commented 4 years ago

@pelahi I tried your suggestion, but it doesn't seem to work on the system here. I will ask the local IT department whether the machine itself might be the cause.
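One machine-level setting that may be worth checking is the per-user process/thread limit, since "Resource temporarily unavailable" is the message for EAGAIN, which thread creation returns when a system limit or resource is exhausted. This is an assumption, not something confirmed in this thread, and other limits (memory for thread stacks, system-wide thread caps) can give the same error. The sketch below compares a worst-case two-level thread count against the soft RLIMIT_NPROC on Linux; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/resource.h>

// Worst case for two nesting levels: each of `outer` threads opens its own
// inner team of `inner` threads.
unsigned long long worst_case_nested_threads(unsigned long long outer,
                                             unsigned long long inner) {
    return outer * inner;
}

// Reads the soft per-user process/thread limit into `limit`.
// Returns false if the limit is unavailable on this platform or unlimited.
bool read_soft_nproc_limit(unsigned long long &limit) {
#ifdef RLIMIT_NPROC
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY) {
        return false;
    }
    limit = static_cast<unsigned long long>(rl.rlim_cur);
    return true;
#else
    (void)limit;
    return false;
#endif
}

// Prints the worst-case thread count next to the soft limit, if there is one.
void report_thread_budget(unsigned long long outer, unsigned long long inner) {
    const unsigned long long wanted = worst_case_nested_threads(outer, inner);
    unsigned long long limit = 0;
    if (read_soft_nproc_limit(limit)) {
        std::printf("worst case %llu threads, soft limit %llu\n", wanted, limit);
    } else {
        std::printf("worst case %llu threads, no finite soft limit found\n", wanted);
    }
}
```

For the run described here, report_thread_budget(80, 80) would report a worst case of 6400 threads. The limit counts every process and thread owned by the user, so other jobs running under the same account reduce the headroom.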

pelahi commented 4 years ago

I believe there was an error in the nested thread creation which has now been fixed. Can you please confirm? @Fonotec @MatthieuSchaller?

pelahi commented 4 years ago

Closing as assumed fixed.