Closed Fonotec closed 4 years ago
@pelahi this looks similar to the issue you fixed last week.
@Fonotec is running the latest master or development branch and both show the issue.
It should have been fixed. @Fonotec did you update the submodules? I wasn't too careful with a version update, but I will soon finish updating a branch to better handle arbitrary input fields and arbitrary calculations on those fields. That branch will include a version update of NBodylib and of the minimum version of this library required by VR.
@Fonotec can you give the VR git version and the git version of the NBodylib submodule?
I am running with git version: d825c94931af93131734fd53a70a5d6969c8b350
of VELOCIraptor and the submodule versions are: 9d8619dbe88153f6af3820644379c600b2f2ea66 NBodylib (9d8619d)
and 655b3082c64d3fd9ada6c34097ef0a479299a40c tools (remotes/origin/include-snapshotoffset-25-g655b308)
I think these are also the most recent submodules.
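The versions quoted above can be collected in one step. This is a minimal sketch, assuming it is given the path to the VELOCIraptor source directory; `report_versions` is a hypothetical helper name, not part of VR.

```shell
# Print the commit of the main repository and of each submodule.
# `report_versions` is a hypothetical helper, not part of VELOCIraptor.
report_versions() {
    src="${1:-.}"
    # Commit hash of the main checkout; fail early if this is not a git repo.
    git -C "$src" rev-parse HEAD || return 1
    # One line per submodule: commit hash, path, and a description in parentheses.
    git -C "$src" submodule status
}

# Usage (path is a placeholder):
#   report_versions /path/to/VELOCIraptor
```

In the `git submodule status` output, a leading `-` marks a submodule that has not been initialised and a leading `+` marks one whose checked-out commit differs from the commit recorded in the main repository, which is a quick way to tell whether the submodules were actually updated.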
It does appear to be the case. Odd. Can you tell me which compilers and compilation options you used? Can you also rm -rf * in your build directory, rm -rf NBodylib, and then run git submodule init; git submodule update?
the compiler is gcc/8.1 and the compilation options are cmake -DVR_USE_GAS=ON -DVR_USE_STAR=ON -DVR_USE_BH=ON ..
If I remove the build directory, remove Nbodylib and reinitialise I still get the same bug.
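The clean rebuild described above can be sketched as a small script. It assumes the build directory is called `build` inside the source tree and uses the CMake options quoted above; the `run` and `clean_rebuild` helpers and the `DRY_RUN` switch are hypothetical names introduced here. By default it only prints the commands, since `rm -rf` is destructive.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default in this sketch).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Remove the build directory and the NBodylib checkout, re-fetch the
# submodules, then configure and build with the options used in this thread.
clean_rebuild() {
    src="${1:?usage: clean_rebuild <VELOCIraptor source dir>}"
    run rm -rf "$src/build" "$src/NBodylib"
    run git -C "$src" submodule init
    run git -C "$src" submodule update
    run mkdir -p "$src/build"
    run cmake -S "$src" -B "$src/build" -DVR_USE_GAS=ON -DVR_USE_STAR=ON -DVR_USE_BH=ON
    run cmake --build "$src/build"
}

# Usage (path is a placeholder):
#   clean_rebuild /path/to/VELOCIraptor             # print the commands only
#   DRY_RUN=0 clean_rebuild /path/to/VELOCIraptor   # actually run them
```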
Hi @Fonotec, are you running on cosma? In swift, or separately? What MPI are you running with? That is, how exactly are you running VR? I need that information to understand why my fix does not work in all instances.
This is not running on cosma but on a machine in Leiden. I am running VR standalone using /directiontostf/stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0036_exp -o halos_0036_exp -I 2. I am running with mpich 3.0.4.
Are you running this over MPI? Thought it was just a local run. Also, did you compile VR with or without MPI?
I am not running over MPI, it was compiled with MPI because that was on. But the run itself was run locally with 1 mpi thread and 80 openmp threads.
Hi @Fonotec, can you try running VR with nested parallelism explicitly turned off? export OMP_NESTED=FALSE My guess is that nested parallelism is on by default and is causing the issue. In that case, a simple fix of starting VR with it turned off in the code should work.
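One way to apply this suggestion to a single run, without changing the rest of the shell session, is a small wrapper. This is a sketch only: `run_without_nesting` is a hypothetical name, and setting `OMP_MAX_ACTIVE_LEVELS=1` alongside `OMP_NESTED=FALSE` is an addition made here, since it is the standard OpenMP variable that limits the number of nested active parallel levels.

```shell
# Run the given command with nested OpenMP parallelism disabled for that
# command only. `run_without_nesting` is a hypothetical helper name.
run_without_nesting() {
    env OMP_NESTED=FALSE OMP_MAX_ACTIVE_LEVELS=1 "$@"
}

# Usage, with the command line quoted earlier in this thread:
#   run_without_nesting /directiontostf/stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg \
#       -i eagle_0036_exp -o halos_0036_exp -I 2
```

Using `env` keeps the settings local to the one command, so a later run in the same shell is not silently affected.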
Hi, I tried running it with nested parallelism explicitly turned off, but then I still get the same error.
Hi @Fonotec, can you try something else for me, as I am unable to reproduce your error? You can alter NBodylib so that it does not try building the tree and sorting particles using nested parallelism. At line 185 of NBodylib/src/KDTree/KDTree.h you'll find the constructors for the tree class. You can change the default value of one of the variables, as indicated by the comment below.
KDTree(Particle *p, Int_t numparts,
Int_t bucket_size = 16, int TreeType=TPHYS, int KernType=KEPAN, int KernRes=1000,
int SplittingCriterion=0, int Aniso=0, int ScaleSpace=0,
Double_t *Period=NULL, Double_t **metric=NULL,
bool iBuildInParallel = false, //CHANGED, default is true
bool iKeepInputOrder = false
);
///Creates tree from NBody::System
KDTree(System &s,
Int_t bucket_size = 16, int TreeType=TPHYS, int KernType=KEPAN, int KernRes=1000,
int SplittingCriterion=0, int Aniso=0, int ScaleSpace=0, Double_t **metric=NULL,
bool iBuildInParallel = true,
bool iKeepInputOrder = false
);
This should turn off the nested parallelism by default. If this works, then I know the check I have implemented to turn off nested parallelism isn't working. Still need to figure out why.
@Fonotec besides Pascal's suggestion above, could you run on cosma as well? That should not suffer from the threading problem. If you see your other issue (the original one) there as well, please start a separate bug report with the details of that problem.
Hi @pelahi, let me try what you suggested. On cosma there is no problem apart from the original one; I will open a separate issue for that.
@pelahi I tried your suggestions, but they don't seem to work on the system here. I will ask the local IT department whether the machine itself might be the cause.
I believe there was an error in the nested thread creation which has now been fixed. Can you please confirm? @Fonotec @MatthieuSchaller?
Closing as assumed fixed.
Hi, I run VELOCIraptor using
/directionofvelocirpator/stf -C vrconfig_3dfof_subhalos_SO_hydro.cfg -i eagle_0036_exp -o halos_0036_exp -I 2
The code runs and finds ~43000 halos. It then starts computing the properties of the halos and crashes after this output:
0 Sort particles and compute properties of 43640 objects
libgomp: libgomp: libgomp: Thread creation failed: Resource temporarily unavailableThread creation failed: Resource temporarily unavailable Thread creation failed: Resource temporarily unavailable
[1] 169409 segmentation fault
I didn't expect it to crash like this. Is this a problem on my side or a problem in VELOCIraptor? The complete output is given here: VR_output.txt
The parameter file I used is: vrconfig_3dfof_subhalos_SO_hydro.txt
This was run on a Red Hat Enterprise Linux Server (7.6) using gcc/8.1, hdf5/1.10.3 and mpi/mpich-x86_64.
Let me know if you need any extra information.
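"Resource temporarily unavailable" in the libgomp message above is the system's text for EAGAIN, which thread creation returns when no further thread can be created, for example because a per-user process/thread limit such as the one shown by `ulimit -u` has been reached. Whether that is what happened here is not established in this thread. As a rough sanity check, the sketch below estimates the worst-case thread count under the assumption that every thread at each nesting level opens a full team, and compares it with a limit. Both function names are hypothetical.

```shell
# Worst-case thread count if every thread at each nesting level opens a
# full team of the same size: threads^levels.
nested_thread_estimate() {
    threads="$1"; levels="$2"
    total=1; i=0
    while [ "$i" -lt "$levels" ]; do
        total=$((total * threads))
        i=$((i + 1))
    done
    echo "$total"
}

# Succeeds if `needed` threads fit under `limit`; "unlimited" always fits.
fits_limit() {
    needed="$1"; limit="$2"
    [ "$limit" = "unlimited" ] || [ "$needed" -le "$limit" ]
}

# Usage (bash), for the 80 OpenMP threads mentioned in this thread:
#   fits_limit "$(nested_thread_estimate 80 2)" "$(ulimit -u)" \
#       || echo "two nested levels of 80 threads could exceed the limit"
```

The limit counts all of the user's processes and threads on the machine, not only those of this run, so the comparison is an optimistic lower bound on how close a run is to the limit.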