Closed pramodk closed 4 years ago
I just realised that this is not for empty rank but if we have model with artificial cells only then we can't run. I am trying to run https://github.com/nrnhines/tqperf model and even with single rank I get:
$ srun -n 1 nrniv -c tstop=5 run.hoc -mpi
srun: Warning: can't run 1 processes on 6 nodes, setting nnodes to 1
numprocs=1
NEURON -- VERSION 7.8.0-2-g92a208b+ HEAD (92a208b+) 2019-10-29
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2018
See http://neuron.yale.edu/neuron/credits
Additional mechanisms from files
./invlfire.mod
SetupTime: 0.079999924
enter model_destroy: UsedMem 0
leave model_destroy: UsedMem 0
mkmodel_time 4.71
seq = 0
ncell = 4096 ncon = 1000 tstop = 5
compress_bufsize=10 binqueue=0 selfqueue=0 bgpdma=0
Before stdinit FreeMem 0
write coredat files
/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/libraries/2020-02-01/linux-rhel7-x86_64/intel-19.0.4/neuron-7.8.0b-3lvust5k7q/x86_64/bin/nrniv: A thread has no real cells or the first cell has no gid
in run.hoc near line 20
}
^
ParallelContext[0].nrnbbcore_write("coredat")
prun1()
prun()
methodrun(0)
and others
The problem is that the system has no way of naming the file (since there is no gid). There is no fundamental issue with simulating cells (artificial or real) without a gid. We need a naming convention that allows this. The nice thing about gids is that they are globally unambiguous. A name that uses the rank and thread would have to be guaranteed not to conflict with a gid, perhaps by using a character that indicates the name was not derived from the gid of the first cell.
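To make the naming idea concrete, here is a minimal sketch of such a convention. The helper `group_file_prefix` and the `r<rank>t<thread>` scheme are hypothetical and not part of NEURON; the only assumption taken from this thread is that file names are otherwise derived from a non-negative gid. Whatever reads the files would presumably also have to accept the non-numeric prefix.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not NEURON API: builds the file-name prefix for the
// data files written by one thread.
//  - If the thread's first cell has a gid (>= 0), the prefix is that gid.
//  - Otherwise the prefix is derived from rank and thread and starts with
//    'r', so it can never collide with a purely numeric gid prefix.
std::string group_file_prefix(int first_gid, int rank, int thread) {
    assert(rank >= 0 && thread >= 0);
    if (first_gid >= 0) {
        return std::to_string(first_gid);
    }
    return "r" + std::to_string(rank) + "t" + std::to_string(thread);
}
```

For example, `group_file_prefix(8, 0, 0)` yields `8`, while a thread without any gid on rank 3, thread 1 yields `r3t1`.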
@nrnhines : Is that also true of the tqperf model? We have been using that model for benchmarking for quite some time without any issue.
The fragment in nrnbbcore_write.cpp is:
// use first real cell gid, if it exists, as the group_id
if (corenrn_direct == false) for (int i=0; i < nrn_nthread; ++i) {
    if (cgs[i].n_real_output && cgs[i].output_gid[0] >= 0) {
        cgs[i].group_id = cgs[i].output_gid[0];
    }else{
        hoc_execerror("A thread has no real cells or the first cell has no gid", NULL);
    }
}
So this is only an issue with file-based transfer. Where is the tqperf you mentioned?
this one : https://github.com/nrnhines/tqperf
I am puzzled. tqperf/perfrun.hoc has artificial cells with gids. Somehow the change that allowed this has gotten lost. Code earlier in the file sets cgs[i].group_id to the gid of the first cell that has one, including an artificial cell, and also handles a cell (artificial or real) that does not have a gid:
if (ps) {
    if (ps->output_index_ >= 0) { // has gid
        cgs[i].output_gid[npre] = ps->output_index_;
        if (cgs[i].group_id < 0) {
            cgs[i].group_id = ps->output_index_;
        }
        ++cgs[i].n_output;
    }else{
        cgs[i].output_gid[npre] = agid;
    }
Do you have a record of the version number of NEURON which successfully ran perfrun.hoc?
On BB5 we had 7.6.8 which was working fine.
The offender that introduced the hoc_execerror is
commit be01511953bf9506526ac88a07897644df642fab
Author: Michael Hines <michael.hines@yale.edu>
Date: Tue Mar 26 17:00:59 2019 -0400
ParallelContext.nrnbbcore_write("dir") generates error messages if...
The model has not been initialized
A thread does not have a real cell with a gid.
dir does not exist as a directory and mkdir(dir) fails
The allowance for artificial cells with a gid was introduced in
commit 49f832ff5dfaa163ae9d4cd1c2515fa75650b853
Author: Michael Hines <michael.hines@yale.edu>
Date: Mon Dec 19 15:58:06 2016 -0500
pc.nrnbbcore_write group id for generating files must be >= 0.
If there are no real cells, the first output_gid >= 0 is used.
The offender that introduced the hoc_execerror is
👍
Edit: @nrnhines: what confuses me is that the files generated with the older version are named with non-negative gids:
$ ls -lrt coredat/
total 102722
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 8_1.dat
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 9_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 5_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 4_1.dat
-rw-rw----+ 1 kumbhar bbp 4 Apr 11 22:43 byteswap1.dat
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 7_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 1_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 3_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 2_1.dat
-rw-rw----+ 1 kumbhar bbp 521 Apr 11 22:43 bbcore_mech.dat
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 6_1.dat
-rw-rw----+ 1 kumbhar bbp 649 Apr 11 22:43 globals.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 0_1.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 8_2.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 9_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 5_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 4_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 1_2.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 7_2.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 6_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 3_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 2_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 0_2.dat
-rw-rw----+ 1 kumbhar bbp 27 Apr 11 22:43 files.dat
So why does it cause an error?
The test was too strict: it did not take into account that cgs[i].group_id may already have been set to the gid of the first artificial cell with ps->output_index_ >= 0. Please test the following change:
$ git diff
diff --git a/src/nrniv/nrnbbcore_write.cpp b/src/nrniv/nrnbbcore_write.cpp
index 14bfc10b..b0f20c6e 100644
--- a/src/nrniv/nrnbbcore_write.cpp
+++ b/src/nrniv/nrnbbcore_write.cpp
@@ -702,6 +702,8 @@ CellGroup* mk_cellgroups() {
   if (corenrn_direct == false) for (int i=0; i < nrn_nthread; ++i) {
     if (cgs[i].n_real_output && cgs[i].output_gid[0] >= 0) {
       cgs[i].group_id = cgs[i].output_gid[0];
+    }else if (cgs[i].group_id >= 0) {
+      // set above to first artificial cell with a ps->output_index >= 0
     }else{
       hoc_execerror("A thread has no real cells or the first cell has no gid", NULL);
     }
I believe that in principle we can relax this further, even when there are no gids at all, but that may not be important in the context of file-mode transfer to CoreNEURON. The execerror message should probably read:
A thread has no real or ARTIFICIAL_CELL with a gid
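The resulting selection order can be sketched in isolation. This is a simplified illustration only: `CellGroupSketch` and `choose_group_id` are made-up names that model just the fields appearing in the fragments in this thread, and a C++ exception stands in for `hoc_execerror`.

```cpp
#include <stdexcept>
#include <vector>

// Simplified stand-in for NEURON's CellGroup (hypothetical; only the fields
// that appear in the fragments quoted in this thread are modelled).
struct CellGroupSketch {
    int n_real_output = 0;        // number of real cells producing output
    std::vector<int> output_gid;  // gid per output; negative means "no gid"
    int group_id = -1;            // may already hold the first gid seen earlier
};

// Picks the id used to name a thread's files. Throws where the real code
// would call hoc_execerror.
int choose_group_id(const CellGroupSketch& cg) {
    // 1. The gid of the first real cell takes precedence.
    if (cg.n_real_output > 0 && !cg.output_gid.empty() && cg.output_gid[0] >= 0) {
        return cg.output_gid[0];
    }
    // 2. Otherwise keep a group_id set earlier, e.g. from the first
    //    artificial cell that has a gid.
    if (cg.group_id >= 0) {
        return cg.group_id;
    }
    // 3. No cell on this thread has a gid: nothing to name the files after.
    throw std::runtime_error("A thread has no real or ARTIFICIAL_CELL with a gid");
}
```

A thread holding only artificial cells with gids reaches the second branch, which is the case the original check rejected.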
With the above patch, tqperf works! Good to push to master.
b07f51e8 has the fix.
I think we fixed this issue in the past but I see this again. When we use more ranks than cells, currently I see :