neuronsimulator / nrn

NEURON Simulator
http://nrn.readthedocs.io

CoreNEURON data generation (bbcore_write) produces "A thread has no real cells or the first cell has no gid" error #475

Closed pramodk closed 4 years ago

pramodk commented 4 years ago

I think we fixed this issue in the past, but I am seeing it again. When we use more ranks than cells, I currently see:

        136 ParallelContext[0].nrnbbcore_write("coredat")
      136 prun1()
    136 prun()
  136 methodrun(0)
and others
58 /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/libraries/2020-02-01/linux-rhel7-x86_64/intel-19.0.4/neuron-7.8.0b-3lvust5k7q/
x86_64/bin/nrniv: A thread has no real cells or the first cell has no gid
58  in run.hoc near line 20
pramodk commented 4 years ago

I just realised that this is not about empty ranks: if we have a model with artificial cells only, then we can't run. I am trying to run the https://github.com/nrnhines/tqperf model, and even with a single rank I get:

$ srun -n 1 nrniv -c tstop=5 run.hoc -mpi
srun: Warning: can't run 1 processes on 6 nodes, setting nnodes to 1
numprocs=1
NEURON -- VERSION 7.8.0-2-g92a208b+ HEAD (92a208b+) 2019-10-29
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2018
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 ./invlfire.mod
SetupTime: 0.079999924
enter model_destroy: UsedMem 0
leave model_destroy: UsedMem 0
mkmodel_time 4.71
seq = 0
ncell = 4096 ncon = 1000 tstop = 5
compress_bufsize=10 binqueue=0 selfqueue=0 bgpdma=0
Before stdinit FreeMem 0
write coredat files
/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/libraries/2020-02-01/linux-rhel7-x86_64/intel-19.0.4/neuron-7.8.0b-3lvust5k7q/
x86_64/bin/nrniv: A thread has no real cells or the first cell has no gid
 in run.hoc near line 20
 }
  ^
        ParallelContext[0].nrnbbcore_write("coredat")
      prun1()
    prun()
  methodrun(0)
and others
nrnhines commented 4 years ago

The problem is that the system has no way of naming the file (since there is no gid). There is no fundamental issue with simulating cells (artificial or real) without a gid. We need a naming convention that allows this. The nice thing about gids is that they are globally unambiguous. A name that uses the rank and thread would have to make sure it did not conflict with a gid, perhaps by using a character that indicates the name was not derived from the gid of the first cell.
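
The naming idea above could look roughly like the following. This is only a sketch: the helper `group_file_prefix` and the `r<rank>t<thread>` scheme are hypothetical and not part of NEURON. It relies on gid-derived names being purely numeric, so any name that starts with a letter cannot collide with them.

```cpp
#include <string>

// Hypothetical helper, not part of NEURON: choose a file-name prefix for a
// thread's cell group. A non-negative gid is used as-is; otherwise the name
// is built from rank and thread and starts with a letter, so it cannot
// collide with any purely numeric gid-derived name.
std::string group_file_prefix(int first_gid, int rank, int thread) {
    if (first_gid >= 0) {
        return std::to_string(first_gid);
    }
    return "r" + std::to_string(rank) + "t" + std::to_string(thread);
}
```

With this scheme a group whose first cell has gid 8 would still be named `8`, while a group without any gid on rank 2, thread 0 would be named `r2t0`.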

pramodk commented 4 years ago

The problem is that the system has no way of naming the file (since there is no gid). There is no fundamental issue with simulating cells (artificial or real) without a gid.

@nrnhines : Is that also true with the tqperf model? Because we have been using the same model for benchmarking for quite some time without any issue.

nrnhines commented 4 years ago

The fragment in nrnbbcore_write.cpp is:

  // use first real cell gid, if it exists, as the group_id   
  if (corenrn_direct == false) for (int i=0; i < nrn_nthread; ++i) {   
    if (cgs[i].n_real_output && cgs[i].output_gid[0] >= 0) {
      cgs[i].group_id = cgs[i].output_gid[0];
    }else{
      hoc_execerror("A thread has no real cells or the first cell has no gid", NULL);
    }
  }

So this is only an issue with the file based transfer. Where is the tqperf you mentioned?

pramodk commented 4 years ago

Where is the tqperf you mentioned?

this one : https://github.com/nrnhines/tqperf

nrnhines commented 4 years ago

I am puzzled. tqperf/perfrun.hoc has artificial cells with gids. Somehow the change that allowed this has gotten lost. There is code earlier in the file that sets cgs[i].group_id to the gid of an artificial cell, and that also handles a cell (artificial or real) that does not have a gid.

          if (ps) {
            if (ps->output_index_ >= 0) { // has gid
              cgs[i].output_gid[npre] = ps->output_index_;
              if (cgs[i].group_id < 0) {
                cgs[i].group_id = ps->output_index_;
              }
              ++cgs[i].n_output;
            }else{
              cgs[i].output_gid[npre] = agid;
            }

Do you have a record of the version number of NEURON which successfully ran perfrun.hoc?

pramodk commented 4 years ago

Do you have a record of the version number of NEURON which successfully ran perfrun.hoc?

On BB5 we had 7.6.8 which was working fine.

nrnhines commented 4 years ago

The offender that introduced the hoc_execerror is

commit be01511953bf9506526ac88a07897644df642fab
Author: Michael Hines <michael.hines@yale.edu>
Date:   Tue Mar 26 17:00:59 2019 -0400

    ParallelContext.nrnbbcore_write("dir") generates error messages if...
      The model has not been initialized
      A thread does not have a real cell with a gid.
      dir does not exist as a directory and mkdir(dir) fails

The allowance for artificial cells with a gid was introduced in

commit 49f832ff5dfaa163ae9d4cd1c2515fa75650b853
Author: Michael Hines <michael.hines@yale.edu>
Date:   Mon Dec 19 15:58:06 2016 -0500

    pc.nrnbbcore_write group id for generating files must be >= 0.
    If there are no real cells, the first output_gid >= 0 is used.
pramodk commented 4 years ago

The offender that introduced the hoc_execerror is

👍

Edit: @nrnhines : what confuses me is that the files generated with the older version have non-negative gids:

$ ls -lrt coredat/
total 102722
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 8_1.dat
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 9_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 5_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 4_1.dat
-rw-rw----+ 1 kumbhar bbp       4 Apr 11 22:43 byteswap1.dat
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 7_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 1_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 3_1.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 2_1.dat
-rw-rw----+ 1 kumbhar bbp     521 Apr 11 22:43 bbcore_mech.dat
-rw-rw----+ 1 kumbhar bbp 1568973 Apr 11 22:43 6_1.dat
-rw-rw----+ 1 kumbhar bbp     649 Apr 11 22:43 globals.dat
-rw-rw----+ 1 kumbhar bbp 1572809 Apr 11 22:43 0_1.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 8_2.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 9_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 5_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 4_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 1_2.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 7_2.dat
-rw-rw----+ 1 kumbhar bbp 9462895 Apr 11 22:43 6_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 3_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 2_2.dat
-rw-rw----+ 1 kumbhar bbp 9486031 Apr 11 22:43 0_2.dat
-rw-rw----+ 1 kumbhar bbp      27 Apr 11 22:43 files.dat

So why does it cause an error?

nrnhines commented 4 years ago

The test was too strict: it did not take into account that cgs[i].group_id may already have been set to the gid of the first artificial cell that had a ps->output_index_ >= 0. Please test the following change:

$ git diff
diff --git a/src/nrniv/nrnbbcore_write.cpp b/src/nrniv/nrnbbcore_write.cpp
index 14bfc10b..b0f20c6e 100644
--- a/src/nrniv/nrnbbcore_write.cpp
+++ b/src/nrniv/nrnbbcore_write.cpp
@@ -702,6 +702,8 @@ CellGroup* mk_cellgroups() {
   if (corenrn_direct == false) for (int i=0; i < nrn_nthread; ++i) {
     if (cgs[i].n_real_output && cgs[i].output_gid[0] >= 0) {
       cgs[i].group_id = cgs[i].output_gid[0];
+    }else if (cgs[i].group_id >= 0) {
+      // set above to first artificial cell with a ps->output_index >= 0
     }else{
       hoc_execerror("A thread has no real cells or the first cell has no gid", NULL);
     }

I believe that in principle we can relax this further, even if there are no gids, but it may not be important in the context of file mode transfer to CoreNEURON. Probably the execerror should read:

A thread has no real or ARTIFICIAL_CELL with a gid
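
For readers following the logic outside the NEURON source tree, the relaxed check can be modelled in isolation. This is a simplified sketch, not the actual code in nrnbbcore_write.cpp: `CellGroupSketch` and `choose_group_id` are hypothetical names, only the fields that appear in the quoted fragments are modelled, and a C++ exception stands in for `hoc_execerror`.

```cpp
#include <stdexcept>
#include <vector>

// Simplified stand-in for NEURON's CellGroup (hypothetical type); only the
// fields used in the fragments quoted in this thread are modelled.
struct CellGroupSketch {
    int n_real_output = 0;        // number of real cells with output
    std::vector<int> output_gid;  // entries >= 0 are treated as gids
    int group_id = -1;            // may already hold an artificial cell's gid
};

// Returns the group id used for file naming, throwing where the real code
// would call hoc_execerror.
int choose_group_id(const CellGroupSketch& cg) {
    if (cg.n_real_output > 0 && !cg.output_gid.empty() && cg.output_gid[0] >= 0) {
        return cg.output_gid[0];  // first real cell has a gid
    }
    if (cg.group_id >= 0) {
        return cg.group_id;  // set earlier from an artificial cell with a gid
    }
    throw std::runtime_error("A thread has no real or ARTIFICIAL_CELL with a gid");
}
```

A group with only artificial cells, such as those in tqperf, takes the second branch; only a group with no gid at all reaches the error.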
pramodk commented 4 years ago

Please test the following change:

With the above patch, tqperf works! Good to push to master.

nrnhines commented 4 years ago

b07f51e8 has the fix.