retina_worker_only_demo running problem

iKaHibi commented 3 years ago

Sorry for opening yet another issue. This time I will leave the issue open until all the problems I hit while running retina_worker_only_demo.py are solved.

After I switched to a Python 3 environment to run the example code, the imports started to fail, for example:

Traceback (most recent call last):
  File "/Aurel-retina-master/examples/retina_worker_only_demo/retina_worker_only_demo.py", line 28, in <module>
    import retina.retina as ret
  File "/Aurel-retina-master/retina/retina.py", line 16, in <module>
    from geometry.opticaxis import opticaxisFactory, RuleHexArrayMap, OpticAxisRule
ModuleNotFoundError: No module named 'geometry'

So I added the required paths by adding the following code

sys.path.append('/Aurel-retina-master/retina')
sys.path.append('/Aurel-retina-master/retina/screen')
sys.path.append('/Aurel-retina-master/retina/screen/map')
sys.path.append('/Aurel-retina-master/retina/screen/transform')
sys.path.append('/Aurel-retina-master/retina/vrf')
sys.path.append('/Aurel-retina-master/retina/vrf/utils')

to the top of retina_worker_only_demo.py.

This did not solve the problem: once the code starts to use neurokernel, the import paths still cause errors.

Starting retina simulation
Manager spawned
Traceback (most recent call last):
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/neurokernel/mpi_backend.py", line 92, in <module>
    target, target_globals, kwargs = parent.recv()
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 306, in mpi4py.MPI.PyMPI_recv_match
  File "mpi4py/MPI/msgpickle.pxi", line 152, in mpi4py.MPI.pickle_load
  File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/dill/_dill.py", line 327, in loads
    return load(file, ignore, **kwds)
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/dill/_dill.py", line 313, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/dill/_dill.py", line 525, in load
    obj = StockUnpickler.load(self)
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/dill/_dill.py", line 515, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/Aurel-retina-master/retina/InputProcessors/RetinaInputProcessor.py", line 7, in <module>
    import retina.classmapper as cls_map
  File "/Aurel-retina-master/retina/classmapper.py", line 1, in <module>
    import screen.screen as scr
ModuleNotFoundError: No module named 'screen'
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

How can I solve this import path problem?

By the way, to get around the error "TypeError: add_node() takes 2 positional arguments but 3 were given", I changed the code in retina.py from

G_workers_nomaster.add_node(
                        neuron.id+'_photon',
                        {'class': 'BufferPhoton',
                        'name': '{}_buf'.format(name)
                    })

to

G_workers_nomaster.add_node(
                        neuron.id+'_photon',
                        **{'class': 'BufferPhoton',
                        'name': '{}_buf'.format(name)
                    })

I hope this will not cause bugs in the future.
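For context, the change above matches how networkx 2.x declares `add_node(node_for_adding, **attr)`: node attributes are accepted only as keyword arguments, so a dict passed as a second positional argument raises this kind of TypeError. The sketch below assumes `G_workers_nomaster` is a networkx graph and uses a stand-in function with the same call shape (not networkx itself); the node id and attribute values are hypothetical.

```python
# Stand-in with the same call shape as networkx 2.x Graph.add_node.
# It is NOT the networkx implementation, only an illustration.
nodes = {}

def add_node(node_for_adding, **attr):
    """Store keyword attributes for a node."""
    nodes.setdefault(node_for_adding, {}).update(attr)

attrs = {'class': 'BufferPhoton', 'name': 'R1_buf'}  # hypothetical values

# Old call style: the dict is passed as a second positional argument.
try:
    add_node('n0_photon', attrs)
    positional_ok = True
except TypeError:
    positional_ok = False

# New call style: the dict is unpacked into keyword arguments.
add_node('n0_photon', **attrs)
```

Unpacking with `**` is also the only way to pass the `'class'` key here, since `class` is a reserved word and cannot be written as a literal keyword argument. The stored attributes are the same dict contents either way, so the edit should not change behavior under that assumption.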

yiyin commented 3 years ago

I believe that you installed the release from 4 years ago. Please install from master.
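On why the `sys.path.append` calls did not help: the second traceback is raised in a process started through `neurokernel/mpi_backend.py`, which is a separate Python interpreter, so edits made to `sys.path` inside the demo script are not part of its state. Below is a hedged sketch of a workaround, assuming the spawned workers inherit the parent's environment variables (this depends on the MPI launcher). The helper name `expose_package_dirs` is hypothetical.

```python
import os
import sys

def expose_package_dirs(dirs, environ=os.environ, path=sys.path):
    """Add dirs to this interpreter's sys.path and to PYTHONPATH.

    sys.path only affects the current process; PYTHONPATH is read by
    any Python interpreter started later with this environment.
    """
    dirs = [os.path.abspath(d) for d in dirs]
    for d in dirs:
        if d not in path:
            path.append(d)
    existing = [p for p in environ.get('PYTHONPATH', '').split(os.pathsep) if p]
    merged = existing + [d for d in dirs if d not in existing]
    environ['PYTHONPATH'] = os.pathsep.join(merged)
    return merged

# Directory taken from the traceback; call this before starting the simulation.
expose_package_dirs(['/Aurel-retina-master/retina'])
```

Installing from master, as suggested, avoids the path edits altogether.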

iKaHibi commented 3 years ago

Thank you for your quick reply! I was indeed using an old version. After installing from master, I got the following error:

Elapsed time for retina simulation: 5.39s
retina0: 100%|██████████| 1/1 [00:00<00:00, 1693.30it/s]
An error occured during execution of LPU retina0 at step 0:
Traceback (most recent call last):
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/tools.py", line 470, in wrapper
    return ctx_dict[cur_ctx][cache_key]
KeyError: <pycuda._driver.Context object at 0x7fe35f234c30>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/neurokernel/tools/misc.py", line 144, in catch_exception
    func(*args, **kwargs)
  File "/neurodriver-master/neurokernel/LPU/LPU.py", line 942, in pre_run
    self.init_variable_memory()
  File "/neurodriver-master/neurokernel/LPU/LPU.py", line 1171, in init_variable_memory
    info=d)
  File "/neurodriver-master/neurokernel/LPU/MemoryManager.py", line 93, in memory_alloc
    CircularArray(size, buffer_length, dtype, init)}
  File "/neurodriver-master/neurokernel/LPU/MemoryManager.py", line 244, in __init__
    (buffer_length, size), dtype)
  File "/neurodriver-master/neurokernel/LPU/utils/parray.py", line 2016, in zeros
    result.fill(0)
  File "/neurodriver-master/neurokernel/LPU/utils/parray.py", line 1528, in fill
    self.dtype, pitch = True)
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/tools.py", line 474, in wrapper
    result = func(*args, **kwargs)
  File "/neurodriver-master/neurokernel/LPU/utils/parray_utils.py", line 29, in get_fill_function
    }, options=["--ptxas-options=-v"]).get_function(name)
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/compiler.py", line 358, in __init__
    include_dirs,
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/compiler.py", line 298, in compile
    return compile_plain(source, options, keep, nvcc, cache_dir, target)
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/compiler.py", line 87, in compile_plain
    checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
  File "/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/compiler.py", line 59, in preprocess_source
    "nvcc preprocessing of %s failed" % source_path, cmdline, stderr=stderr
pycuda.driver.CompileError: nvcc preprocessing of /tmp/tmpfzca4emr.cu failed
[command: nvcc --preprocess --ptxas-options=-v -arch sm_75 -I/anaconda3/envs/nk/lib/python3.7/site-packages/pycuda/cuda /tmp/tmpfzca4emr.cu --compiler-options -P]
[stderr:
b"nvcc fatal   : Value 'sm_75' is not defined for option 'gpu-architecture'\n"]

Does this mean that I made a mistake when installing CUDA or neurodriver?

yiyin commented 3 years ago

What's your CUDA version? Try

nvcc --version

iKaHibi commented 3 years ago

My CUDA version is V9.0.176 and I am using OpenMPI-4.1.0.

yiyin commented 3 years ago

That version of CUDA does not support sm_75. I wonder where that came from. What GPU card do you have?

iKaHibi commented 3 years ago

I have an NVIDIA RTX 2060.

yiyin commented 3 years ago

Yeah, your CUDA version is too low for this card. Try updating to the latest NVIDIA driver and CUDA.
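As a rough illustration of the mismatch: the RTX 2060 has compute capability 7.5, the nvcc command in the log above is invoked with `-arch sm_75`, and nvcc only accepts that value from CUDA 10.0 onward. The sketch below parses the release line printed by `nvcc --version` and compares it against a small, non-exhaustive table of minimum toolkit versions. The helper names and the table are illustrative assumptions, not part of PyCUDA or Neurokernel.

```python
import re

# Minimum CUDA toolkit release whose nvcc accepts each architecture
# (non-exhaustive; listed for illustration only).
MIN_CUDA_FOR_ARCH = {
    'sm_60': (8, 0),
    'sm_70': (9, 0),
    'sm_75': (10, 0),
    'sm_80': (11, 0),
    'sm_86': (11, 1),
}

def parse_nvcc_release(text):
    """Extract (major, minor) from the output of `nvcc --version`."""
    match = re.search(r'release (\d+)\.(\d+)', text)
    if match is None:
        raise ValueError('no release line found in nvcc output')
    return int(match.group(1)), int(match.group(2))

def nvcc_supports(arch, release):
    """Return True if a toolkit of this release knows the architecture."""
    return release >= MIN_CUDA_FOR_ARCH[arch]

# Sample release line for the CUDA version reported in this thread.
sample = 'Cuda compilation tools, release 9.0, V9.0.176'
release = parse_nvcc_release(sample)
print(release, nvcc_supports('sm_75', release))  # prints: (9, 0) False
```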

iKaHibi commented 3 years ago

Thank you for your response! After updating CUDA to 11.3 and reinstalling PyCUDA and scikit-cuda, I hit a new error:

Manager spawned
/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
retina0: Number of PhotoreceptorModel: 4326
retina0: Number of BufferPhoton: 4326
retina0: Number of Port: 8652
retina0: Number of Input: {'photon': 4326}
retina0: 100%|██████████| 1/1 [00:00<00:00, 15.17it/s]
Elapsed time for retina simulation: 7.22s
An error occured during execution of LPU retina0 at step 0:
Traceback (most recent call last):
  File "/home/anaconda3/envs/nk/lib/python3.7/site-packages/neurokernel/tools/misc.py", line 144, in catch_exception
    func(*args, **kwargs)
  File "/home/Code/flybrain/new_src/neurodriver-master/neurokernel/LPU/LPU.py", line 1299, in pre_run
    p._pre_run()
  File "/home/Code/flybrain/new_src/neurodriver-master/neurokernel/LPU/InputProcessors/BaseInputProcessor.py", line 247, in _pre_run
    self.pre_run()
  File "/home/Code/flybrain/src_code/retina-master/retina/InputProcessors/RetinaInputProcessor.py", line 29, in pre_run
    self.generate_receptive_fields()
  File "/home/Code/flybrain/src_code/retina-master/retina/InputProcessors/RetinaInputProcessor.py", line 101, in generate_receptive_fields
    rfs.generate_filters()
  File "/home/Code/flybrain/src_code/retina-master/retina/vrf/vrf.py", line 90, in generate_filters
    (N_filters, self.size), self.dtype)
  File "/home/Code/flybrain/src_code/retina-master/retina/vrf/utils/parray.py", line 270, in __init__
    self.shape[0], np.dtype(dtype).itemsize)
pycuda._driver.MemoryError: cuMemAllocPitch failed: out of memory

Does this mean my GPU does not have enough video memory to run the code?

yiyin commented 3 years ago

The full model requires about 4GB of GPU memory, and the standard memory configuration for the 2060 should be 6GB. If you really have less memory, try running a smaller model. To start with, reduce the number of rings to 0 (see also Figure 4a of Neurokernel RFC #3) by setting

[Retina]
  rings = 0

in your config file. That will run only 1 ommatidium in the center with 6 photoreceptors.

iKaHibi commented 3 years ago

I changed the config to

[Retina]
    rings = 0

and got the following output:

/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
Starting getting configuration
Elapsed time for getting configuration: 0.00s
Starting instantiation of retina
Using input generating function
Elapsed time for instantiation of retina: 0.07s
Starting retina simulation
Manager spawned
/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
retina0: Number of PhotoreceptorModel: 6
retina0: Number of BufferPhoton: 6
retina0: Number of Port: 12
retina0: Number of Input: {'photon': 6}
Compilation of executable circuit completed in 0.4685242176055908 seconds
retina0: 100%|██████████| 1/1 [00:00<00:00,  2.02it/s]
Elapsed time for retina simulation: 2.09s
closing natural_xy file
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 0 on
node akiohibi-Sys exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).

You can avoid this message by specifying -quiet on the mpiexec command line.
--------------------------------------------------------------------------

Surprisingly, after I rebooted my computer and changed the config back to

[Retina]
    rings = 14

I got a different error saying:

/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
Starting getting configuration
Elapsed time for getting configuration: 0.00s
Starting instantiation of retina
Using input generating function
Elapsed time for instantiation of retina: 1.65s
Starting retina simulation
Manager spawned
/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
  warnings.warn('creating CUBLAS context to get version number')
retina0: Number of BufferPhoton: 4326
retina0: Number of Port: 8652
retina0: Number of PhotoreceptorModel: 4326
retina0: Number of Input: {'photon': 4326}
Compilation of executable circuit completed in 2.6839797496795654 seconds
retina0: 100%|██████████| 1/1 [00:00<00:00,  5.84it/s]
An error occured during execution of LPU retina0 at step 0:
Traceback (most recent call last):
  File "/home/anaconda3/envs/nk/lib/python3.7/site-packages/neurokernel/tools/misc.py", line 144, in catch_exception
    func(*args, **kwargs)
  File "/home/Code/flybrain/new_src/neurodriver-master/neurokernel/LPU/LPU.py", line 1550, in run_step
    for p in self.input_processors: p.run_step()
  File "/home/Code/flybrain/new_src/neurodriver-master/neurokernel/LPU/InputProcessors/BaseInputProcessor.py", line 91, in run_step
    self.update_input()
  File "/home/Code/flybrain/src_code/retina-master/retina/InputProcessors/RetinaInputProcessor.py", line 107, in update_input
    inputs = self.rfs.filter_image_use(im).get().reshape((-1))
  File "/home/Code/flybrain/src_code/retina-master/retina/vrf/vrf.py", line 184, in filter_image_use
    handle = la.cublashandle()
  File "/home/Code/flybrain/src_code/retina-master/retina/vrf/utils/linalg.py", line 18, in __init__
    self.create()
  File "/home/Code/flybrain/src_code/retina-master/retina/vrf/utils/linalg.py", line 22, in create
    self.handle = cublas.cublasCreate()
  File "/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py", line 203, in cublasCreate
    cublasCheckStatus(status)
  File "/home/anaconda3/envs/nk/lib/python3.7/site-packages/skcuda/cublas.py", line 179, in cublasCheckStatus
    raise e
skcuda.cublas.cublasNotInitialized

Elapsed time for retina simulation: 7.92s
closing natural_xy file
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Is this caused by a mistake in my scikit-cuda installation?

yiyin commented 3 years ago

I am not sure why cublasCreate failed to initialize, as a handle was already created when getting the version number (that's what these warnings are about) and no error was reported then.

You might want to check your LD_LIBRARY_PATH to make sure that it points to the libcublas.so from the current version of CUDA. You can check the version of cublas with

import skcuda.cublas as cublas
print(cublas._cublas_version)

Make sure that this version is consistent with the libcublas.so.x.x.x in your CUDA path.
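To see which libcublas files LD_LIBRARY_PATH actually exposes, a minimal sketch along these lines may help. It only scans the directories listed in LD_LIBRARY_PATH; the dynamic linker also consults other locations (for example the ldconfig cache), so an empty result does not by itself mean the library is missing. The helper name `find_cublas_libraries` is hypothetical.

```python
import glob
import os

def find_cublas_libraries(ld_library_path):
    """List libcublas.so* files in each directory of an LD_LIBRARY_PATH string."""
    found = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        pattern = os.path.join(directory, 'libcublas.so*')
        found.extend(sorted(glob.glob(pattern)))
    return found

for lib in find_cublas_libraries(os.environ.get('LD_LIBRARY_PATH', '')):
    # realpath resolves symlinks, which usually reveals the full version suffix.
    print(lib, '->', os.path.realpath(lib))
```

If the resolved file names carry a different version from the one printed by `cublas._cublas_version`, an older CUDA directory earlier in LD_LIBRARY_PATH is a plausible cause.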

neurokernel / retina

retina_worker_only_demo running problem #3