Idea for NEURON's Internal Data Structures

ctrl-z-9000-times commented 3 years ago

Hello NEURON Developers,

I am responding to your discussion about NEURON's internal data structures [1]. I have some ideas for how they could be improved.

My name is David McDougall. By training and profession I am a computer programmer [2]. However for the past several years my passion has been neuroscience. I have been observing the development of the NEURON simulator with interest, and at the same time developing my own neural simulator [3]. I believe that I get better at writing neural simulators with every try, and so after many attempts I think that I have found some good designs for implementing neural simulators.

NEURON's internal data structures are complicated and contain a large quantity of data. Currently they are implemented as C structures connected by pointers. It is a working, stable, legacy code base, so the improvements that I'm suggesting would require significant modifications.

EDIT: I wrote a more detailed and organized description: https://ctrl-z-9000-times.github.io/homepage/database_design.html

What NEURON needs is a piece of software to manage its data. A database. To make a database first you make a list of all of your data and a list of all of the ways you will interact with the data. Then you design the internal data structures and algorithms. Finally you design an API for using the database. I've done this design work. The remainder of this message is a description of my proposed database design.

Terminology

To begin with, the database must be tailored for running physics simulations. To that end it has special words for dealing with physical things: Archetypes, Entities, and Components. An Entity is a thing which exists in your simulation, such as a neuron, a segment, or the 3D volume inside of a segment. Each Entity has a type which is known as an Archetype, and the Archetype defines what data Components are attached to each Entity. The data Components are where all of the data actually lives. The database will need multiple types of Archetypes, Entities, and Components for representing different and specialized things.

Types of Components and Archetypes

Attributes: Entities can have simple data attributes attached to them. For example, segments have a voltage attribute (a floating point type) and a parent segment attribute (a pointer type). In the terminology of C/C++ programming, attributes correspond to instance variables. However, unlike in C/C++, attributes are stored as structures of arrays (as opposed to arrays of structures).
Global constants: these values are like attributes except that all Entities see the same constant value.
Sparse Matrices: Compressed Sparse Row (CSR) matrices are a way for Entities to have lists of pointers to other Entities. For example, you could have a sparse matrix of "synapse_weights" which gives each presynaptic neuron a list of postsynaptic dendrites and associated weights. Another example is to store the electrical resistivity between adjacent segments using a sparse matrix.
Grids of Entities: an Archetype could define a regular grid to place Entities on and provide tools for efficiently working with grids of Entities. For example, this could be used to implement extracellular diffusion at a coarse granularity.
And more. There are many data structures and algorithms which are common enough that it makes sense to put them in a shared library.
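
To make the component types above concrete, here is a minimal sketch using NumPy and SciPy. It is an illustration only, not code from the prototype: the four-segment layout, the use of -1 as a NULL pointer, the constant AXIAL_RESISTANCE, and the helper neighbors() are all assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical "Segment" archetype stored as a structure of arrays:
# one contiguous array per attribute, indexed by entity.
num_segments = 4
voltage = np.full(num_segments, -70.0)             # floating-point attribute
parent = np.array([-1, 0, 1, 1], dtype=np.int64)   # pointer attribute, -1 marks NULL (the root)

# Global constant: a single value seen by every segment (arbitrary illustrative number).
AXIAL_RESISTANCE = 2.0

# Sparse matrix component: resistance between adjacent segments.
# Each child/parent pair is stored in both directions.
child = np.flatnonzero(parent >= 0)
rows = np.concatenate([child, parent[child]])
cols = np.concatenate([parent[child], child])
vals = np.full(rows.size, AXIAL_RESISTANCE)
resistance = csr_matrix((vals, (rows, cols)), shape=(num_segments, num_segments))

def neighbors(matrix, entity):
    """Return (neighbor indices, values) stored in one entity's CSR row."""
    start, end = matrix.indptr[entity], matrix.indptr[entity + 1]
    return matrix.indices[start:end], matrix.data[start:end]

# Segment 1 is adjacent to its parent (0) and its two children (2 and 3).
idx, r = neighbors(resistance, 1)
```

Because each attribute is its own contiguous array, a kernel that only updates voltages never touches the parent pointers, and the CSR row lookup is two index reads followed by a slice.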

Pointers

The database will need to contain pointers. However, the database will never give the user a raw C pointer! Instead it presents an Entity "Handle" which the database manages and retains access to. Because the database knows the location of every pointer, the database can move Entities and their Components to new memory locations and update all of the pointers. This is critically important for destroying Entities without leaving holes in your otherwise contiguous data arrays.
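
A rough sketch of how handles can keep the arrays contiguous is shown below. It assumes a "move the last entity into the hole" strategy, which is one common way to do this and not necessarily what the prototype does. The class SegmentDatabase and all of its method names are hypothetical.

```python
import numpy as np

class SegmentDatabase:
    """Toy database: handles are stable, array indices are not."""

    def __init__(self):
        self.voltage = np.zeros(0)                  # attribute array
        self.parent = np.zeros(0, dtype=np.int64)   # pointer attribute, -1 means NULL
        self.index_of = {}      # handle -> current array index
        self.handle_at = []     # array index -> handle
        self._next_id = 0

    def create(self, voltage, parent_handle=None):
        handle = self._next_id
        self._next_id += 1
        parent = -1 if parent_handle is None else self.index_of[parent_handle]
        self.index_of[handle] = len(self.handle_at)
        self.handle_at.append(handle)
        self.voltage = np.append(self.voltage, voltage)
        self.parent = np.append(self.parent, parent)
        return handle

    def destroy(self, handle):
        """Remove an entity by moving the last entity into its slot."""
        dead = self.index_of.pop(handle)
        last = len(self.handle_at) - 1
        # Pointers to the destroyed entity become NULL.
        self.parent[self.parent == dead] = -1
        if dead != last:
            # Fill the hole with the last entity, then fix every pointer to it.
            self.voltage[dead] = self.voltage[last]
            self.parent[dead] = self.parent[last]
            self.parent[self.parent == last] = dead
            moved = self.handle_at[last]
            self.handle_at[dead] = moved
            self.index_of[moved] = dead
        self.handle_at.pop()
        self.voltage = self.voltage[:last]
        self.parent = self.parent[:last]

    def get_voltage(self, handle):
        return self.voltage[self.index_of[handle]]

    def get_parent(self, handle):
        index = self.parent[self.index_of[handle]]
        return None if index < 0 else self.handle_at[index]

db = SegmentDatabase()
a = db.create(-70.0)
b = db.create(-65.0, a)
c = db.create(-60.0, b)
d = db.create(-55.0, c)
db.destroy(b)   # d is moved into b's old slot; the handle d is still valid
```

After the call to destroy, the arrays have three entries with no gap, the handle d still resolves to the same data, and c's parent pointer has been set to NULL because its parent no longer exists.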

Graphics Cards

The database could be made to understand GPU memory systems. It can integrate with an existing GPU programming environment (such as CUDA) and facilitate writing GPU kernels for it. My prototype database uses the "cupy" and "numba" libraries to transfer data to/from GPUs and to write GPU kernels, respectively.

Error Checking

You can add error checking to your components and have the database deal with running the checks. For example, you can enforce a valid range of values, or raise an error when the database sees NaN or NULL pointers. These checks can be configured or disabled for each component. And of course you can have a global DEBUG flag to enable or disable error checking.
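
As an illustration of per-component checks, here is a minimal sketch. The Attribute class, its parameters (valid_range, allow_nan, check), and the module-level DEBUG flag are names invented for this example and are not the prototype's API.

```python
import numpy as np

DEBUG = True  # global switch that enables or disables every check

class Attribute:
    """Hypothetical data component that carries its own validity checks."""

    def __init__(self, name, data, valid_range=(None, None), allow_nan=False, check=True):
        self.name = name
        self.data = np.asarray(data, dtype=float)
        self.valid_range = valid_range
        self.allow_nan = allow_nan
        self.check_enabled = check   # per-component switch

    def check(self):
        """Raise ValueError if the data violates this component's constraints."""
        if not (DEBUG and self.check_enabled):
            return
        is_nan = np.isnan(self.data)
        if not self.allow_nan and is_nan.any():
            raise ValueError(f"{self.name}: contains NaN")
        values = self.data[~is_nan]
        low, high = self.valid_range
        if low is not None and (values < low).any():
            raise ValueError(f"{self.name}: value below {low}")
        if high is not None and (values > high).any():
            raise ValueError(f"{self.name}: value above {high}")

voltage = Attribute("voltage", [-70.0, -65.0], valid_range=(-200.0, 200.0))
voltage.check()   # passes silently
```

The database would call check() on every component at whatever points it chooses (for example after each time step), so individual models never need to write their own validation loops.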

Documentation

As the central repository for the data, it makes sense to also store any descriptions of the data in the database too. The database can render nice looking documentation of itself, showing how it is organized. You can write extra documentation for any Archetype or Component and have your explanations show up in the rendered document.

You can specify Units for your data components. The database could understand how units work, and provide tools for converting between units. Another use for units in the database is labeling the axes of data plots with the correct units.
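
A very small sketch of what unit support could look like follows. The tables TO_SI and DIMENSION and the functions convert() and axis_label() are hypothetical; a real implementation would more likely build on an existing units library.

```python
# Hypothetical tables: scale factor from each unit to its SI base unit,
# and the physical dimension each unit measures.
TO_SI = {"V": 1.0, "mV": 1e-3, "uV": 1e-6, "s": 1.0, "ms": 1e-3}
DIMENSION = {"V": "voltage", "mV": "voltage", "uV": "voltage", "s": "time", "ms": "time"}

def convert(values, from_unit, to_unit):
    """Convert a list of numbers between two units of the same dimension."""
    if DIMENSION[from_unit] != DIMENSION[to_unit]:
        raise ValueError(f"cannot convert {from_unit} to {to_unit}")
    scale = TO_SI[from_unit] / TO_SI[to_unit]
    return [v * scale for v in values]

def axis_label(name, unit):
    """Build a plot axis label from a component's name and unit."""
    return f"{name} ({unit})"

volts = convert([-70.0, 30.0], "mV", "V")
label = axis_label("voltage", "mV")
```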

Conclusion

The primary benefit of using a database is that it gives you a way to think about data in the abstract. By putting all of the data into a single place you can see all of the commonalities. You can write software tools which do more things with fewer lines of code and with almost no duplicated code. New features can be enabled and applied to every datum from a central location.

All of these things can be implemented to run very efficiently. I've implemented a prototype which does most of these things with reasonable efficiency. My simulator is not yet complete but the database portion of it is mostly done. I'm planning on continuing my work on my project, but I could be persuaded to help to apply these ideas to the neural simulator of your choice.

If you are interested in discussing these ideas further then maybe we could talk at one of the NEURON developer meetings?

Sincerely, David McDougall

[1] https://drive.google.com/file/d/1nZxxTcLrNxIBj2z6zPGufxlCHaBqndzl/view?usp=sharing Overview of NEURON data structures.

[2] https://ctrl-z-9000-times.github.io/homepage/resume_2021.pdf My resume. I am currently available for employment.

[3] https://github.com/ctrl-z-9000-times/NEUWON Disclaimer, this is a work in progress and it does not have any documentation. I've provided this link as proof of its existence, nothing more.

pramodk commented 3 years ago

@ctrl-z-9000-times: first of all, thank you very much for writing a detailed proposal. Really appreciated!

I only spent ~10 mins going through the above and will need more time to think through various aspects. But I just want to say that we completely agree that redesigning the data structures is quite important for improving NEURON (performance & maintainability). So we are definitely interested in having a conversation in this regard. Will get in touch with you!

cc: @nrnhines @ramcdougal @alexsavulescu @ohm314 @olupton @iomaganaris

ctrl-z-9000-times commented 3 years ago

I've put together a more detailed and organized design document: https://ctrl-z-9000-times.github.io/homepage/database_design.html

nrnhines commented 3 years ago

This is a very timely and interesting proposal given that it incorporates in some ways a main aim I'm toying with with respect to planning the NEURON grant renewal. I.e. https://docs.google.com/document/d/1Ht5zR3XGHUeqj3URDDzGj7Hd_YIMF7DItnTCQMGx5Sg/edit?usp=sharing. Although I haven't had time to digest all the implications or magnitude of the change with respect to "DataBase" I'm keen on entering into discussion.

ctrl-z-9000-times commented 3 years ago

I read through your idea for "TrackingPointers" and the database should be able to do that. In fact, trying to do what TrackingPointers do was part of what made me realize that I need a database in the first place.

alexsavulescu commented 3 years ago

Hi @ctrl-z-9000-times, thank you for taking the time to come up with this proposal. We would definitely need more dedicated time than what we could fit into a NEURON Dev Meeting. How about a first discussion this Friday, where maybe you could add a few slides with a high-level overview/architecture, together with some PROs and CONs of your solution? We can then see about a proper follow-up.

ctrl-z-9000-times commented 3 years ago

That sounds great. I will put together a presentation of the proposal.

I'm available all day on Friday, so I will let you all pick the meeting time.

alexsavulescu commented 3 years ago

Just a few slides to be presented @ NEURON Dev Meeting to begin with. Could you give me your email address so that I can invite you?

ctrl-z-9000-times commented 3 years ago

My email address is: dam1784@rit.edu

alexsavulescu commented 3 years ago

You can find all details here: https://github.com/neuronsimulator/nrn/wiki/July-2021. Please let me know if the slides are inaccessible.

ctrl-z-9000-times commented 3 years ago

I got the zoom meeting link, thanks. I put together a slide show which I will present tomorrow: https://ctrl-z-9000-times.github.io/homepage/db_for_neural_sim.pdf

pramodk commented 3 years ago

Thanks @ctrl-z-9000-times !

Just to mention, regarding SoA: you can find our approach, as used in the CoreNEURON library, in https://www.frontiersin.org/articles/10.3389/fninf.2019.00063/full.

ctrl-z-9000-times commented 3 years ago

During the developers meeting I mentioned an alternative method for solving linear differential equations. Here is my primary reference on the topic:

Rotter, S., Diesmann, M. Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biol Cybern 81, 381–402 (1999). https://doi.org/10.1007/s004220050570

SciPy implements a function to compute the matrix exponential, which this method relies on: scipy.sparse.linalg.expm

Edit: The method is called "exact" so why did I say it is less accurate than what NEURON currently does? The problem is that the system of diff-eq for a neuron's voltage has both linear and non-linear terms. If you split that system of diff-eq into two systems, one for linear equations and another for non-linear, then you can solve each part in isolation and then combine their results. This is less accurate because the equations are no longer solved simultaneously, instead each sees the other as constant for the duration of the time step "dt". The advantage of this method is that I think it will run a lot faster.
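
The splitting described above can be sketched as follows for a system dv/dt = A v + f(v). This is an illustration under stated assumptions, not the prototype's code: the names make_propagator and step, the 2x2 matrix A, the cubic term, and the choice of forward Euler for the non-linear part are all made up for the example.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import expm

def make_propagator(A, dt):
    """Precompute exp(A*dt); it can be reused for as long as A and dt stay fixed."""
    return expm(csc_matrix(A) * dt)

def step(v, propagator, nonlinear, dt):
    """Advance dv/dt = A v + nonlinear(v) by one split time step."""
    v = propagator @ v              # linear part, solved exactly over dt
    return v + dt * nonlinear(v)    # non-linear part, forward Euler (an assumption of this sketch)

# Hypothetical two-compartment system: leak rate 1.0 per compartment, coupling 0.5.
A = np.array([[-1.5, 0.5],
              [0.5, -1.5]])
dt = 0.01
P = make_propagator(A, dt)

def cubic(v):
    """Made-up non-linear term, used only to exercise the split step."""
    return -0.1 * v ** 3

v = np.array([1.0, 0.0])
for _ in range(100):
    v = step(v, P, cubic, dt)
```

With the non-linear term set to zero the result matches the analytic solution to within floating-point error, which is the sense in which the method is "exact". With the non-linear term present, the splitting error discussed above appears and shrinks as dt is reduced.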

neuronsimulator / nrn

Idea for NEURON's Internal Data Structures #1368