open-ephys / next-gen-system

work in progress repository for next generation acquisition and closed-loop feedback system
https://open-ephys.atlassian.net/wiki/display/OEW/PCIe+acquisition+board

"Striding" channels during DMA for better memory locality and easier coding #4

Closed shlevy closed 6 years ago

shlevy commented 8 years ago

Some applications may want to work with blocks of several samples, not just a single sample, as their unit of input. It would be nice if the driver could be configured so that, instead of writing the values of all channels for a single sample into one contiguous region, it wrote each value with a gap the size of the sample block, advancing the initial offset until the full block is written. The application could then read a contiguous region of memory as a single channel, rather than having to jump from one sample to the next.

More concretely, call the value of the nth channel at the mth time offset Vn,m. Then a naive implementation of DMAing a 4-channel input to a circular buffer 3 samples wide might have after 3 time steps:

[ V1,1 | V2,1 | V3,1 | V4,1 | V1,2 | V2,2 | V3,2 | V4,2 | V1,3 | V2,3 | V3,3 | V4,3 ]

And then after another one

[ V1,4 | V2,4 | V3,4 | V4,4 | V1,2 | V2,2 | V3,2 | V4,2 | V1,3 | V2,3 | V3,3 | V4,3 ]

Then an application that wants to read, say, three consecutive samples from a single channel will always have to skip at least three values between reads, which complicates the code (code for handling a single channel needs to know how many total channels there are) and hurts memory locality.

On the other hand, a striding implementation might have after 3 steps:

[ V1,1 | V1,2 | V1,3 | V2,1 | V2,2 | V2,3 | V3,1 | V3,2 | V3,3 | V4,1 | V4,2 | V4,3 ]

And then after another one

[ V1,4 | V1,2 | V1,3 | V2,4 | V2,2 | V2,3 | V3,4 | V3,2 | V3,3 | V4,4 | V4,2 | V4,3 ]

and single-channel processing only needs to care about a single channel and stays within a single contiguous region; if it also ensures it only processes at multiples of the block size, it doesn't need to do any jumping around at all.

jonnew commented 8 years ago

Thanks for this, Shea. As we discussed, this is especially important when using multiple processors to handle data after it is collected, since the division of labor between processors naturally falls into groups of channels.

The current user-land API specification says nothing about the nature of the data produced by a device except its location and block size:

int oiReadStream(oiContext c, int port, int stream, int nbytes, void *data)

The reason for this is that we did not want to make predictions about the various types of data that would be collected with the system. For instance, if people collecting calcium imaging data were to use the system, then the alignment specialization you propose might be a performance hindrance rather than a help.

You mention that alignment should be a configurable option in the driver. How would one go about that? Do you think we can make this configurable from the user-level application at runtime?

shlevy commented 8 years ago

I started answering your question and realized that, from a very generic viewpoint, there are a number of possible parameters relating to how memory is filled by the driver spatially that you could be asking about:

  1. The type of an individual value. Maybe for some systems this is a double (representing voltage, say?) and others it's a boolean, for example. Technically could change per value source
  2. The number of value sources. In the electrophysiology case this is the number of channels. I've been assuming that all sources are sampled at the same time, but technically they could be processed independently
  3. The number of samples that can be written without overwrite. In the simplest case, the system just overwrites the same memory region each sample, but in the general case a circular buffer seems to make sense here
  4. Whether, in the multiple samples-without-overwrite case, the values are written sequentially in time per channel or sequentially in channels per time (this is the striding question)

1 and 2 are fixed in hardware, but it seems like both 3 and 4 are application specific and should be configurable by user-level code. My application might want to process each sample independently, so I set the overwrite number to 1 and don't care about striding. You want to process blocks of samples and have each source be handled contiguously, so you set the overwrite number to (some multiple of) the size of your processing blocks and set up striding.

aacuevas commented 8 years ago

One thing to have in mind is that we aren't specifically targeting multichannel systems with our API, but just raw data streams. The data could be raw channels from an ephys system, frames from a video capture device, multimedia streams or any kind of already processed digital data blocks from a device that does its own data processing. The way the data is sent is, thus, not the responsibility of the driver but of the device/firmware sending the data.

Striding options, if I'm understanding this correctly, would mean the driver mangling the data received from the device. As far as I know, that doesn't play well with the way DMA transfers data (in blocks which in most cases won't match the "block size" of the actual data). Mangling data here is also a bad idea, since we could have multiple readers accessing the same low-level buffer, each with their own requirements. Where this could be done, albeit with a performance loss (a flat memcpy is always faster than juggling the data), is in the oiReadStream function, which copies data from the low-level buffer into a local buffer.

Enter the oiSetStreamAttributes function. This loosely defined function can alter the way the writeStream and readStream calls work. In our idea for the default mode, the DMA just copies data into the memory buffer as soon as it is available; the user then calls oiReadStream with a number of desired bytes and the call returns, without blocking, after reading as many bytes as it could up to the specified number (returning the actual number of bytes read). By changing the attributes we could, for example, make the read call block until the desired number of bytes is available (or a timeout triggers). So we could add some striding options here and let the read function do the transposing work for you, instead of requiring an extra memory operation in user code. In fact, I quite like this approach, since we can make it quite generic and it could improve performance in some cases.

shlevy commented 6 years ago

Closing this to clear up my open issues list, please open a new one if interested in pursuing this.