mormj opened 3 years ago
Some quick research shows that FlatBuffers allows for custom memory allocators. So we could allocate a flatbuffer using CUDA unified memory. Since we run in threads, once the memory is on the GPU, any block in the same flowgraph would be able to access it on the GPU (which makes for some really cool GPU processing flows).
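For reference, a minimal sketch of what that could look like using the C++ `flatbuffers::Allocator` interface with `cudaMallocManaged`. `CudaManagedAllocator` is a made-up name, and error handling is elided:

```cpp
#include <cstdint>
#include <flatbuffers/flatbuffers.h>
#include <cuda_runtime.h>

// Hypothetical allocator backing FlatBufferBuilder storage with CUDA
// unified (managed) memory, so the finished buffer is addressable from
// both host code and device kernels in the same process.
class CudaManagedAllocator : public flatbuffers::Allocator {
public:
    uint8_t* allocate(size_t size) override {
        void* p = nullptr;
        // cudaMallocManaged returns a pointer valid on host and device.
        if (cudaMallocManaged(&p, size) != cudaSuccess) {
            return nullptr;  // builder growth would then fail downstream
        }
        return static_cast<uint8_t*>(p);
    }
    void deallocate(uint8_t* p, size_t) override { cudaFree(p); }
};

int main() {
    CudaManagedAllocator alloc;
    // All builder allocations (including growth) go through our allocator.
    flatbuffers::FlatBufferBuilder fbb(/*initial_size=*/1024, &alloc);
    // ... build the message as usual ...
    // fbb.GetBufferPointer() now points into managed memory, so any block
    // in the same flowgraph can hand it to a kernel without a copy.
    return 0;
}
```

One caveat: `FlatBufferBuilder` grows by reallocating, so each growth round-trips through `cudaMallocManaged`; sizing the builder generously up front would avoid that.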
Just note that if we were to transfer the data between processes, we would have to pull the data off the GPU, serialize it, deserialize it on the other side, and then push it back to the GPU. This could lead to confusingly slow processing in cases where you are using multiple flowgraphs connected over zmq.
Can the underlying memory structure be in GPU device memory so that packets of data residing in GPU memory can be passed around like a Thrust vector?