nicholas-leonard / hypermind

A distributed neural network framework for the brave

Memory management #2

Open nicholas-leonard opened 11 years ago

nicholas-leonard commented 11 years ago

Thrust has a nice pinned_allocator for allocating pinned memory using cudaMallocHost. We can use it to guide our own allocators. Each allocator will be assigned to a neural actor, i.e., a thread, making memory management thread safe. Each actor will be initialized with a big chunk of pinned, pageable, and device memory. When needed, it will augment these with more chunks. We should find an existing memory allocator that works like this.
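A minimal sketch of that per-actor arrangement, assuming the CUDA runtime; all names here are hypothetical, not an existing allocator. Because each actor is a single thread that owns its chunks, no locking is needed:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct ActorMemory {
    std::vector<void*> pinned;    // page-locked host chunks (cudaMallocHost)
    std::vector<void*> pageable;  // ordinary host chunks (malloc)
    std::vector<void*> device;    // device chunks (cudaMalloc)
    std::size_t chunk_bytes;

    explicit ActorMemory(std::size_t bytes) : chunk_bytes(bytes) {
        grow();  // seed the actor with one big chunk of each kind
    }

    // When the actor runs out, augment each pool with another chunk.
    void grow() {
        void* p = nullptr;
        cudaMallocHost(&p, chunk_bytes);
        pinned.push_back(p);
        pageable.push_back(std::malloc(chunk_bytes));
        cudaMalloc(&p, chunk_bytes);
        device.push_back(p);
    }
};
```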

We need a configuration file like PostgreSQL's. I like how they manage memory. In our case, we would specify:
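Presumably something along these lines, in the postgresql.conf style; every parameter name and default below is illustrative only:

```
# hypothetical hypermind.conf -- all names and values are illustrative
pinned_memory_per_actor   = 256MB   # page-locked host chunk (cudaMallocHost)
pageable_memory_per_actor = 1GB     # ordinary host chunk
device_memory_per_actor   = 512MB   # GPU chunk (cudaMalloc)
chunk_growth_bytes        = 64MB    # size of each additional chunk
```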

nicholas-leonard commented 11 years ago

I am thinking of implementing a pinned memory allocator that could be used with std::vector and such. I am also looking for a nice matrix container.
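For instance, a minimal C++11 allocator over cudaMallocHost; a sketch, not Thrust's pinned_allocator:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <new>      // std::bad_alloc
#include <vector>

template <typename T>
struct pinned_allocator {
    using value_type = T;

    pinned_allocator() = default;
    template <typename U>
    pinned_allocator(const pinned_allocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        // cudaMallocHost returns page-locked host memory, which the GPU
        // can transfer to and from asynchronously and faster.
        if (cudaMallocHost(&p, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) { cudaFreeHost(p); }
};

template <typename T, typename U>
bool operator==(const pinned_allocator<T>&, const pinned_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const pinned_allocator<T>&, const pinned_allocator<U>&) { return false; }

int main() {
    // Usage: a std::vector whose backing storage is pinned host memory.
    std::vector<float, pinned_allocator<float>> staging(1 << 20);
    return 0;
}
```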

nicholas-leonard commented 11 years ago

- Alignment:
- Argument for batched allocation:
- Pinned memory limits:

nicholas-leonard commented 11 years ago

CUDA Programming Guide, 3.2.5.5.4, Implicit Synchronization:

> Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
>
> - a page-locked host memory allocation,
> - a device memory allocation,
> - a device memory set,
> - a memory copy between two addresses to the same device memory,
> - any CUDA command to the default stream,
> - a switch between the L1/shared memory configurations described in Compute Capability 2.x.
>
> For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
>
> - can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
> - blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
>
> Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
>
> - all independent operations should be issued before dependent operations,
> - synchronization of any kind should be delayed as long as possible.
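A minimal sketch of those last two guidelines, assuming a trivial kernel: all independent work is issued first, and synchronization is deferred until the results are needed. Note the allocations happen up front, since (per the quote) they would implicitly synchronize if issued between the launches:

```cpp
#include <cuda_runtime.h>

__global__ void scale(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));  // allocate before launching anything:
    cudaMalloc(&b, n * sizeof(float));  // allocations implicitly synchronize
    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);

    // Independent launches are issued back-to-back, before any dependent op.
    scale<<<(n + 255) / 256, 256, 0, sa>>>(a, n);
    scale<<<(n + 255) / 256, 256, 0, sb>>>(b, n);

    // Synchronization is delayed as long as possible.
    cudaStreamSynchronize(sa);
    cudaStreamSynchronize(sb);

    cudaStreamDestroy(sa);
    cudaStreamDestroy(sb);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```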

nicholas-leonard commented 11 years ago

I can't seem to find a C++ pooled memory allocator on the internet that does not store segment metadata in the same memory as the pool. What I am looking for is two pools: one for the metadata, another for the actual memory allocated for data. I believe the easiest remaining course of action is to modify Boost.Pool.
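A rough sketch of the two-pool idea, to make the requirement concrete; names are hypothetical and this is not Boost.Pool. The free list lives in ordinary host memory, never inside the pooled buffer itself:

```cpp
#include <cstddef>
#include <vector>

class OutOfBandPool {
public:
    // buffer may be pinned host memory or even device memory: the pool
    // only does pointer arithmetic on it, never dereferences it.
    OutOfBandPool(char* buffer, std::size_t block_size, std::size_t block_count)
        : buffer_(buffer), block_size_(block_size) {
        free_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i)
            free_.push_back(block_count - 1 - i);
    }

    void* allocate() {
        if (free_.empty()) return nullptr;  // caller grows the pool, e.g. with a new chunk
        std::size_t i = free_.back();
        free_.pop_back();
        return buffer_ + i * block_size_;
    }

    void deallocate(void* p) {
        free_.push_back((static_cast<char*>(p) - buffer_) / block_size_);
    }

private:
    char* buffer_;                   // data pool: pinned or device memory
    std::size_t block_size_;
    std::vector<std::size_t> free_;  // metadata pool: plain host memory
};
```

Keeping the metadata out-of-band matters here because the data pool may be device memory, which the host cannot dereference to chase an embedded free list.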

nicholas-leonard commented 11 years ago

It's too hard to modify.

nicholas-leonard commented 11 years ago

Instead, here is what we could do. Each matrix (layer) will manage its own memory cache. When constructing a sparse matrix, it will be possible to specify the maximum amount of device memory. This could be done by specifying the size (number of elements) of each column and the number of columns to keep on the device; or by specifying the device memory footprint directly; or by passing in a device memory pointer and size (i.e., memory that is already allocated).
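The three options could surface as constructor overloads, something like the following sketch; the class name and signatures are illustrative only:

```cpp
#include <cstddef>

class SparseMatrix {
public:
    // Option 1: cap the device footprint by column shape.
    SparseMatrix(std::size_t elements_per_column, std::size_t device_columns);

    // Option 2: cap the device footprint directly, in bytes.
    explicit SparseMatrix(std::size_t device_bytes);

    // Option 3: adopt device memory the caller has already allocated.
    SparseMatrix(void* device_ptr, std::size_t device_bytes);
};
```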

Data Fill

We can fill the matrix with data using a database pointer (tables, keys, rows, etc.), or we can just pass a vector of vector pointers, etc. I think we should focus on the second approach for now.
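The second path might look like the following sketch; the Matrix interface and fill_row() are hypothetical stand-ins:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical matrix interface; fill_row() is illustrative only.
struct Matrix {
    void fill_row(std::size_t row, const float* values, std::size_t n) {
        std::printf("row %zu: %zu values\n", row, n);  // stub: a real matrix would copy
        (void)values;
    }
};

// Fill from a vector of row-vector pointers.
void fill(Matrix& m, const std::vector<std::vector<float>*>& rows) {
    for (std::size_t i = 0; i < rows.size(); ++i)
        m.fill_row(i, rows[i]->data(), rows[i]->size());
}
```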

Distributed Matrices

The size of a matrix is limited by the maximum amount of resources available on a compute node (a machine). In other words, a matrix cannot be distributed across many machines. This constraint will minimize inter-process communication and the complexity of our application, though we could relax it later.

A matrix can be distributed over many GPUs on the same compute node. Of course, these GPUs would have to be of equal power. And the