PercyLau opened this issue 3 years ago
Hi Percy,
I'm not 100% sure that I understand what this API call should look like from your description, but are you looking to only run DTW, not all of DBA? And run it in parallel? The DTW code should be reentrant if you provide the optional cudaStream argument. If you can provide a concrete example of what the tensorflow wrapper should look like, I may be able to help more specifically.
Cheers,
Paul
Hi, Paul
Generally, the problem is rooted in how tensorflow manages GPU/CPU resources. In particular, there are three issues.
1. Some environments only support virtual GPUs (vGPUs) for distributed deployment of tensorflow. However, according to the latest CUDA documentation, unified memory (i.e., the cudaMallocManaged API) is not supported on vGPUs. Is there any workaround in your repo, e.g., replacing cudaMallocManaged with cudaMalloc plus explicit cudaMemcpy calls (see the sketch after this list)?
2. Tensorflow in fact does not recommend that users manually allocate GPU memory for tensors. Instead, it provides an auto-configuration class that dynamically assigns GridDim, ThreadDim, and streams to each GPU op; each time an op is called, the framework determines those values for it. Does OpenDBA strictly rely on particular values of GridDim and ThreadDim?
3. It seems the DTWDistance function requires the cudaMallocPitch API. However, cudaMallocPitch is never used in tensorflow, and I am afraid it may have unknown side effects. Does this matter if I only calculate DTW distances between one input time series and a set of time series?
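For point 1, the kind of substitution I have in mind is roughly the following (a minimal sketch only; the function and buffer names are placeholders, not OpenDBA code):

```cpp
#include <cuda_runtime.h>

// Illustrative only: replace a cudaMallocManaged buffer with explicit device
// memory plus host<->device copies, for environments (e.g. vGPUs) that lack
// unified-memory support.
void copy_in_and_out(float* host_buf, size_t n) {
  float* dev_buf = nullptr;
  cudaMalloc(&dev_buf, n * sizeof(float));  // instead of cudaMallocManaged
  cudaMemcpy(dev_buf, host_buf, n * sizeof(float), cudaMemcpyHostToDevice);
  // ... launch kernels that read/write dev_buf here ...
  cudaMemcpy(host_buf, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dev_buf);
}
```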
The following is a minimal example of a tensorflow wrapper for the DTW GPU op. Some notes are included in the comment lines.
#include "tensorflow/core/util/gpu_kernel_helper.h"
#include "tensorflow/core/util/gpu_launch_config.h"

using GPUDevice = Eigen::GpuDevice;

// Forward declaration of the DTW cost kernel (signature follows OpenDBA's
// DTWDistance kernel arguments).
template <typename T>
__global__ void DTWDistances(
    T* dtwCostSoFar, T* newDtwCostSoFar, const size_t offset_within_second_seq,
    const size_t pathMemPitch, const T* input_first_seq,
    const size_t first_seq_length, const T* gpu_sequences,
    const size_t num_sequences, const size_t second_seq_length,
    ushort* pathMatrix /* ushort* dtw_paths */, T* dtwPairwiseDistances);

// GPU specialization of the functor (primary template declared elsewhere).
template <typename T>
struct DynamicTimeSeriesWarpingFunctor<GPUDevice, T> {
  void operator()(const GPUDevice& d, const T* input_x,
                  const size_t size_input_x, const T* input_Y,
                  const size_t size_rows_input_Y, T* dtw_scores) {
    // Let tensorflow pick the launch geometry for this device.
    size_t total_count = size_input_x * size_rows_input_Y;
    GpuLaunchConfig config = GetGpuLaunchConfig(total_count, d);
    // config.thread_per_block usually equals d.maxGpuThreadsPerBlock();
    // config.block_count usually equals total_count / config.thread_per_block,
    // i.e. size_rows_input_Y. The launch stream is d.stream().
    size_t dtwCostSoFarSize = sizeof(T) * size_input_x * config.block_count;
    // Scratch space for the DTW cost columns carried between kernel launches.
    T* dtwCostSoFar = nullptr;
    T* newDtwCostSoFar = nullptr;
    cudaMallocManaged(&dtwCostSoFar, dtwCostSoFarSize);
    cudaMallocManaged(&newDtwCostSoFar, dtwCostSoFarSize);
    dim3 GridDim(config.block_count, 1, 1);
    dim3 ThreadDim(config.thread_per_block, 1, 1);
    for (size_t offset_within_seq = 0; offset_within_seq < size_input_x;
         offset_within_seq += config.thread_per_block) {
      size_t shared_memory_required = config.thread_per_block * 3 * sizeof(T);
      const size_t pathMemPitch = size_input_x;
      DTWDistances<T><<<GridDim, ThreadDim, shared_memory_required,
                        d.stream()>>>(
          dtwCostSoFar, newDtwCostSoFar, (const size_t)offset_within_seq,
          pathMemPitch, input_x, size_input_x, input_Y, size_rows_input_Y,
          size_input_x,
          (ushort*)0,  // temporarily disable DTW path calculation
          dtw_scores);
    }
    // Ensure all launches on this stream have finished before freeing scratch.
    cudaStreamSynchronize(d.stream());
    cudaFree(dtwCostSoFar);
    cudaFree(newDtwCostSoFar);
  }
};
Best regards, Percy
Hi,
Thanks for the wrapper info.
As Tensorflow by default eats up nearly all the GPU memory, we'd need to figure out some alternative system for allocating from its C++ memory pool, I guess, instead of using cudaMalloc() [device side]? I haven't investigated the Tensorflow memory API, but this would require some digging around. If you can point me to docs on this (how to reserve X amount of the GPU memory that tensorflow grabbed on startup), that would be great.
There should be no problem changing the grid dimensions; in fact we do this for DEBUG mode already, but through a C preprocessor macro (setting threadDim.x to 1 in one of the kernel callers).
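The pattern is roughly the following (a sketch of the idea only; the macro and constant names are illustrative, not the exact ones in the repo):

```cpp
// Illustrative only: pick a degenerate thread dimension in DEBUG builds so a
// single thread's path through the kernel can be traced.
#if defined(DEBUG)
  dim3 threadDim(1, 1, 1);                      // one thread per block
#else
  dim3 threadDim(MAX_THREADS_PER_BLOCK, 1, 1);  // hypothetical constant
#endif
```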
The pitched memory allocation is critical to various internal operations; otherwise you get an "illegal memory access" CUDA error when reading a double from a 2D array that is not aligned to the right word boundary. That being said, we could manually "roll our own" pitched array using a cudaMalloc() allocation. I already do this for one of the Managed Memory allocations (managed memory also does not support pitching).
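Concretely, "rolling our own" pitch would look something like this (a sketch; the 256-byte row alignment is just an example):

```cpp
#include <cuda_runtime.h>

// Illustrative only: emulate cudaMallocPitch with a plain cudaMalloc by
// rounding each row up to a fixed alignment, so every row start stays
// aligned for double-precision loads.
double* alloc_pitched(size_t num_rows, size_t num_cols, size_t* pitch_bytes) {
  size_t width_bytes = num_cols * sizeof(double);
  *pitch_bytes = ((width_bytes + 255) / 256) * 256;  // round row size up to 256 B
  double* mat = nullptr;
  cudaMalloc(&mat, *pitch_bytes * num_rows);
  // Element (row, col) is then addressed as:
  //   *(double*)((char*)mat + row * *pitch_bytes + col * sizeof(double));
  return mat;
}
```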
Hi,
- As Tensorflow by default eats up nearly all the GPU memory, we'd need to figure out some alternative system for allocating from its C++ memory pool, I guess, instead of using cudaMalloc() [device side]? I haven't investigated the Tensorflow memory API, but this would require some digging around. If you can point me to docs on this (how to reserve X amount of the GPU memory that tensorflow grabbed on startup), that would be great.
As for memory management in tensorflow, I think you could refer to this material as a starting point: https://github.com/miglopst/cs263_spring2018/wiki/Memory-management-for-tensorflow. Generally, all tensor creators inherit from an abstract base class, the "Allocator" (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/allocator.h). Then, if you need more detail, you can find the various other allocators by searching for "public: Allocator" in the tensorflow src folder. I think the Googlers' comments are enough to fully understand the whole tensorflow project.
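For scratch buffers inside a GPU op, the usual pattern is to ask the framework for memory from its own pool rather than calling cudaMalloc directly, roughly as in this sketch (assuming an OpKernelContext* is available; the function name is illustrative):

```cpp
#include "tensorflow/core/framework/op_kernel.h"

// Illustrative only: obtain device scratch memory from tensorflow's GPU
// memory pool via allocate_temp, instead of a raw cudaMalloc.
void AllocateScratch(tensorflow::OpKernelContext* ctx, int64_t num_elems) {
  tensorflow::Tensor scratch;
  OP_REQUIRES_OK(ctx, ctx->allocate_temp(tensorflow::DT_FLOAT,
                                         tensorflow::TensorShape({num_elems}),
                                         &scratch));
  // On a GPU op this is a device pointer; pass it to the CUDA kernel in
  // place of a cudaMalloc'd buffer.
  float* dev_ptr = scratch.flat<float>().data();
  (void)dev_ptr;
}
```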
- There should be no problem changing the grid dimensions; in fact we do this for DEBUG mode already, but through a C preprocessor macro (setting threadDim.x to 1 in one of the kernel callers).
That sounds great!
- The pitched memory allocation is critical to various internal operations; otherwise you get an "illegal memory access" CUDA error when reading a double from a 2D array that is not aligned to the right word boundary. That being said, we could manually "roll our own" pitched array using a cudaMalloc() allocation. I already do this for one of the Managed Memory allocations (managed memory also does not support pitching).
I do not 100% understand how to achieve this, but I agree that pitched memory allocation is quite critical. Very glad to see further cooperation.
Best, Percy
Hi, Paul
Excellent implementation of DTW.
I am trying to integrate the DTW sub-module into tensorflow as a GPU operator. To be compatible with the tf APIs, it would be best to have an operator that calculates the DTW costs as well as the DTW path matrices between a sequence x and a list of sequences. However, the implementation of the approximateMedoidIndices function seems to suggest that the CUDA kernel DTWDistance does not support this task. Any advice would be greatly appreciated.
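For concreteness, the op interface I have in mind is roughly the following (a hypothetical sketch; the op name and shapes are illustrative, not existing code in either project):

```cpp
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

// Hypothetical op registration: one query sequence against a batch of
// reference sequences, returning one DTW cost per reference. A second
// output for the path matrices could be added along the same lines.
REGISTER_OP("DynamicTimeSeriesWarping")
    .Attr("T: {float, double}")
    .Input("query: T")       // shape [query_len]
    .Input("references: T")  // shape [num_refs, ref_len]
    .Output("dtw_costs: T")  // shape [num_refs]
    .SetShapeFn([](tensorflow::shape_inference::InferenceContext* c) {
      // dtw_costs has one entry per reference sequence.
      c->set_output(0, c->Vector(c->Dim(c->input(1), 0)));
      return tensorflow::Status::OK();
    });
```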
Best, Percy