kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Add stdc++ smart pointers for device memory #4542

Open kkm000 opened 3 years ago

kkm000 commented 3 years ago

I want to:

  1. Unsee this:
    StateId* d_arc_nextstates_;
    d_arc_nextstates_ = static_cast<StateId *>(
      CuDevice::Instantiate().Malloc(arc_count_ * sizeof(*d_arc_nextstates_)));
  2. Never spend another day hunting for a missing call to this:
    KALDI_ASSERT(d_arc_nextstates_ &&
               "Please call CudaFst::Initialize() before calling Finalize()");
    CuDevice::Instantiate().Free(d_arc_nextstates_);

Instead, I want simply this:

std::vector<StateId, CuDevice::Allocator> d_arc_nextstates_;

@hugovbraun, @danpovey, are you with me? Do you see any downsides? The guarantee of memory contiguity has been in the std::vector spec since C++03, and in any case I know of no implementation that would chunk vectors internally for any C++ language version.

danpovey commented 3 years ago

We have something, not often used, called Array, which does something a little like that. I'm pretty sure it's a bad idea to attempt to overload standard library things for GPU use, because they have so much going on, and their contents must be assumed to live on the CPU.

kkm000 commented 3 years ago

@danpovey, it has nothing to do with "overloading the standard library." This is the interface designed exactly for this use case: the std::vector template has a second parameter, which defaults to the standard allocator. Other allocators are available, for example a thread-aware allocator in TBB. Ours is a perfect use case for an allocator.
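
For concreteness, a minimal sketch of what such an allocator could look like (hypothetical: CuDevice::Allocator does not exist today, and this only routes allocation and deallocation through CuDevice; it does nothing to make the elements usable from host code):

// Hypothetical sketch, not existing Kaldi code.
#include <cstddef>
#include <new>
#include "cudamatrix/cu-device.h"

template <typename T>
struct CuDeviceAllocator {
  typedef T value_type;
  T *allocate(std::size_t n) {
    void *p = kaldi::CuDevice::Instantiate().Malloc(n * sizeof(T));
    if (p == NULL) throw std::bad_alloc();
    return static_cast<T *>(p);
  }
  void deallocate(T *p, std::size_t /*n*/) {
    kaldi::CuDevice::Instantiate().Free(p);
  }
};

// All instances are interchangeable (stateless allocator).
template <typename T, typename U>
bool operator==(const CuDeviceAllocator<T> &, const CuDeviceAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const CuDeviceAllocator<T> &, const CuDeviceAllocator<U> &) { return false; }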

danpovey commented 3 years ago

But std::vector has many functions that would break if you were to do that, like back(), operator[], and so on. The majority of them would crash.

kkm000 commented 3 years ago

You mean an attempt to dereference the pointers. Not back() per se, rather what you do with it. Sure. But that's the case with raw cudaMalloc'd pointers, too. I'm not saying that push_back() would work. I just want to use std::vector as a container that guarantees alignment, deallocation on destruction, and no mixing of host/device allocations, as we just did. Not a fully functional vector, of course; we have CuVector for that. Too bad stdc++ does not define a more blob-like container; that would be ideal.

BTW, I think that carefully crafted definitions of the allocator's pointer and reference types could render those methods unusable at compile time.

I'll look at the Array you mentioned; it may be a better choice. But it should allow specialization on something allocator-like that ultimately calls cudaMalloc and cudaFree (or rather our CuDevice methods).

The same goes for unique_ptr (it has a second template parameter): it could be used for ownership and non-leakage of GRAM, too. There are subtleties involving arrays of unknown size, but worst of all, it won't save you from mixing device and host allocations. It can be taught to delete the object specially (I use unique_ptr to close file handles, for example), but there is no type safety. If only CUDA defined a C++ pointer type incompatible with the host memory pointer!
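
For illustration, a sketch of such a unique_ptr with a device deleter (the names are made up here, and this is not the actual Kaldi code; the Malloc()/Free() calls are the CuDevice ones from the example above):

#include <cstddef>
#include <memory>
#include "cudamatrix/cu-device.h"

// Hypothetical deleter: returns the memory to CuDevice, so a device
// allocation cannot leak and the pointer's type documents where it lives.
struct CuDeviceFree {
  void operator()(void *p) const {
    if (p != NULL) kaldi::CuDevice::Instantiate().Free(p);
  }
};

template <typename T>
using unique_device_ptr = std::unique_ptr<T, CuDeviceFree>;

void Example(std::size_t arc_count) {
  typedef int StateId;  // stand-in for the real StateId type
  unique_device_ptr<StateId[]> d_arc_nextstates(static_cast<StateId *>(
      kaldi::CuDevice::Instantiate().Malloc(arc_count * sizeof(StateId))));
  // ... hand d_arc_nextstates.get() to kernels ...
}  // device memory freed here via CuDevice::Free()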

The GPU leaks tubs and kegs of GRAM when a BatchedThreadedNnet3CudaOnlinePipeline instance is destroyed, 1 GB+ per instance. This is why I'm looking for safe containers for device memory. The raw CUDA memory API is so last century!

I found the source (a CudaFst is not deallocated; it has a Finalize() method for that, which is never called in the new pipelines), but I am running into other funny issues with it. Create it (it does 6 cudaMallocs), give it to a CudaDecoder, destroy the CudaDecoder, and all 6 pointers return cudaError_t 1, "invalid argument", on an attempt to free them. Create and immediately free, and everything is fine. If I do not free them, cuda-memcheck reports them as leaked; if I try to free them, they refuse. Second day of this fun in a row...

danpovey commented 3 years ago

Ha, OK, thanks... BTW, I seem to remember that stdlib containers are not supposed to be used in CUDA code for some reason.

hugovbraun commented 3 years ago

So that code is older, but in more recent code we've been using CuVector or CuMatrix to achieve that goal. Could we use that instead? For code where we know for sure that CUDA is available (and so Cu* objects will allocate CUDA memory), the behavior is the same.
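
For reference, a sketch of what that could look like with CuArray<T> from cudamatrix/cu-array.h (presumably the "Array" Dan mentioned); the CopyFromVec()/Data() calls reflect my understanding of its interface, so treat this as an assumption rather than actual CudaFst code:

#include <vector>
#include "cudamatrix/cu-array.h"

// Sketch only: the CuArray owns the device copy and frees it on destruction.
void CopyArcNextStatesToDevice(const std::vector<kaldi::int32> &h_arc_nextstates) {
  kaldi::CuArray<kaldi::int32> d_arc_nextstates;
  d_arc_nextstates.CopyFromVec(h_arc_nextstates);       // allocates and copies to the GPU
  const kaldi::int32 *d_ptr = d_arc_nextstates.Data();  // raw pointer for kernels
  (void)d_ptr;
}  // device memory released in ~CuArray()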

jtrmal commented 3 years ago

Yeah, that makes sense to me. y.

kkm000 commented 3 years ago

std::vector would still be too much of an abuse of semantics. You can call push_back and... I do not really know what happens. The allocator has separate methods for allocation and deallocation, and methods for initializing objects in place (akin to placement new and delete). The vector can copy its allocated memory when resizing, but it has no idea that the memory is special.

I am not sure, need to look into it better to understand.

The unique_device_ptr allocators, however, can be implemented right away (see https://github.com/kaldi-asr/kaldi/pull/4548#issuecomment-853211261). There is no downside to it, as far as I can tell.

galv commented 3 years ago

Do you see any downsides?

There are unfortunately a lot of downsides to your proposal.

As Dan mentions,

BTW, I seem to remember that stdlib containers are not supposed to be used in CUDA code for some reason.

The "some reason" is that STL code is not annotated with the device pragma (since it is basically unaware of cuda).

So effectively, what that means is that if you define a CudaAllocator template class that defers to cudaMalloc and cudaFree (or Kaldi's own GPU memory allocator, or CUB's, or whatever), then instantiate it and write code like this:

std::vector<int, CudaAllocator<int>> a;
// initialize a here
a.back();  // segfault

the generated code will run on the CPU, so it will segfault when calling back().

You could actually use CUDA unified virtual memory for your allocator, so that both the CPU and GPU have virtual addresses for your allocations. Just an FYI.
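
As a rough illustration of that idea (a sketch using plain CUDA runtime calls rather than Kaldi's CuDevice, and untested): with managed memory the same pointer is valid on both the host and the device, so host-side accesses like back() would typically work (if slowly) rather than crash.

#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// Sketch of an allocator backed by unified (managed) memory.
template <typename T>
struct ManagedAllocator {
  typedef T value_type;
  T *allocate(std::size_t n) {
    void *p = NULL;
    if (cudaMallocManaged(&p, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
    return static_cast<T *>(p);
  }
  void deallocate(T *p, std::size_t /*n*/) { cudaFree(p); }
};
template <typename T, typename U>
bool operator==(const ManagedAllocator<T> &, const ManagedAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const ManagedAllocator<T> &, const ManagedAllocator<U> &) { return false; }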

Some helpful links, since I've thought about this before:

https://www.youtube.com/watch?v=5UVeh4_5B8I (discussion of how polymorphic resource allocators were used in the Thrust library; polymorphic resource allocators are a specific kind of allocator that allows dynamic run-time polymorphism for selecting your allocator, rather than static polymorphism)
https://github.com/NVIDIA/cuCollections (somewhat limited, but built by some colleagues Hugo and I know, and actively prioritized internally)
https://www.youtube.com/watch?v=-ENnYEWezKo (probably the most helpful)
https://www.youtube.com/watch?v=tcNkAtQ9ibw (good discussion of how the existing CPU-based APIs in the standard library are not so great if you copy them directly to CUDA; concurrent data structures require "bulk" operations)

I do think std::unique_device_ptr is a-okay. I expect you can actually implement it simply as a specialization of std::unique_ptr. But this is getting a bit long-winded.

galv commented 3 years ago

Thrust is the best bet for a device_vector container:

https://github.com/NVIDIA/thrust
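
For example (a sketch, not actual Kaldi code; StateId is a stand-in type), the CudaFst member could look roughly like this with Thrust:

#include <thrust/device_vector.h>
#include <vector>

typedef int StateId;  // stand-in for the real StateId type

// thrust::device_vector owns the device memory, copies from the host on
// construction, and frees everything in its destructor.
void CopyToDevice(const std::vector<StateId> &h_arc_nextstates) {
  thrust::device_vector<StateId> d_arc_nextstates(h_arc_nextstates.begin(),
                                                  h_arc_nextstates.end());
  const StateId *d_ptr = thrust::raw_pointer_cast(d_arc_nextstates.data());  // for kernels
  (void)d_ptr;
}  // device memory freed automatically here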

kkm000 commented 3 years ago

There are unfortunately a lot of downsides to your proposal.

Thanks for your thoughts; this is why I essentially agreed with the other Dan. :) It's not back() or any other attempt to dereference a device pointer on the host that worries me ("Doctor, it hurts when I do this." "So don't do this."). We all know not to do that. Rather, it's things like resizing, which probably cannot be handled properly with stdc++ containers. Or, at the very least, I do not know how to tell them to.

I do think std::unique_device_ptr is a-okay. I expect you can actually implement it simply as a specialization of std::unique_ptr.

Thanks! Actually, they are already handling the CudaFst's device memory as a POC :)

https://github.com/kaldi-asr/kaldi/blob/497e284408658cee46f52c5883e4d2146da06258/src/cudadecoder/cuda-fst.h#L86-L96

https://github.com/kaldi-asr/kaldi/blob/497e284408658cee46f52c5883e4d2146da06258/src/cudadecoder/cuda-fst.h#L123-L127

I'll try to improve it, so that they (a) cannot be dereferenced and (b) cannot be moved from or into a unique_ptr with a different deleter.

But this is getting a bit long-winded.

Not in the slightest; on the contrary, thank you for all the info! I'm very spotty on CUDA; some parts I understand fairly well, some not at all. I've never tried to learn and understand it systematically.

I thought about std::pmr; I'm using it with Intel's TBB, so I should certainly watch these talks and read the Thrust code. 'Fraid if I propose "why don't we guys use these cool polymorphic allocators?", the first thing @danpovey would do is swallow me whole right on the spot... :) Also, the whole kaboodle is C++17, and we just upgraded to C++14.

While I have you, a question, if you don't mind: why doesn't the CUDA decoder use UVM? Take the same CudaFst as a simplified example: 6 vectors in RAM, exactly copied into 6 vectors on the device using the API. It seems like a great use case, but since you guys do not do that, it seems you know something I do not... What's the matter with it?

hugovbraun commented 3 years ago

While I have you, a question, if you don't mind: why doesn't the CUDA decoder use UVM? Take the same CudaFst as a simplified example: 6 vectors in RAM, exactly copied into 6 vectors on the device using the API. It seems like a great use case, but since you guys do not do that, it seems you know something I do not... What's the matter with it?

Control over where the data is. Using (pinned) host memory from the device is possible but very slow (PCIe speed), so you need a local copy in GPU memory. You can rely on the UVM driver to take care of this, but you end up with a lot of page faults, slowing down GPU kernels. In theory you could use hints, but for the use case here it's better to just keep two copies; "simple is better". UVM could be useful for a giant FST which does not fit in GPU memory at all (but that remains to be seen).
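
For context, the "hints" mentioned here would be calls like cudaMemAdvise() (plus an explicit cudaMemPrefetchAsync()) on managed allocations; a sketch of how they could be used (illustrative only, not code from the decoder):

#include <cstddef>
#include <cuda_runtime.h>

// Tell the UVM driver a managed FST buffer is read-mostly and prefetch it to
// the GPU up front, so kernels do not take page faults on first touch.
void PrefetchFstToDevice(const void *d_fst_data, std::size_t num_bytes, int device) {
  cudaMemAdvise(d_fst_data, num_bytes, cudaMemAdviseSetReadMostly, device);
  cudaMemPrefetchAsync(d_fst_data, num_bytes, device, /*stream=*/0);
}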

kkm000 commented 3 years ago

@hugovbraun, thanks, I see. So you cannot tell it "I'll need this data on both the host and device all the time, and, by the way, copy it right now."

BTW, I don't know how it translates to the Yale (CSR) matrix form of the FST, but during normal LVASR CPU decoding, less than 10% of the HCLG machine's arcs get touched in practice, even on an almost 1-hour-long set. I tracked that when experimenting with modifying the graph on the fly (I added some symbolic arcs as dynamic branching points, kind of what Dan later did with the grammars, but a much rougher prototype).

csukuangfj commented 2 years ago

Instead of doing

std::vector<StateId, CuDevice::Allocator> d_arc_nextstates_;

How about wrapping it inside a custom class, e.g.,


template<typename T>
class Foo {
public:
  // We can control which methods should be exposed.
  // For instance, if `back()` is error-prone, we just don't expose it.
private:
  std::vector<T, CuDevice::Allocator> data_;
};
galv commented 2 years ago

Hi @csukuangfj, unfortunately std::vector's operator[] (as well as at()) is not marked as a device function, so CUDA kernels cannot call the std::vector methods.

You could use std::vector as a container type as you describe above, and then just implement all your methods in terms of the pointer returned by data_.data(), though. However, this is not much different from using a std::unique_ptr specialized for array types (although I suppose you do get copy semantics then).

csukuangfj commented 2 years ago

I guess the above comments miss the point of the motivation why @kkm000 wants to do this.

What he wants is to avoid memory leaks, not to use operator[], at(), back(), data(), etc. Also, he does not mean to use std::vector directly in a kernel.

He only wants to use std::vector to manage memory.

The only downside I can think of is that the constructor of std::vector<int, CuDevice::Allocator> will initialize the allocated memory, which causes a segfault since the memory is on the GPU.
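
If one did go that route, one conceivable workaround (a sketch, untested) is to give the allocator no-op construct()/destroy() members, so that the sized vector constructor only allocates and never writes through the device pointer on the host; element-wise operations (push_back, copies, resize with copying) would of course remain off-limits:

#include <cstddef>
#include <new>
#include "cudamatrix/cu-device.h"

// Hypothetical sketch, the same CuDeviceAllocator idea as earlier in the
// thread, plus no-op construct()/destroy(); equality operators omitted for brevity.
template <typename T>
struct CuDeviceAllocator {
  typedef T value_type;
  T *allocate(std::size_t n) {
    void *p = kaldi::CuDevice::Instantiate().Malloc(n * sizeof(T));
    if (p == NULL) throw std::bad_alloc();
    return static_cast<T *>(p);
  }
  void deallocate(T *p, std::size_t /*n*/) { kaldi::CuDevice::Instantiate().Free(p); }
  template <typename U, typename... Args>
  void construct(U * /*p*/, Args &&... /*args*/) {}  // no-op: no host-side writes
  template <typename U>
  void destroy(U * /*p*/) {}                         // no-op
};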

danpovey commented 2 years ago

I think you are not allowed to use STL container types on the GPU, period, according to NVIDIA.
