kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Insane indexes in CUDA decoder h_infotoken_concat_ when 2 CudaDecoder instances lifetime overlaps #4556

Open kkm000 opened 3 years ago

kkm000 commented 3 years ago

@hugovbraun, the question is most likely for you. I am not sure I correctly understand the behavior of the H2H threads (which are separate from the post-processing thread pool), and whether it is intended. Some of this applies only if CudaDecoder::n_threads_used_ > 1, but even with just one thread (my case, currently), there are possible issues.

All the n_threads_used_ H2H threads are sitting there waiting for work

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L1831-L1836

signaled by the n_h2h_main_task_todo_cv_ CV

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L1978-L1987

and to initiate this work, CudaDecoder::LaunchH2HCopies() increments n_h2h_task_not_done_ by n_threads_used_, sets n_h2h_main_task_todo_ to equal exactly n_threads_used_, and signals all H2H threads to start their thing:

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L1814-L1825

The H2H thread(s) immediately decrease n_h2h_main_task_todo_, collectively by the same n_threads_used_, and proceed. n_h2h_task_not_done_ is decremented near the end of ComputeH2HCopies() under a mutex, completing the CV semaphore pattern.
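To make sure I'm reading it right, here is the handshake as I understand it, condensed into a schematic sketch (member names mirror cuda-decoder.cc; the bodies are simplified, with shutdown handling and the actual per-lane copies elided, so treat it as annotated pseudocode rather than the real implementation):

    #include <condition_variable>
    #include <mutex>

    struct H2HHandshakeSketch {
      int n_threads_used_ = 1;
      int n_h2h_main_task_todo_ = 0;  // guarded by n_h2h_main_task_todo_mutex_
      int n_h2h_task_not_done_ = 0;   // guarded by n_h2h_task_not_done_mutex_
      std::mutex n_h2h_main_task_todo_mutex_, n_h2h_task_not_done_mutex_;
      std::condition_variable n_h2h_main_task_todo_cv_, h2h_done_;

      // Master thread, from CopyMainQueueDataToHost(), after WaitForH2HCopies().
      void LaunchH2HCopies() {
        {
          std::lock_guard<std::mutex> lk(n_h2h_task_not_done_mutex_);
          n_h2h_task_not_done_ += n_threads_used_;
        }
        {
          std::lock_guard<std::mutex> lk(n_h2h_main_task_todo_mutex_);
          n_h2h_main_task_todo_ = n_threads_used_;
        }
        n_h2h_main_task_todo_cv_.notify_all();  // wake every dedicated H2H worker
      }

      // One iteration of each dedicated H2H worker (ComputeH2HCopies()).
      void ComputeH2HCopies() {
        {
          std::unique_lock<std::mutex> lk(n_h2h_main_task_todo_mutex_);
          n_h2h_main_task_todo_cv_.wait(
              lk, [this] { return n_h2h_main_task_todo_ > 0; });
          --n_h2h_main_task_todo_;  // each of the n_threads_used_ workers claims one slot
        }
        // ... per-lane copies, distributed via the fetch_sub counters, elided ...
        {
          std::lock_guard<std::mutex> lk(n_h2h_task_not_done_mutex_);
          --n_h2h_task_not_done_;
        }
        h2h_done_.notify_all();  // completes the CV "semaphore" for the master
      }

      // Master thread: block until the previous batch of copies has drained.
      void WaitForH2HCopies() {
        std::unique_lock<std::mutex> lk(n_h2h_task_not_done_mutex_);
        h2h_done_.wait(lk, [this] { return n_h2h_task_not_done_ == 0; });
      }
    };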

Is this an intended parallelism pattern of ComputeH2HCopies()? I'd normally expect .notify_one() in place of .notify_all(). If all (more than one) threads were blocked on the CV, they would start doing the same thing in parallel. I see that they are more or less safe to run in parallel under a very specific protocol (e.g., lanes2channels_todo_ and many other member variables should not change externally except between WaitForH2HCopies() and LaunchH2HCopies(), and should not be expected to even load consistently when read, provided that this bracketing is globally data-race-free). The bracketing is in CopyMainQueueDataToHost() and is guaranteed, provided that method is always called from the same single thread or from a critical section with global fences (which it's not). That function is in turn called from AdvanceDecoding(), which is part of the public interface. This is the point where I'm starting to get a bit nervous. :) Am I reading the code correctly?

The contracts are quite extensive. From the maintainability point of view, even if the code is correct under these contracts (including the external AdvanceDecoding(), so it spreads beyond this single class), the correctness is quite brittle, given how many member variables are touched within ComputeH2HCopies().

I have been hunting a bug that results in a nonsensical value of offset being returned here

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L1289-L1292

and the offset is 10 times the size of the h_all_tokens_extra_prev_tokens_[ichannel] vector. I have spent a week on this, still without understanding where the value is ultimately coming from. There are apparently more implicit contracts on calling this code than I can infer. Or maybe a genuine bug is hiding somewhere. It's obviously set by the first MoveConcatenatedCopyToVector call in this part of the H2H worker

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L2037-L2052

but h_all_tokens_extra_prev_tokens_ is read on a thread pool thread. At this point, the analysis explodes from a single lane to global data structures, and I do not even understand what prevents another H2H job from being launched from the thread that calls AdvanceDecoding(), which would invalidate h_all_tokens_extra_prev_tokens_.

I'd try to at least document all the contracts explicitly, but I'm afraid I understand only half of them, if I'm lucky. :) Maybe we could put some effort into making them all explicit?

hugovbraun commented 3 years ago

Thanks for taking the time to look at that code @kkm000 ! That part of the code is something I've been wanting to clean up for some time, so your concerns are well founded.

I am not sure I correctly understand the behavior of the H2H threads (which are separate from the post-processing thread pool), and whether it is intended.

Those H2H copies are very latency sensitive, which is why they are handled by dedicated threads (and not by the thread pool). We want to use just enough threads to saturate the host memory bandwidth (and perform those memcpies while the GPU is doing something else).

Is this an intended parallelism pattern of ComputeH2HCopies()? I'd normally expect .notify_one() in place of .notify_all(). If all (more than one) threads were blocked on the CV, they would start doing the same thing in parallel.

So, they are indeed all executing ComputeH2HCopies, but that method contains lines like this one :

while ((ilane = n_acoustic_h2h_copies_todo_.fetch_sub(1)) >= 0) {

where each thread grabs a unique task. This is why we notify all: they are all processing the data in parallel.
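In isolation, the idiom looks like this (a toy sketch, not the actual decoder code; the lane count, worker count, and printing are made up):

    // Each worker atomically grabs a distinct lane index until the counter
    // goes negative, so no lane is processed twice and no locking is needed.
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
      const int num_lanes = 8, num_workers = 3;
      std::atomic<int> todo(num_lanes - 1);  // highest lane index still to do

      auto worker = [&](int tid) {
        int ilane;
        while ((ilane = todo.fetch_sub(1)) >= 0)
          std::printf("worker %d copies lane %d\n", tid, ilane);
      };

      std::vector<std::thread> workers;
      for (int t = 0; t < num_workers; ++t) workers.emplace_back(worker, t);
      for (auto &w : workers) w.join();
      return 0;
    }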

lanes2channels_todo_ and many other member variables should not change externally except between WaitForH2HCopies() and LaunchH2HCopies(), and should not be expected to even load consistently when read, provided that this bracketing is globally data-race-free). The bracketing is in CopyMainQueueDataToHost() and is guaranteed, provided that method is always called from the same single thread or from a critical section with global fences (which it's not).

Those 3 functions should always only be called by the master thread (the one calling AdvanceDecoding). Am I missing something here?

The contracts are quite extensive. From the maintainability point of view, even if the code is correct under these contracts (including the external AdvanceDecoding(), so it spreads beyond this single class), the correctness is quite brittle, given how many member variables are touched within ComputeH2HCopies().

I agree. I want to simplify that code and regroup into a single struct all the variables used for communication between the master thread and the H2H threads

Some of this applies only if CudaDecoder::n_threads_used_ > 1

">1" or > 0 ? Making sure this is a typo

Please let me know if all of this makes sense :)

I have been hunting a bug that results in nonsensical value of offset returned here

I can definitely help with that. Any chance you could send me a model + audio file to repro ? Email is hbraun with the domain nvidia.com.

Btw, if you are aware of some magic trick that std::memcpy doesn't know about to easily saturate the host memory bandwidth, that would definitely interest me.

kkm000 commented 3 years ago

@hugovbraun, thank you for the detailed answer. Looks like at least the parts I do understand, I understand correctly. :)

Those 3 functions should always only be called by the master thread (the one calling AdvanceDecoding). Am I missing something here?

No, nothing at all; it just needs to be documented, I believe. There is a single feeder thread that in fact does that.

">1" or > 0 ? Making sure this is a typo
while ((ilane = n_acoustic_h2h_copiestodo.fetch_sub(1)) >= 0) {

That's related: in the > 1 case, extra care is required so that the threads split the work like this; but even if there is only 1 thread, there is still a potential for data races between it and other threads.

I want to simplify that code, and regroup into a single struct all variables used for communications between the master thread and the h2h threads

I was thinking in the same vein. There is stuff which must not be touched while the H2H transfers are in progress, and if it were grouped up, it would be much easier both to understand and change. I had to track the use of every member variable that is used by that section of code, and their names are both long and similar to other names. Even a simple something_.n_acoustic_h2h_copies_todo would go a long way toward readability, IMO.
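Roughly the shape I have in mind (the struct and field names below are invented, just for illustration):

    // One struct owning everything the master hands to the H2H workers, so the
    // "do not touch while copies are in flight" contract has a single home.
    #include <atomic>
    #include <vector>

    struct H2HCopyJob {
      // Written by the master only between WaitForH2HCopies() and
      // LaunchH2HCopies(); read-only for the workers while a job is in flight.
      std::vector<int> lanes2channels_todo;
      std::vector<int> main_q_end_lane_offsets;

      // Per-array countdown counters, consumed by the workers via fetch_sub().
      std::atomic<int> n_acoustic_h2h_copies_todo{-1};
      std::atomic<int> n_infotoken_h2h_copies_todo{-1};
      std::atomic<int> n_extra_prev_tokens_h2h_copies_todo{-1};
    };

Then the contract would be documented, and could even be asserted, in one place instead of being implied by a dozen member names.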

Btw, if you are aware of some magic trick that std::memcpy doesn't know about to easily saturate the host memory bandwidth, that would definitely interest me.

Interesting. For transfers significantly exceeding the CPU's L3 size, it should probably be close—at first sight, memcpy should be memory-bound. Assuming 4-way DDR4-2600 and Skylake-X or above, the ballpark should be 50 GB/s within a NUMA node. What discrepancy do you see? What is the size of the transfers? What's the CPU type, and is this a NUMA system? The hardware factors that significantly affect memory transfer are power/clock control and NUMA memory locality; I can't readily think of anything else.
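If it helps to quantify the discrepancy, I'd start with a crude probe like the one below (single-threaded, buffers far larger than any L3, one warm-up pass), then scale the number of copying threads until the bus saturates. It's a ballpark tool under those assumptions, not a benchmark:

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
      const size_t bytes = size_t(1) << 30;  // 1 GiB per buffer
      std::vector<char> src(bytes, 1), dst(bytes, 0);

      std::memcpy(dst.data(), src.data(), bytes);  // warm up pages and caches

      const int reps = 10;
      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < reps; ++i) std::memcpy(dst.data(), src.data(), bytes);
      auto t1 = std::chrono::steady_clock::now();

      double secs = std::chrono::duration<double>(t1 - t0).count();
      // Each pass both reads and writes `bytes`; this reports the copy rate.
      std::printf("copy rate: %.1f GB/s\n", reps * (bytes / 1e9) / secs);
      return 0;
    }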

I can definitely help with that. Any chance you could send me a model + audio file to repro ?

Thank you, but the stuff blows up in my server code, which also has a network API, on exactly the first lattice emission of the third model that I activate in a row (both A, B, A and A, B, C reproduce; the difference in the A, B, C case is that the files are loaded fresh from disk, while in the former the pre-loaded objects in RAM are reused). The pipeline and dynamic batcher objects are destroyed for reloading, but the GPU context is created once, at the start of the service. I do not think I could come up with a simple stand-alone repro. I found a way to reproduce it 100% at the same location with approximately the same values that I see, but even to get to this point, I experimented a lot. Timing seems to be all important—I think I may be running into an interesting race.

By adding more and more assert()s I just caught it in the act right at this point

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L2037-L2047

with this assert() failing after this MoveConcatenatedCopyToVector at line 2045

    assert(h_all_tokens_info_[ichannel].size() <= 1 ||
           h_all_tokens_info_[ichannel].back().arc_idx >= 0 ||
           ((h_all_tokens_info_[ichannel].back().prev_token -
             h_all_tokens_info_[ichannel].back().arc_idx) <=
            h_all_tokens_extra_prev_tokens_[ichannel].size()));

with that humongous offset value showing up at the last array element added above this code, by the MoveConcatenatedCopyToVector under the comment "step 2":

(gdb) p h_all_tokens_info_[ichannel].size()
$10 = 1106
(gdb) p ichannel
$11 = 799
(gdb) p ilane
$12 = 0
(gdb) p h_all_tokens_info_[ichannel][1100]
$13 = {prev_token = 1076, arc_idx = 44885545}
(gdb) p h_all_tokens_info_[ichannel][1103]
$14 = {prev_token = 1087, arc_idx = 44707772}
(gdb) p h_all_tokens_info_[ichannel][1104]
$15 = {prev_token = 1090, arc_idx = 44698222}
(gdb) p h_all_tokens_info_[ichannel][1105]
$16 = {prev_token = 239318, arc_idx = -2}

For some reason, it's always around 250000. I don't think arc_idx values over 40 million are normal either, but I do not really have a feel for it. If these are arcs in the original HCLG machine, possible: it has 15M states.

Nothing wrong here; I have only one thread, we're on the last iteration:

(gdb) p n_extra_prev_tokens_h2h_copies_todo_
$26 = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = -1}, static is_always_lock_free = true}
(gdb) p lanes2channels_todo_
$27 = std::vector of length 3, capacity 400 = {799, 798, 797}

The incorrect value did go into h_all_tokens_info_ from h_infotoken_concat_ by way of MoveConcatenatedCopyToVector at step 2. Given its size of 1106, 1105 elements should have been appended to it; h_main_q_end_lane_offsets_ look good

(gdb)  p h_main_q_end_lane_offsets_[0]
$29 = 0
(gdb) p h_main_q_end_lane_offsets_[1]
$30 = 1105

and the values from h_infotoken_concat_ match what I printed above ($13-$16), with the seed token from the "channel" 801 (this is how it should be, right?)

(gdb) p h_all_tokens_info_[ichannel][0]
$31 = {prev_token = -2147483648, arc_idx = -1}
(gdb) p *h_infotoken_concat_
$32 = {prev_token = 0, arc_idx = 661}
(gdb) p h_all_tokens_info_[ichannel][1]
$33 = {prev_token = 0, arc_idx = 661}
(gdb) p *(h_infotoken_concat_+1102)@3
$34 = {{prev_token = 1087, arc_idx = 44707772}, {prev_token = 1090, arc_idx = 44698222}, {prev_token = 239318, arc_idx = -2}}

So this insane offset value did in fact come from h_infotoken_concat_[1104], but I have no idea how it could have gotten there. The normal way would be via LaunchD2HCopies() with the value computed on the device, but I have no proof that it did in fact happen. Looks very likely, but I question everything :) Didn't go into the device code yet.

Do you have any ideas what to look at?

kkm000 commented 3 years ago

@hugovbraun, just a thought about memory copy. This may or may not fit your pattern of use, but the fastest memcpy is no memcpy at all. On both Linux and Windows, it is possible to allocate (page-aligned or huge-page-aligned) virtual memory, and then map a page range of it at a new address with a copy-on-write flag. See, respectively, memfd_create(2) and the mmap(2) MAP_PRIVATE flag; or CreateFileMapping() with INVALID_HANDLE_VALUE for a pagefile-backed allocation, and MapViewOfFile() with the FILE_MAP_COPY flag. Dunno about macOS; memfd_create(2) is Linux-only AFAIK, but the closest alternative, shm_open(3), is POSIX (MAP_PRIVATE is, too, the man says).
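On Linux the whole trick is a handful of calls. A sketch, with sizes arbitrary and most error handling omitted (memfd_create() needs glibc 2.27+ or a raw syscall; g++ defines the required _GNU_SOURCE):

    #include <cstring>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
      const size_t len = 1 << 20;  // must be a multiple of the page size

      int fd = memfd_create("cow_demo", 0);
      if (fd < 0 || ftruncate(fd, len) != 0) return 1;

      // Writable "producer" view: fill it once (think of the H2H staging buffer).
      char* shared = static_cast<char*>(
          mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
      std::memset(shared, 0xAB, len);

      // Second mapping of the same pages, copy-on-write: reads cost nothing
      // extra; the kernel duplicates a page only when this view writes to it.
      char* cow = static_cast<char*>(
          mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0));
      cow[0] = 0;  // this single write faults in a private copy of one 4K page

      munmap(cow, len);
      munmap(shared, len);
      close(fd);
      return 0;
    }

Keep in mind the copy is deferred to the first write of each page, not eliminated, so whether this beats a straight memcpy depends entirely on the write pattern that follows.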

kkm000 commented 3 years ago

@hugovbraun, I have not advanced much in trying to figure out where the corrupted value is coming from. The InfoToken of {250000-ish, -2} seems to pop up at the end of most arrays copied from the device. I'm not sure about all of them; I think I once saw the assertion with a lane index matching the second-from-the-end utterance in the unpacked h_all_tokens_info_ array, but most commonly it's the first one encountered. What's amazing is the circumstantial evidence.

As I mentioned, I'm reloading the models in the GPU. Here's the rub. I'm working with a T4, 2 pool threads and 1 H2H thread. This means I can easily run 3 parallel pipelines w/o overrunning the device RAM (didn't try 4, probably can, too). If I load them one by one (each with a separate pool, batcher, decoder and everything), they work just fine. It's slow, since I'm overloading the small device (it seems to peak at ~40 lanes, which suspiciously matches the number of its SMs). By "running a pipeline," or "running a model" ('fraid internal terms may sneak in; both mean the same thing), I mean everything: acoustic+transition model, FST, full set of threads (the feeder thread, a single H2H thread and 2 threadpool threads). Now, here come some very weird findings.

  1. The hell (as described above 👿 ) breaks loose if I destroy any of the pipelines. I can do that while other pipelines are under load, or when they sit idle; this does not matter, only the fact that I stop the threads and destroy objects does. Any other pipeline will exhibit the behavior I described (I assert upon seeing this out-of-whack offset as soon as it is about to be used as an index into h_all_tokens_extra_prev_tokens_ in the H2H thread: it's always the last one added to the array, which has an arc index of -2, so its prev_token is treated as an offset, and that offset is beyond any sensible array bounds, in the 250000-ish range).
  2. However, I can load and destroy however many pipelines fit into the GPU, and it all will be fine unless I also run some audio through any of them. From that point on, the assertion will happen.
  3. In the original design of the server, the client wants to start the next "batch" (not our batch of packaged features, but a batch of audio files; let's call it a Batch), and gives me a hint that the load on the current model is going to end. I let stragglers drain, at the same time loading the next one. Currently, the new model will not allow any load until the current one has stopped accepting load and drained itself, while the other, already loaded from disk into "normal" Kaldi objects, is starting up. There is a short time when the first model's threads co-exist with the new, next-to-become-active pipeline. The state management is kinda behind-the-scenes and asynchronous, but this overlap is always present: I flag the client to open the floodgates, and issue unload on the other model at the same time. By design, the overlap should have been larger (to allow stragglers to straggle while already accepting load on the new pipeline), but the current switch criterion is zero remaining active channels in the "old" pipeline.
  4. This is (temporarily, was) the expected workflow. However, there is also an API to control loading and stopping models/pipelines explicitly. If, after running a Batch, I first delete the active pipeline, and only then load the next one, nothing bad happens. As many cycles as I want; I think I ran up to 50. As long as only one pipeline exists at a time, no crash, no weird data.

In summary, getting this strange InfoToken corruption requires both (a) more than one pipeline churning or just waiting for audio and (b) some audio processed while the same GPU context is alive, but it starts to happen only after (c) any of the pipelines is destroyed.

I could not find any shared stale state either on the CPU or GPU. I added code to explicitly stop the threadpools, H2H thread and the feeder thread, and assert that there is in fact no pending work there. It's one of the most baffling bugs I'm squishing.

ASAN is silent (down to the point where the crazy offset is actually applied—that's how I caught it in the first place—but now I crash on an assert before it). I have not yet leak-checked either the GPU or the CPU in the survivable cases that I described. I'm totally lost. Help!!!

hugovbraun commented 3 years ago

@kkm000 so that's a confusing one. The most confusing part is that (a) it works if there is no overlap at all between pipelines and (b) the incorrect value has some logic in it. Is the bug deterministic? When you leak-checked the GPU, did you use memcheck? I'll re-read the code and try to make sense out of all this

As a temporary workaround, could you use separate processes instead of running all pipelines in the same binary? If you start an MPS server, it could even yield better performance when overlapping pipelines.

Great idea for the memfd_create - I'm wondering if this could work with the granularity we need (with the POSIX version). I'll try to patch the h2h code for now

kkm000 commented 3 years ago

@hugovbraun, thanks for the reply. I do not really understand GPU hardware and CUDA well enough to tell what could affect what, and what would not. When I examine the CPU code, there is obviously no sharing between the pipelines. When I look at the CUDA kernels of the decoder, the first thing I see is that most parts are 11th-level magic, and it will probably take me months of actively working with CUDA to understand them fully (maybe I'm too optimistic tho), but certainly no global sharing of state is going on. Every kernel gets its data in a neat package of two structures. Nothing obvious pops up.

We have a good workaround for the issue, as the segments are relatively short, and waiting and launching w/o overlapping is not an issue. We're targeting smaller instances with only a T4 accelerator, which has by far the best price/performance. A few seconds to drain one load, stop one model, start the next (already in RAM) and put it under load is not a big deal. The start part takes 1-2 seconds, and the drain is load dependent, but is also in the units of seconds. I just wanted to make it as efficient as it gets without spending too much effort; we have other constraints that pretty much guarantee that the runs of the model are long enough to comfortably prepare all the parts of the pipeline for the run after next.

The MPS solution would certainly work for larger accelerators and core counts, but there are other things, such as the common network API already used by CPU-only loads, a shared codebase that I do not want to diverge, etc.; the small tradeoff—I can look at it as a failed attempt at a small improvement in the overall throughput—matters little. Each pipeline provides enough occupancy, and I'm rather facing the reverse problem: the small RAM in the instance allows only so much buffering, and I have to force clients into a backoff, refusing calls with an overload error as the postprocessing pool comes to capacity. It's simpler to run more instances, in the end, than to deal with quickly shifting dynamic loads on a faster accelerator.

As for the two simultaneously working decoders, it would certainly be helpful to understand what is going wrong. I've already dug too deep into it to just drop it on the floor; even if not for production, I certainly would like to understand what is going on, for myself.

I debugged it just yesterday with 2 identical models and a single input audio stream about 500 ms long, so that I could comfortably focus on h_infotoken_concat_ right at this point:

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudadecoder/cuda-decoder.cc#L2027-L2047

First, all the data (one short stream) are processed by the pipeline and emitted. This is certain, as it's a manual invocation of the command-line client tool that I threw together for debugging, and I see the recognition result printed before even starting the command again. Then I start it with the same file and the other model ID requested. Both models are loaded in RAM, and the next-to-go one does not touch anything CUDA-related: an Nnet3 acoustic model, an i-vector extractor, the normal Fst HCLG, etc. At this next request, the 2nd pipeline is created, initialized and started (meaning all its threads: the threadpool of 2, the H2H thread in the decoder and the feeder thread in the batcher), and then, after this, I shut down the first one. In this debugging setup, I made sure that everything is shutting down synchronously w.r.t. the API dispatching thread. I experimented with a different shutdown order, by adding explicit Stop methods to all objects which start threads, but it made no difference. In the end, all threads are stopped (this is asserted) and objects are destroyed, and only at this point does the API thread send data to the batcher of the 2nd model. The 1st model's Kaldi/FST objects are left loaded in RAM, but its decoder interface is positively destroyed.

I used the same setup previously to catch the leaks of the CudaFst objects, rotating the requests over the 3 models but without any audio at all.

So, back to our h_infotoken_concat_ at the breakpoint at the second MoveConcatenatedCopyToVector at 2045. Since the audio is the same and the model is a clone, I'd expect to see more or less similar data there (and the array sizes and dynamics indeed are). At the first hit of the breakpoint there is only 1 element in h_infotoken_concat_, and 0 in h_extra_prev_tokens_concat_ (from InitDecodingH2HCopies(), I think?); the second hit is the most interesting. For the first audio ever, I've got counts of 1058 and 34, respectively, and the tokens look unalarming, although they are not deterministic between runs (not surprising either). This is what I got by looking at the first 100; the first index into the extra tokens array is at [3]:

$1 = {{prev_token = 0, arc_idx = 1969}, {prev_token = 0, arc_idx = 2013}, {prev_token = 0, arc_idx = 2036}, {prev_token = 8, arc_idx = -2}, [...snip...] 

Then I launched the second model as described, and the sequence is slightly different but still normal: 1 and 0 at the first hit and 1055 and 36 on the second. The first multi-source token appears much later, but still the sequence is not surprising (and resulted in a correct recognition of a single "no" in the end, although I paid very little attention to the decoded raw lattice, still too much data to make sense of), with the first multi-source arc at indices 58 and 60:

p *(h_infotoken_concat_+0)@150
$2 = {{prev_token = 0, arc_idx = 957}, {prev_token = 0, arc_idx = 958}, [...snip...]
p *(h_infotoken_concat_+55)@10
$3 = {{prev_token = 0, arc_idx = 1324}, {prev_token = 0, arc_idx = 1325}, {prev_token = 0, arc_idx = 1326}, {prev_token = 0, arc_idx = -2}, {prev_token = 0, arc_idx = 1366}, {prev_token = 12, arc_idx = -5}, [...snip...]

In both the 1st and 2nd runs I see the indices into h_extra_prev_tokens_concat_ (prev_token for arc_idx < 0) coming close to its length, but never exceeding it. I think this is how it's supposed to be. (In additional runs, I stopped at this BP many times during the first, known-good run, and the extra-index pattern looked the same: they were growing close to the extra token array length, but not quite reaching the very last elements, as if h_extra_prev_tokens_concat_ were streaming slightly ahead of h_infotoken_concat_, by no more than 10 elements. I stopped hitting continue in the debugger when the stream of h_infotoken_concat_ trickled down to no more than a dozen items per breakpoint, and that was at about half of its final length, 10K out of slightly over 20K.)

The third run's data is different. Again, the model switch protocol hasn't changed. I also disabled the breakpoint after the 2nd break and after printing part of the arrays just copied from the device, to allow the decode to finish uninterrupted.

The counts are normal (again, 1 and 0 on the first hit, and 1062 and 39 on the second). The arc_idx < 0 tokens in h_infotoken_concat_ are a train wreck. I got the earliest one only at h_infotoken_concat_[136] (I'm certain it's the first: I copied the gdb printout to the editor and searched for a '-'), pointing to the "index" 5947 in an array of so far 39 elements, and there were three of them in a row, growing by 2, nothing crazy, see below. As if they started counting up from some random? old? but certainly wrong value. I do not know if it's notable or not that all the previous tokens are back-pointing to 0, but they are. I think it's pretty much similar to the normal behavior, except for these strangely offset extra-token array indices.

p *(h_infotoken_concat_+0)@100
$29 = {{prev_token = 0, arc_idx = 83}, {prev_token = 0, arc_idx = 84}, {prev_token = 0, arc_idx = 85}, [...snip...] {prev_token = 0, arc_idx = 213}, {prev_token = 0, arc_idx = 214}}
. . .
p *(h_infotoken_concat_+130)@30
$33 = {{prev_token = 0, arc_idx = 251}, {prev_token = 0, arc_idx = 252}, {prev_token = 0, arc_idx = 253}, {prev_token = 0, arc_idx = 254}, {prev_token = 0, arc_idx = 255}, {prev_token = 0, arc_idx = 2049}, {prev_token = 0, arc_idx = 2054}, {prev_token = 0, arc_idx = 2055}, {prev_token = 0, arc_idx = 2058}, {prev_token = 0, arc_idx = 2100}, {prev_token = 0, arc_idx = 2113}, {prev_token = 5947, arc_idx = -2}, {prev_token = 5949, arc_idx = -2}, {prev_token = 5951, arc_idx = -2}, {prev_token = 0, arc_idx = 2847} ...snip...
p *(h_infotoken_concat_+135)@5
$34 = {{prev_token = 0, arc_idx = 2113}, {prev_token = 5947, arc_idx = -2}, {prev_token = 5949, arc_idx = -2}, {prev_token = 5951, arc_idx = -2}, {prev_token = 0, arc_idx = 2847}}

So these enormous indices start very early, and things always break exactly in the 3rd decode. I caught them first using ASAN when they were dereferenced, then using an assertion that looks only at the single last element of the vector into which data has just been copied from h_infotoken_concat_, but it triggers quite deterministically on the 3rd run.

When I switched the order of destruction so that all threads of the old pipeline are stopped and its objects are destroyed before creating/starting the next one, the strange behavior went away. I did not run a long test, but at the least 10 repetitions of the same manual runs, with one and then a couple dozen streams at a time, all completed normally and resulted in the expected one-best decodes that I'm printing on the client for debugging.

And the kernel code flies way above my head. :(


I'm wondering how randomly-timed calls to cudaDeviceSynchronize() could affect the decoding, if at all. What happens behind the scenes, and is not explicit, is that some outgoing threads have previously called CuDevice::Initialize(), and they are destroying per-thread handles to cuBLAS, cuRAND, cuSPARSE and cuSolver. I added debug prints there, and there are in fact two threads, when the outgoing model stops, that do have non-null handles in the destructor and free them.

https://github.com/kaldi-asr/kaldi/blob/ea2b433dfdd7eab4c7a665ef46f7e87a2e4a782e/src/cudamatrix/cu-device.cc#L604-L622

I vaguely remember that at least some of them do call cudaDeviceSynchronize(), or at least did in previous CUDA versions (I'm using 11.3). Could this be a factor? Anything else, maybe, going on in the code that frees the handles of these libraries?

I'm compiling everything with -default-stream=per-thread, naturally.

kkm000 commented 3 years ago

Great idea for the memfd_create - I'm wondering if this could work with the granularity we need (with the POSIX version).

May be great, or not so much. The page is simply mapped a second time, as read-only, at a different VM address. This does not count toward the working set size, and I think in hardware it does not even affect lower-level cache coherence (L3, possibly L2). But any write to it makes the kernel copy the page via a page fault first, then redo the write. After that it becomes just a normal page.

It always works at page granularity, 4K for the x86 architecture. It also has huge page support; these are 2M, 4M, or 1G (Cascade Lake and later only).
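The huge-page flavor is the same couple of calls, assuming pages were reserved beforehand; a sketch (MFD_HUGETLB needs a 4.14+ kernel and glibc 2.27+; sizes other than the default need the MFD_HUGE_* flags from <linux/memfd.h>):

    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
      const size_t len = 2u << 20;  // one default-size (usually 2M) huge page
      int fd = memfd_create("cow_huge", MFD_HUGETLB);
      if (fd < 0 || ftruncate(fd, len) != 0) return 1;  // no pages reserved, or old kernel/glibc
      void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      // ... fill, then add a MAP_PRIVATE view exactly as in the previous sketch ...
      munmap(p, len);
      close(fd);
      return 0;
    }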

The more I think about your question, the more interesting things I find. To a modern CPU, RAM is almost a peripheral. A sequential read of a few GB from RAM is only 3-5 times faster than reading the same amount from a PCIe NVMe disk drive. GPU-GRAM bandwidth is much, much higher. And CPU-RAM latency is horrible, 200-300 cycles. A DMA transfer from the GPU decoheres all cache levels.

So the slow part may not be actual memcpy()¹, I think, but the loading of the data into L3 from RAM. If you can stay within the L3 size during processing, assuming the whole CPU is yours (pretty reasonable in HPC) by affinizing threads to it, the bandwidth is 5 times higher, and the latency is 10 times lower. The L3 size ballpark is 4MB/core (Skylake-X), shared by all cores of a package. And only L2 can saturate the core bandwidth, again, in the ballpark. L3-sized workload requires multiple cores and judicious memory accesses, to ensure per-core L2 data locality.


¹ std::copy is marginally more efficient for larger transfers, and better for smaller ones, because it has a good idea of data alignment and, being inlined, can use the instruction set better than the precompiled libc. But it's all moot if you have an L3 miss.
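And by "affinizing" above I mean nothing fancier than this (Linux-specific; pthread_setaffinity_np is a GNU extension):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Pin the calling thread to one core so its working set stays in one L2.
    static void PinCurrentThreadToCore(int core) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
      std::thread h2h([] {
        PinCurrentThreadToCore(2);  // keep the copy work on one physical core
        // ... memcpy work ...
      });
      h2h.join();
      return 0;
    }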

kkm000 commented 3 years ago

@hugovbraun, I just realized that I'm trying to answer without really knowing the question. Can you tell me more about the memcpy-related hot spot that you are trying to optimize?

galv commented 3 years ago

For some reason, it's always around 250000. I don't think arc_idx values over 40 million are normal either, but I do not really have a feel for it. If these are arcs in the original HCLG machine, possible: it has 15M states.

@kkm000 Just a hunch: can you provide the number of arcs as reported by fstinfo? To clarify, arc_idx > 40 million seems reasonable to me. An out-degree of ~3 would imply 45 million arcs, which seems fine.

galv commented 3 years ago

Some thoughts from an initial glance:

The cuda context could be the culprit. A possibility is that something isn't being destroyed properly in our destructors, so it is retaining something that it still thinks is valid.

An interesting thing you could do to check that is to surround your creation and destruction of BatchedThreadedNnet3CudaOnlinePipeline (if I understand correctly, you are creating and destroying it three times in your reproducer):

{
  // Sketch: verify that creating and destroying the pipeline leaves the
  // device's primary context as the current one (CUDA_DRIVER_SAFE_CALL stands
  // for checking the returned CUresult).
  CUcontext ctx;
  CUDA_DRIVER_SAFE_CALL(cuDevicePrimaryCtxRetain(&ctx, CuDevice::device_id_)); // annoyingly, device_id_ is private
  BatchedThreadedNnet3CudaOnlinePipeline pipeline(...);
  CUcontext popped_ctx;
  CUDA_DRIVER_SAFE_CALL(cuCtxPopCurrent(&popped_ctx));
  assert(ctx == popped_ctx);
}

However, I'm not 100% certain about that (explicit management of cuda contexts is fairly uncommon, so I don't know and it's too late at night for me to want to test this). An easier way is to run ltrace as I do here:

https://github.com/kaldi-asr/kaldi/issues/4501#issuecomment-832117285

If you can upload the output of ltrace here, that would be great. Basically, we're trying to see if any more CUDA API calls have non-zero return values. ltrace isn't the only way to do this (I believe some NVIDIA tools can also do it), but it should work so long as you are linking to the shared library versions of the CUDA libraries rather than the static ones.

galv commented 3 years ago

One last thing. Try setting BatchedThreadedNnet3CudaOnlinePipelineConfig::num_decoder_copy_threads to 0 to verify that you cannot reproduce the issues when this is the case. If I am reading your messages correctly, you have already confirmed that, but I am not sure.
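Something along these lines (writing this from memory, so double-check the field spelling against batched-threaded-nnet3-cuda-online-pipeline.h; the point is only to run the repro with no dedicated H2H copy threads):

    #include "cudadecoder/batched-threaded-nnet3-cuda-online-pipeline.h"

    void ConfigureRepro(
        kaldi::cuda_decoder::BatchedThreadedNnet3CudaOnlinePipelineConfig *config) {
      // With 0 copy threads, the H2H copies should run on the thread driving
      // the decoder (as I read cuda-decoder.cc), which removes one class of
      // cross-thread interaction from the picture.
      config->num_decoder_copy_threads = 0;
    }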

kkm000 commented 3 years ago

@galv, thanks a lot, I'll do all these tests. To clarify, I did not ever set BatchedThreadedNnet3CudaOnlinePipelineConfig::num_decoder_copy_threads to 0. I've always used 1. Easy to test.

The arc values are OK; I got an fstinfo report on the file, and there are about 52M arcs. I just did not expect it to be so large. So the crazy offset that pops out of nowhere on exactly the third decode is actually the only problem.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.