charleskawczynski opened this issue 1 year ago
I just ran a job that uses CUDA, and I'm seeing this kind of incorrect allocation width everywhere:
@maleadt, any idea what's going on here with allocations?
Maybe I'm way off--I have no idea what module "execution.jl" is in, let alone the package.
That said, I snooped around, and it looks like cufunction will always allocate, because the dynamic input object (tt) gets put into the type space as kernel = HostKernel{F,tt}(f, fun, state). But this seems like it would always incur allocations?
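To illustrate the concern (a minimal stand-in, not CUDA.jl's actual code): when a type parameter like tt is computed from runtime values, the concrete wrapper type is not statically inferable, so constructing it can allocate on the host.

```julia
# Hypothetical sketch of the HostKernel{F,tt} pattern.
struct Kernel{F,TT}
    f::F
end

# `tt` is built from the runtime argument types, so the concrete
# Kernel{F,TT} type depends on values only known at runtime.
function make_kernel(f, args...)
    tt = Tuple{map(typeof, args)...}
    return Kernel{typeof(f),tt}(f)
end

k = make_kernel(sin, 1.0, 2)
@show typeof(k)   # Kernel{typeof(sin), Tuple{Float64, Int64}}
```

Whether this actually allocates in CUDA.jl depends on whether the compiler can elide the wrapper, which is the question being raised here.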
I wouldn't know how CUDA.jl can cause this, but I'm not familiar with the inner workings of this profiler.
I assume it’s a bug in the profiler or ProfileCanvas.jl.
Sorry I was unclear @maleadt, I was asking if you have any idea why CUDA.jl is allocating in some of these places (e.g. HostKernel), regardless of the alignment bug.
If arguments need to be converted (e.g. CuArray->CuDeviceArray), that requires allocations of objects. Furthermore, arguments are stored in heap boxes for the CUDA driver to read. So launching kernels without CPU allocations is not possible, or very hard to achieve (and also not needed, as allocations of small objects like that are very fast, especially compared to the time it takes to actually launch a kernel). HostKernel object allocations should be able to get elided, though.
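One way to check this empirically (a sketch assuming a working GPU and CUDA.jl; the kernel and sizes here are made up) is to compile the kernel once with launch=false and then measure host-side allocations of the launch itself:

```julia
using CUDA

# Trivial kernel used only to exercise the launch path.
kernel_fn(x) = (i = threadIdx().x; x[i] += 1f0; return)

x = CuArray(ones(Float32, 32))
k = @cuda launch=false kernel_fn(x)   # compile once; subsequent launches reuse it
CUDA.@sync k(x; threads=32)           # warm up

# Host allocations attributable to a single launch:
allocs = @allocated k(x; threads=32)
@show allocs
```

Per the comment above, a small nonzero number here is expected (argument conversion and boxing for the driver), and only a large or growing number would be worth an issue on CUDA.jl.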
If any of this turns out to be an actual performance problem, please file issues on CUDA.jl.
Uploading the result of Serialization.serialize(Profile.fetch()) should be enough for me to look into this, if you can't figure out a way for me to run a reproducer.
Ok, since the job requires GPU resources, I think I can try to upload Serialization.serialize(Profile.fetch()).
I just saw a flame graph where the width of the flames increase, and I don't think that's supposed to happen:
I looked at _bcs1 and it also claims 144 bytes. I don't think this is a public build, but I'm going to leave it here for convenience so that I can help answer questions about any other flames: https://buildkite.com/clima/climaatmos-ci/builds/10766#0188f5d4-9d79-4469-a5ed-e39286c5f014
Unfortunately, I don't really have a reproducer. More breadcrumbs for myself: this was found by increasing the sample rate in the ClimaAtmos callbacks flame graph.