pfitzseb / ProfileCanvas.jl

MIT License

Sometimes wrong width in allocation size #31

Open charleskawczynski opened 1 year ago

charleskawczynski commented 1 year ago

I just saw a flame graph where the width of the flames increases, and I don't think that's supposed to happen:

[Screenshot (2023-06-26): flame graph with flame widths increasing down the stack]

I looked at _bcs1 and it also claims 144 bytes. I don't think this is a public build, but I'm leaving the link here for convenience so that I can answer questions about any other flames: https://buildkite.com/clima/climaatmos-ci/builds/10766#0188f5d4-9d79-4469-a5ed-e39286c5f014

Unfortunately, I don't really have a reproducer. More breadcrumbs for myself: this was found by increasing the sample rate in the ClimaAtmos callbacks flame graph.

charleskawczynski commented 7 months ago

I just ran a job that uses CUDA, and I'm seeing this kind of incorrect allocation width everywhere:

[Screenshot (2024-01-26): flame graph from a CUDA run showing incorrect allocation widths throughout]

@maleadt, any idea what's going on here with allocations?

charleskawczynski commented 7 months ago

Maybe I'm way off; I have no idea which module "execution.jl" is in, let alone which package.

That said, I snooped around, and it looks like cufunction will always allocate, because the dynamic input type object (tt) gets lifted into the type domain as kernel = HostKernel{F,tt}(f, fun, state). Wouldn't that always incur allocations?
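As a hypothetical, CPU-only sketch (these names are illustrative, not CUDA.jl's actual internals): when a type parameter like tt is only known at runtime, the constructor's return type can't be inferred, so the resulting object is typically heap-allocated:

```julia
# Illustrative struct mirroring the shape of HostKernel{F,tt};
# not the real CUDA.jl definition.
struct HostKernelSketch{F,TT}
    f::F
    id::Int
end

# `tt` arrives as a runtime value, so `HostKernelSketch{typeof(f),tt}`
# is computed dynamically and the result is boxed.
make_kernel(f, tt::Type) = HostKernelSketch{typeof(f),tt}(f, 0)

make_kernel(sin, Tuple{Float64})             # warm up / compile
@allocated make_kernel(sin, Tuple{Float64})  # typically nonzero
```

If the type parameter were statically known at the call site, the allocation could be elided by the compiler.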

maleadt commented 7 months ago

I wouldn't know how CUDA.jl could cause this, but I'm not familiar with the inner workings of this profiler.

charleskawczynski commented 7 months ago

I assume it’s a bug in the profiler or ProfileCanvas.jl.

Sorry, I was unclear, @maleadt: I was asking whether you have any idea why CUDA.jl is allocating in some of these places (e.g. HostKernel), regardless of the width bug.

maleadt commented 7 months ago

If arguments need to be converted (e.g. CuArray->CuDeviceArray), that requires allocating objects. Furthermore, arguments are stored in heap boxes for the CUDA driver to read. So launching kernels without CPU allocations is impossible, or at least very hard to achieve (and also not needed, as allocations of small objects like that are very fast, especially compared to the time it takes to actually launch a kernel). The HostKernel object allocation should be elidable, though.

If any of this turns out to be an actual performance problem, please file issues on CUDA.jl.

pfitzseb commented 7 months ago

Uploading the result of Serialization.serialize(Profile.fetch()) should be enough for me to look into this, if you can't figure out a way for me to run a reproducer.
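For reference, a minimal sketch of producing such a file with only stdlib pieces (the workload and the file name "profile.bin" are placeholders, not anything from ClimaAtmos):

```julia
using Profile, Serialization

# Profile some workload; substitute the real job here.
Profile.clear()
@profile for _ in 1:10^7
    rand()
end

# Dump the raw sample buffer so it can be attached to the issue.
open("profile.bin", "w") do io
    serialize(io, Profile.fetch())
end
```

The uploaded file can then be read back with Serialization.deserialize on the other end.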

charleskawczynski commented 7 months ago

OK. Since the job requires GPU resources, I'll try to upload the result of Serialization.serialize(Profile.fetch()).