m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

[BUG]: cublasIsamax_v2 #1269

Closed Ruberik closed 1 week ago

Ruberik commented 2 months ago

Describe the bug

I get the following crash when I attempt to use Amax(ArrayView1D<float, Stride1D.General> input, ArrayView<int> output):

Fatal error. System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
Repeat 2 times:
--------------------------------
   at ILGPU.Runtime.Cuda.API.CuBlasAPI_Windows_V11.cublasIsamax_v2(IntPtr, Int32, Void*, Int32, Void*)
--------------------------------
   at ILGPU.Runtime.Cuda.API.CuBlasAPI_Windows_V11.Isamax_v2(IntPtr, Int32, Void*, Int32, Void*)
   at ILGPU.Runtime.Cuda.CuBlas`1[[ILGPU.Runtime.Cuda.CuBlasPointerModeHandlers+ManualMode, ILGPU.Algorithms, Version=1.5.1.0, Culture=neutral, PublicKeyToken=null]].Amax(ILGPU.Runtime.ArrayView1D`2<Single,General>, ILGPU.ArrayView`1<Int32>)
   at Gotham.FurrowUtilities.BSGpu.ComputeTargetSpotStrikeRatio(Config, PriceBucketConfig, ILGPU.Runtime.MemoryBuffer1D`2<Double,Dense>)
   at Gotham.FurrowUtilities.BSGpu.ValueCallOption(Config, Int32, Boolean)
   at Gotham.FurrowUtilities.Reloading.Main(System.String[])

I suspect the issue may be that my ArrayView is allocated GPU-side, and that perhaps isn't allowed by Cublas.

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.65       Driver Version: 511.65       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            TCC  | 00000000:3B:00.0 Off |                  Off |
| N/A   37C    P8     9W /  70W |      1MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Environment

Steps to reproduce

var generalTemp = Accelerator.Allocate1D<float, Stride1D.General>(1024, new Stride1D.General());
var generalTarget = Accelerator.Allocate1D<int, Stride1D.General>(1, new Stride1D.General());
Blas.Amax(generalTemp, generalTarget.AsArrayView<int>(0, 1));

Expected behavior

An int is placed into generalTarget.

Instead there's a crash, as described above.

Additional context

No response

MoFtZ commented 1 week ago

hi @Ruberik, sorry for the late reply.

There are several Amax overloads on the CuBlas class for float.

  1. int Amax(ArrayView1D<float, Stride1D.General> input)
  2. void Amax(ArrayView1D<float, Stride1D.General> input, ArrayView<int> output)

You are trying to use the second version, which lets you fill an ArrayView.

According to the CuBlas documentation for Amax, the output can either be host or device memory.

The important part is that you need to tell CuBlas whether the result buffer is using host or device memory. This is achieved using the CuBlas Pointer Mode.

I was able to reproduce your issue, and was then able to fix it by setting blas.PointerMode = CuBlasPointerMode.Device; just before the call to Amax.

using var input = accelerator.Allocate1D<float>(DataSize);
using var output = accelerator.Allocate1D<int>(1);

using var blas = new CuBlas(accelerator);
blas.PointerMode = CuBlasPointerMode.Device;
blas.Amax(input.View.AsGeneral(), output.View);

Ruberik commented 1 week ago

Thanks for the explanation, @MoFtZ! It's unfortunate that this has to be set by hand, but that's a lot better than a crash I can't do anything about. Thanks!

MoFtZ commented 1 week ago

@Ruberik if all your result buffers will be on the GPU, you just have to set it once, at the start.

There is also the other overload, which returns the int result directly without needing an output buffer.
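
A minimal sketch of that variant, assuming a CUDA device is available (the context/accelerator setup lines are boilerplate not shown in this thread). With the default CuBlasPointerMode.Host, the result is copied back and returned to the CPU, so no device-side output view is needed. Note that cublasIsamax reports a 1-based index:

```csharp
using ILGPU;
using ILGPU.Runtime.Cuda;

using var context = Context.Create(builder => builder.Cuda().EnableAlgorithms());
using var accelerator = context.CreateCudaAccelerator(0);

using var input = accelerator.Allocate1D<float>(1024);
// ... fill input with data ...

using var blas = new CuBlas(accelerator);
// Default pointer mode is Host, so the value-returning overload works as-is.
int index = blas.Amax(input.View.AsGeneral()); // 1-based index of the max |value|
```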