rapidsai / kvikio

KvikIO - High Performance File IO
https://docs.rapids.ai/api/kvikio/stable/
Apache License 2.0

Deserialize bytes to array on the GPU directly #436

Open · goelayu opened 1 month ago

goelayu commented 1 month ago

I am using kvikio to read an array stored in a file on disk directly onto the GPU. On the GPU, I want to deserialize the file content into an array.

import cupy as cp
import kvikio
import os

size = os.path.getsize(file_path)
with kvikio.CuFile(file_path, "r") as f:
    tensor = cp.empty(size // 4, dtype=cp.float32)  # assuming the file holds float32 (4-byte) values
    f.read(tensor)

My understanding is that the deserialization should happen on the GPU itself, i.e., no host CPU is involved. However, when I profile the above code using nsys, I don't see any GPU activity corresponding to the deserialization. Also, looking at the CPU utilization of my code, it seems that the CPU is doing the deserialization work. Why is this the case?

jakirkham commented 1 month ago

For this use case, it might be worth looking at KvikIO's fromfile & tofile API

There are some examples at the top of this PR: https://github.com/rapidsai/kvikio/pull/135
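For reference, a minimal sketch of what a read with that API might look like (the file name and dtype are placeholders, and the like= dispatch argument is my assumption based on the PR; the PR and docs are authoritative for the exact signature):

import cupy as cp
import kvikio.numpy

# Read a binary file of float32 values directly into a CuPy array; `like`
# selects the array type to produce. A matching tofile exists for writing.
arr = kvikio.numpy.fromfile("data.bin", dtype="float32", like=cp.empty(()))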

Admittedly this doesn't answer the question as to why things are not working in your case, but maybe it provides a path forward

goelayu commented 1 month ago

Thanks for the pointer. My objective is to understand the differences and compare the performance of cupy.fromfile and kvikio.numpy.fromfile.

I believe the cupy API 1) reads the file and deserializes the bytes to an array, on the host itself (using numpy.fromfile) and 2) copies the array to the GPU using cudamemcpy. The kvikio API, on the other hand, 1) first copies the data to the GPU and then 2) deserializes the data into an array, on the GPU itself.
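Roughly, the two calls I am comparing look like the following (a simplified sketch with a placeholder file name and dtype, not my actual benchmark code):

import cupy as cp
import kvikio.numpy

# cupy path: numpy.fromfile reads and deserializes on the host, then the
# result is copied to the GPU.
a = cp.fromfile("data.bin", dtype="float32")

# kvikio path: the bytes are expected to land in device memory without a
# host-side deserialization step.
b = kvikio.numpy.fromfile("data.bin", dtype="float32", like=cp.empty(()))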

However, when profiling the code using both nsys for GPU computations and cProfile for host computations, it seems to me that the deserialization is happening on the host in both cases.

Profiler output for cupy.fromfile

 tottime  percall  cumtime  percall filename:lineno(function)
 2.672    2.672    2.672    2.672 {built-in method numpy.fromfile}

The numpy.fromfile method takes care of the deserialization and accounts for most of the runtime.

Profiler output for kvikio.numpy.fromfile

tottime  percall  cumtime  percall filename:lineno(function)
 0.000    0.000    3.376    3.376 /lib/python3.10/site-packages/kvikio/cufile.py:206(read)
 3.371    3.371    3.371    3.371 /lib/python3.10/site-packages/kvikio/cufile.py:44(get)
 1.707    1.707    1.707    1.707 /lib/python3.10/site-packages/kvikio/cufile.py:70(__init__)

It looks like most of the time is spent inside the kvikio CuFile methods, implying that the deserialization is being performed on the host (also note that nsys shows no extra GPU computations, which leads to the same conclusion).

To summarize, my questions are: 1) Is my understanding of the semantic differences between cupy.fromfile and kvikio.numpy.fromfile accurate? 2) If so, why am I not seeing the deserialization being offloaded to the GPU?

jakirkham commented 1 month ago

Could you please share the code used in the second case? It is hard to comment on what is happening there without knowing what was done

madsbk commented 1 month ago

When reading a binary file, cupy.fromfile doesn't do any computation unless it has to convert the data to little-endian, so the deserialization is essentially free.

If GDS isn't available, cupy.fromfile and kvikio.numpy.fromfile do exactly the same thing: they first read from disk into a host bounce buffer and then copy to the device.

Now, if GDS is available and the data is larger than KVIKIO_GDS_THRESHOLD (default: 1 MiB), KvikIO will not use a bounce buffer; instead, it uses GDS to write directly to device memory (skipping the CPU).

However, even when GDS isn't available, kvikio.numpy.fromfile typically outperforms cupy.fromfile when using multiple threads. Try setting the environment variable KVIKIO_NTHREADS.
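For example (a sketch; the thread count is an arbitrary choice and the script name is a placeholder):

# Set the thread count in the shell:
#   KVIKIO_NTHREADS=8 python read_benchmark.py
# or from Python, before kvikio is imported (KvikIO reads the environment
# variables when it initializes its defaults):
import os
os.environ["KVIKIO_NTHREADS"] = "8"

import kvikio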

jakirkham commented 1 month ago

Thanks Mads! 🙏

Do we document somewhere how to check whether KvikIO is able to use GDS? I think this might be a useful diagnostic test for Akshay (and future users) to run through to confirm they have a working configuration.
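As a starting point, something along these lines might work as a quick check (the attribute and function names here are from memory rather than a documented recipe, so treat them as assumptions and defer to the API docs):

import kvikio
import kvikio.defaults

# Query the cuFile driver; is_gds_available should be False when KvikIO is
# falling back to its POSIX (bounce-buffer) compatibility mode.
props = kvikio.DriverProperties()
print("GDS available:", props.is_gds_available)
print("Compat mode enabled:", kvikio.defaults.compat_mode())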