goelayu opened this issue 1 month ago
For this use case, it might be worth looking at KvikIO's `fromfile` & `tofile` API. There are some examples at the top of this PR: https://github.com/rapidsai/kvikio/pull/135

Admittedly this doesn't answer the question as to why things are not working in your case, but maybe it provides a path forward.
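For instance, a rough sketch of a round trip with that API (the exact signatures, in particular the `like=` argument and the module-level `tofile` call, are assumptions based on that PR, not verified here):

```python
import cupy
import kvikio.numpy

# Illustrative only: file path, dtype, and size are placeholders.
a = cupy.arange(1_000_000, dtype=cupy.float32)

# Assumed signature: write the device array to disk via KvikIO (GDS when available).
kvikio.numpy.tofile(a, "/tmp/data.bin")

# Assumed NumPy-like signature; `like=` selects a CuPy (device) output array.
b = kvikio.numpy.fromfile("/tmp/data.bin", dtype=cupy.float32, like=a)

assert bool((a == b).all())
```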
Thanks for the pointer. My objective is to understand the differences and compare the performance of `cupy.fromfile` and `kvikio.numpy.fromfile`.
I believe the `cupy` API 1) reads the file and deserializes the bytes into an array on the host itself (using `numpy.fromfile`), and 2) copies the array to the GPU using `cudaMemcpy`.

The `kvikio` API, on the other hand, 1) first copies the data to the GPU and then 2) deserializes the data into an array on the GPU itself.
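For concreteness, a minimal sketch of the two calls being compared (illustrative path and dtype; the `like=` argument for `kvikio.numpy.fromfile` is assumed from the PR linked above, not my actual benchmark code):

```python
import cupy
import kvikio.numpy

# Case 1: numpy.fromfile deserializes on the host, then the array is copied to the GPU.
a_cupy = cupy.fromfile("/tmp/data.bin", dtype=cupy.float32)

# Case 2: the file's bytes are read into GPU memory via KvikIO (GDS or a bounce buffer).
a_kvikio = kvikio.numpy.fromfile("/tmp/data.bin", dtype=cupy.float32, like=a_cupy)
```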
However, when profiling the code with both `nsys` (for GPU activity) and `cProfile` (for host activity), it seems to me that the deserialization is happening on the host in both cases.
Profiler output for `cupy.fromfile`:

tottime  percall  cumtime  percall  filename:lineno(function)
  2.672    2.672    2.672    2.672  {built-in method numpy.fromfile}

The `numpy.fromfile` method takes care of the deserialization and accounts for most of the runtime.
Profiler output for `kvikio.numpy.fromfile`:

tottime  percall  cumtime  percall  filename:lineno(function)
  0.000    0.000    3.376    3.376  /lib/python3.10/site-packages/kvikio/cufile.py:206(read)
  3.371    3.371    3.371    3.371  /lib/python3.10/site-packages/kvikio/cufile.py:44(get)
  1.707    1.707    1.707    1.707  /lib/python3.10/site-packages/kvikio/cufile.py:70(__init__)
It looks like most of the time is spent inside the `kvikio.CuFile` methods, implying that the deserialization is being performed on the host (note also that the `nsys` profiler shows no extra GPU activity, which supports the same conclusion).
To summarize, my questions are:
1) Is my understanding of the semantic differences between `cupy.fromfile` and `kvikio.numpy.fromfile` accurate?
2) If so, why am I not seeing the deserialization being offloaded to the GPU?
Could you please share the code used in the second case? It is hard to comment on what is happening there without knowing what was done.
When reading a binary file, `cupy.fromfile` doesn't do any computation unless it has to convert the data to little-endian, so the deserialization is essentially free.

If GDS isn't available, `cupy.fromfile` and `kvikio.numpy.fromfile` do exactly the same thing: they first read from disk into a bounce buffer and then copy to the device.

Now, if GDS is available and the data is larger than `KVIKIO_GDS_THRESHOLD` (1 MiB), KvikIO will not use a bounce buffer but instead use GDS to write directly to device memory (skipping the CPU).

However, even when GDS isn't available, `kvikio.numpy.fromfile` typically outperforms `cupy.fromfile` when using multiple threads. Try setting the environment variable `KVIKIO_NTHREADS`.
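For example (a minimal sketch; the thread count is a placeholder, exporting the variable in the shell before launching Python works just as well, and the `fromfile` signature is the same assumption as above):

```python
import os

# Set before kvikio reads its defaults (typically at first import of kvikio).
os.environ["KVIKIO_NTHREADS"] = "8"   # number of I/O threads KvikIO uses per read/write

import cupy
import kvikio.numpy

a = kvikio.numpy.fromfile("/tmp/data.bin", dtype=cupy.float32, like=cupy.empty(()))
```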
Thanks Mads! 🙏
Do we document somewhere how to check whether KvikIO is able to use GDS? I think this might be a useful diagnostic test for Akshay (and future users) to run through to confirm they have a working configuration.
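For reference, one possible check (assuming the installed version exposes `kvikio.DriverProperties` with an `is_gds_available` flag and `kvikio.defaults.compat_mode()`; names not verified here):

```python
import kvikio
import kvikio.defaults

# Attribute/function names are assumptions; adjust to whatever the installed KvikIO exposes.
props = kvikio.DriverProperties()
print("cuFile driver / GDS available:", props.is_gds_available)
print("KvikIO compatibility (POSIX) mode:", kvikio.defaults.compat_mode())
```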
I am using `kvikio` to read an array stored in a file on disk directly onto the GPU. On the GPU I want to deserialize the file content into an array. My understanding is that the deserialization should happen on the GPU itself, i.e., there is no host CPU involved. However, when I profile the above code using `nsys`, I don't see any activity on the GPU corresponding to the deserialization. Also, when looking at the CPU utilization of my code, it seems that the CPU is doing the work of deserialization. Why is this the case?
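A minimal sketch of the kind of read being described, with hypothetical path, size, and dtype (not the exact code being profiled):

```python
import cupy
import kvikio

# Allocate the destination array on the GPU, then read the file's bytes straight into it.
buf = cupy.empty(25_000_000, dtype=cupy.float32)   # placeholder size/dtype
with kvikio.CuFile("/tmp/data.bin", "r") as f:
    f.read(buf)   # expectation: disk -> GPU, with no host-side deserialization step
```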