madsbk opened 1 year ago
C++ support for Batch IO was done in PR (https://github.com/rapidsai/kvikio/pull/220), right? Or is this about Python support?
> C++ support for Batch IO was done in PR (#220), right? Or is this about Python support?

Yes, updated the issue.
Is there any timeline for batch IO support in Python?
> Is there any timeline for batch IO support in Python?
No, not at the moment but we could prioritize it. Do you have a particular use case in mind?
> Do you have a particular use case in mind?
We are working on a tool for high energy particle physicists that reads data directly into GPU memory for later use in an analysis. Our data is stored row-wise, so the bytes for any column are divided into many small baskets that are spread throughout the length of the file. To get a column of data out of the file into an array, we perform many small `CuFile.pread()` calls (~300 reads of 10^5 bytes) at different offsets. With the CuFile API calls being FIFO, it seems that a batch API call would be the performant way to launch these reads.
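The access pattern described above can be sketched on the host with plain POSIX `pread`; with KvikIO, each read would instead land in a GPU buffer and return a future. The basket layout and sizes below are illustrative, not the real file format:

```python
import os, tempfile

# Toy "row-wise" file: 300 baskets of 1000 bytes each, so one column's
# bytes are scattered across many offsets (layout is illustrative).
basket_size, n_baskets = 1000, 300
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(basket_size * n_baskets))
    path = f.name

# Many small reads at different offsets -- the pattern described above.
# With KvikIO each call would instead be roughly:
#   future = cufile_handle.pread(device_buffer, basket_size, offset)
# returning one future per read; here plain POSIX pread stands in.
fd = os.open(path, os.O_RDONLY)
column = bytearray()
for i in range(n_baskets):
    column += os.pread(fd, basket_size, i * basket_size)
os.close(fd)
os.unlink(path)
print(len(column))  # total bytes reassembled for the column
```

Each small read pays its own syscall/submission overhead, which is what a batch API would amortize.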
Yes, sounds like the batch API could be useful.

Currently with `CuFile.pread()`, are you using the thread pool by setting `KVIKIO_NTHREADS` or by calling `kvikio.defaults.num_threads_reset()`?
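A minimal sketch of the two configuration routes just mentioned (assuming the environment variable has to be set before `kvikio` is first imported, since defaults are read from the environment at import time; `kvikio.defaults.num_threads_reset` is assumed from the KvikIO Python API):

```python
import os

# Route 1: environment variable, set before kvikio is first imported.
os.environ["KVIKIO_NTHREADS"] = "16"

# Route 2: resize the pool at runtime (requires kvikio to be installed):
#   import kvikio.defaults
#   kvikio.defaults.num_threads_reset(16)

print(os.environ["KVIKIO_NTHREADS"])
```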
I have tried adjusting the thread pool size. Performance decreases when increasing the thread pool beyond the default size when doing many `CuFile.pread()` calls. Setting `KVIKIO_GDS_THRESHOLD=4096` so that all my calls should go to the thread pool does not affect performance. This may have something to do with running in compatibility mode (`KVIKIO_COMPAT_MODE=True`), but setting this to `False` gives me worse performance. The docs mention that `kvikio` and `cuFile` have separate compatibility mode settings, but I don't see how to check whether `cuFile` is in compatibility mode or not. I installed `libcufile` before installing `kvikio`, but `kvikio` uses `KVIKIO_COMPAT_MODE=True` by default.
Does the thread pool have a 1:1 correspondence with the number of CUDA threads that kvikio will use? In some of my checks, read times scaled more weakly than I would have expected with the number of reader threads when `KVIKIO_COMPAT_MODE=False`. When `KVIKIO_COMPAT_MODE=True`, I get better performance, and there is still some scaling with respect to the size of the thread pool.

In the tests below I am working on a server with a 20GB slice of an 80GB A100.
`KVIKIO_COMPAT_MODE=True` means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.

`KVIKIO_NTHREADS` isn't related to CUDA threads. It is the maximum number of POSIX threads that KvikIO will use concurrently to call POSIX read/write.

Could you try with fewer threads, maybe `KVIKIO_NTHREADS=8` or `KVIKIO_NTHREADS=16`?

PS: I am away all of next week, so I might not be able to reply until the week after.
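As a rough host-side analogue of that suggestion, one can sweep the pool size over the same many-small-reads pattern. Plain POSIX IO stands in for KvikIO's compat-mode thread pool here; sizes are made up and any timings are illustrative only:

```python
import os, tempfile, time
from concurrent.futures import ThreadPoolExecutor

# Host-side stand-in for KvikIO's compat-mode pool: N POSIX threads,
# each issuing pread() at its own offset. Sweeping N shows how per-read
# overhead can dominate when the reads are only ~1 KB.
chunk, n_reads = 1000, 300
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(chunk * n_reads))
    path = f.name
fd = os.open(path, os.O_RDONLY)

def read_at(i):
    return os.pread(fd, chunk, i * chunk)  # thread-safe: no shared file offset

for nthreads in (1, 8, 16):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = list(pool.map(read_at, range(n_reads)))
    elapsed = time.perf_counter() - t0
    print(f"{nthreads:2d} threads: {elapsed * 1e3:7.2f} ms")

os.close(fd)
os.unlink(path)
```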
> `KVIKIO_COMPAT_MODE=True` means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.
For our environments, we start from a base conda image that has some version of CUDA installed (currently 12.2). The default setting for this value in my environment is `True`, implying `libcufile` may not have been found. When running the C++ examples, I find it is unable to find `libcufile` and even some other dependencies like `bs_thread_pool`. I set my `PATH` to contain `/home/fstrug/.conda/envs/kvikio-env/bin` and `LD_LIBRARY_PATH` to contain `/home/fstrug/.conda/envs/kvikio-env/lib`. I can see that `libcufile` is installed within the conda environment (`/home/fstrug/.conda/envs/img_cuda12.2-kvikio/lib/libcufile.so`), so it isn't clear to me why these aren't being picked up when building kvikio. Is there a way to explicitly check whether the Python module is unable to find cuFile as well? I don't see any way in the docs.
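One runtime signal is whether KvikIO ended up in compatibility mode, since that is the fallback when `libcufile` is not found or usable. A guarded sketch; the accessor `kvikio.defaults.compat_mode()` is assumed from the KvikIO Python API and worth verifying against the installed version's docs:

```python
# Sketch: the Python-visible hint that libcufile wasn't found is
# KvikIO falling back to compatibility mode.
try:
    import kvikio.defaults
    status = kvikio.defaults.compat_mode()  # True => POSIX fallback in use
except Exception as exc:                    # kvikio absent or unusable here
    status = f"kvikio unavailable: {exc}"
print("compat mode:", status)
```

Note this only reports KvikIO's own mode; it does not distinguish why `libcufile` was rejected.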
> Could you try with fewer threads, maybe `KVIKIO_NTHREADS=8` or `KVIKIO_NTHREADS=16`?
Even with these values, I am still not seeing performance improvements. Often the `pread` is only reading 10^3 bytes at a time, which might be a barrier to seeing performance increases with more threads.
Yes, reading 1k chunks is very small. How many of the columns do you need? It might be better to read big chunks of the columns and transpose in memory, even if it means you have to read some unneeded columns.
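The big-chunk-then-slice idea can be sketched on the host with NumPy; the row/column counts, dtype, and the requested column subset below are made up for illustration:

```python
import numpy as np

# Read one large contiguous block covering many rows, then slice out the
# few wanted columns in memory, instead of issuing ~1 KB preads per basket.
n_rows, n_cols = 1000, 50                 # hypothetical row-wise layout
raw = np.arange(n_rows * n_cols, dtype=np.float32).tobytes()  # stands in for one big read

table = np.frombuffer(raw, dtype=np.float32).reshape(n_rows, n_cols)
wanted = [3, 17, 42]                      # small subset of columns requested
columns = np.ascontiguousarray(table[:, wanted])  # "transpose in memory"
print(columns.shape)  # (1000, 3)
```

The trade-off is reading unneeded columns off disk in exchange for one large sequential read, which is often faster than many scattered 1 KB reads.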
There can be ~1000s of columns, and users usually need to read only a small subset of these (~5). Designing an algorithm to optimize the reads based on the columns requested is something we've considered, but there may be a better path forward for us with batch functionality. If `cuFile` performs the reads with GPU threads when not in compatibility mode, I presume that for some large contiguous read it is already being 'batched' across threads during execution?
Meta Issue to track support of new cuFile features.