madsbk opened 1 year ago
C++ support for Batch IO was done in PR (https://github.com/rapidsai/kvikio/pull/220), right? Or is this about Python support?
> C++ support for Batch IO was done in PR (#220), right? Or is this about Python support?

Yes, updated the issue.
Is there any timeline for batch IO support in Python?
> Is there any timeline for batch IO support in Python?
No, not at the moment but we could prioritize it. Do you have a particular use case in mind?
> Do you have a particular use case in mind?
We are working on a tool for high energy particle physicists that reads data directly into GPU memory for later use in an analysis. Our data is stored row-wise, so the bytes for any column are divided into many small baskets that are spread throughout the length of the file. To get a column of data out of the file into an array, we perform many small `CuFile.pread()` calls (~300 reads of 10^5 bytes) at different offsets. With the CuFile API calls being FIFO, it seems that a batch API call would be the performant way to launch these reads.
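The access pattern described above can be sketched on the host with plain POSIX `pread`; with KvikIO, each read would instead land in a GPU buffer and return a future. The basket layout and sizes below are illustrative, not the real file format:

```python
import os, tempfile

# Toy "row-wise" file: 300 baskets of 1000 bytes each, so one column's
# bytes are scattered across many offsets (layout is illustrative).
basket_size, n_baskets = 1000, 300
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(basket_size * n_baskets))
    path = f.name

# Many small reads at different offsets -- the pattern described above.
# With KvikIO each call would instead be roughly:
#   future = cufile_handle.pread(device_buffer, basket_size, offset)
# returning one future per read; here plain POSIX pread stands in.
fd = os.open(path, os.O_RDONLY)
column = bytearray()
for i in range(n_baskets):
    column += os.pread(fd, basket_size, i * basket_size)
os.close(fd)
os.unlink(path)
print(len(column))  # total bytes reassembled for the column
```

Each small read pays its own syscall/submission overhead, which is what a batch API would amortize.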
Yes, sounds like the batch API could be useful.

Currently with `CuFile.pread()`, are you using the thread pool by setting `KVIKIO_NTHREADS` or by calling `kvikio.defaults.num_threads_reset()`?
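A minimal sketch of the two configuration routes just mentioned (assuming the environment variable has to be set before `kvikio` is first imported, since defaults are read from the environment at import time; `kvikio.defaults.num_threads_reset` is assumed from the KvikIO Python API):

```python
import os

# Route 1: environment variable, set before kvikio is first imported.
os.environ["KVIKIO_NTHREADS"] = "16"

# Route 2: resize the pool at runtime (requires kvikio to be installed):
#   import kvikio.defaults
#   kvikio.defaults.num_threads_reset(16)

print(os.environ["KVIKIO_NTHREADS"])
```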
I have tried adjusting the thread pool size. Performance decreases when increasing the thread pool beyond the default size when doing many `CuFile.pread()` calls. Setting `KVIKIO_GDS_THRESHOLD=4096` so that all my calls should go to the thread pool does not affect performance. This may have something to do with running in compatibility mode (`KVIKIO_COMPAT_MODE=True`), but setting this to `False` gives me worse performance. The docs mention that `kvikio` and `cuFile` have separate compatibility mode settings, but I don't see how to check whether `cuFile` is in compatibility mode or not. I installed `libcufile` before installing `kvikio`, but `kvikio` uses `KVIKIO_COMPAT_MODE=True` by default.
Does the thread pool have a 1:1 correspondence with the number of CUDA threads that kvikio will use? In some of my checks, read times scaled more weakly than I would have expected with the number of reader threads when `KVIKIO_COMPAT_MODE=False`. When `KVIKIO_COMPAT_MODE=True`, I get better performance, and there is still some scaling with respect to the size of the thread pool.

In the tests below I am working on a server with a 20GB slice of an 80GB A100.
`KVIKIO_COMPAT_MODE=True` means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.

`KVIKIO_NTHREADS` isn't related to CUDA threads. It is the maximum number of POSIX threads that KvikIO will use concurrently to call POSIX read/write.

Could you try with fewer threads, maybe `KVIKIO_NTHREADS=8` or `KVIKIO_NTHREADS=16`?

PS: I am away all of next week, so I might not be able to reply until the week after.
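As a rough host-side analogue of that suggestion, one can sweep the pool size over the same many-small-reads pattern. Plain POSIX IO stands in for KvikIO's compat-mode thread pool here; sizes are made up and any timings are illustrative only:

```python
import os, tempfile, time
from concurrent.futures import ThreadPoolExecutor

# Host-side stand-in for KvikIO's compat-mode pool: N POSIX threads,
# each issuing pread() at its own offset. Sweeping N shows how per-read
# overhead can dominate when the reads are only ~1 KB.
chunk, n_reads = 1000, 300
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(chunk * n_reads))
    path = f.name
fd = os.open(path, os.O_RDONLY)

def read_at(i):
    return os.pread(fd, chunk, i * chunk)  # thread-safe: no shared file offset

for nthreads in (1, 8, 16):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = list(pool.map(read_at, range(n_reads)))
    elapsed = time.perf_counter() - t0
    print(f"{nthreads:2d} threads: {elapsed * 1e3:7.2f} ms")

os.close(fd)
os.unlink(path)
```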
> `KVIKIO_COMPAT_MODE=True` means that KvikIO is doing all the work using a thread pool and regular POSIX IO, which, I think, is the fastest setting in your case.
For our environments, we start from a base conda image that has some version of CUDA installed (currently 12.2). The default setting for this value in my environment is `True`, implying `libcufile` may not have been found. When running the C++ examples, I find it is unable to find `libcufile` and even some other dependencies like `bs_thread_pool`. I set my `PATH` to contain `/home/fstrug/.conda/envs/kvikio-env/bin` and `LD_LIBRARY_PATH` to contain `/home/fstrug/.conda/envs/kvikio-env/lib`. I can see that `libcufile` is installed within the conda environment (`/home/fstrug/.conda/envs/img_cuda12.2-kvikio/lib/libcufile.so`), so it isn't clear to me why these aren't being picked up when building kvikio. Is there a way to explicitly check whether the Python module is unable to find cuFile as well? I don't see any way in the docs.
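One runtime signal is whether KvikIO ended up in compatibility mode, since that is the fallback when `libcufile` is not found or usable. A guarded sketch; the accessor `kvikio.defaults.compat_mode()` is assumed from the KvikIO Python API and worth verifying against the installed version's docs:

```python
# Sketch: the Python-visible hint that libcufile wasn't found is
# KvikIO falling back to compatibility mode.
try:
    import kvikio.defaults
    status = kvikio.defaults.compat_mode()  # True => POSIX fallback in use
except Exception as exc:                    # kvikio absent or unusable here
    status = f"kvikio unavailable: {exc}"
print("compat mode:", status)
```

Note this only reports KvikIO's own mode; it does not distinguish why `libcufile` was rejected.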
> Could you try with fewer threads, maybe `KVIKIO_NTHREADS=8` or `KVIKIO_NTHREADS=16`?
Even with these values, I am still not seeing performance improvements. Often the `pread` is only reading 10^3 bytes at a time, which might be a barrier to seeing performance increases with more threads.
Yes, reading 1k chunks is very small. How many of the columns do you need? It might be better to read big chunks of the columns and transpose in memory, even if it means you have to read some unneeded columns.
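The big-chunk-then-slice idea can be sketched on the host with NumPy; the row/column counts, dtype, and the requested column subset below are made up for illustration:

```python
import numpy as np

# Read one large contiguous block covering many rows, then slice out the
# few wanted columns in memory, instead of issuing ~1 KB preads per basket.
n_rows, n_cols = 1000, 50                 # hypothetical row-wise layout
raw = np.arange(n_rows * n_cols, dtype=np.float32).tobytes()  # stands in for one big read

table = np.frombuffer(raw, dtype=np.float32).reshape(n_rows, n_cols)
wanted = [3, 17, 42]                      # small subset of columns requested
columns = np.ascontiguousarray(table[:, wanted])  # "transpose in memory"
print(columns.shape)  # (1000, 3)
```

The trade-off is reading unneeded columns off disk in exchange for one large sequential read, which is often faster than many scattered 1 KB reads.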
There can be ~1000s of columns, and users usually need to read only a small subset of these (~5). Designing an algorithm to optimize the reads based on the columns requested is something we've considered, but there may be a better path forward for us with batch functionality. If `cuFile` performs the reads with GPU threads when not in compatibility mode, I presume that for some large contiguous read it is already being 'batched' across threads during execution?
Meta Issue to track support of new cuFile features.