rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004

Open lgray opened 1 month ago

lgray commented 1 month ago

Hi! I'm an experimental high energy physicist interested in rapid data analysis. In high energy physics we predominantly use ROOT, which has a pleasant Python implementation here: https://github.com/scikit-hep/uproot5

cuDF's cuFile- and nvcomp-based file I/O is extremely powerful (we recently benchmarked this ourselves). However, I noticed that https://github.com/rapidsai/kvikio implements this I/O and decompression layer more generally for other libraries, and that it is not used in cuDF.

Would there be any performance lost by using kvikio as opposed to a dedicated C++ implementation, given that uproot is nominally a Python-only implementation of the ROOT I/O standard? I.e., are there any major performance reasons why cuDF does not use kvikio?

@martindurant @jpivarski

bdice commented 1 month ago

cuDF does use KvikIO's C++ library libkvikio for GPUDirect Storage, in the libcudf C++ code that underpins cuDF's Python interface. Here are some pointers to the build system and docs referencing KvikIO functionality in cuDF:

If you have any questions about KvikIO or how cuDF uses it, please feel free to ask!

lgray commented 1 month ago

Ahh, OK. I missed that and only saw the code in the Parquet reader that makes cuFile and nvcomp calls in C++. We'd like to start in uproot by just working with local file access via DMA to the GPU with kvikio. Is there any conceptual issue we should be aware of there?

Thanks!

Also @nsmith-

bdice commented 1 month ago

@madsbk might be better suited to answer that question, but I think you should be fine to use the Python kvikio API or the C++ libkvikio API.

Also, I don't know this source code very well, but there is an example from the cuCIM library that shows how to do GPUDirect Storage reads of uncompressed TIFF images using the kvikio Python API. See: https://github.com/rapidsai/cucim/blob/branch-24.08/examples/python/gds_whole_slide/

madsbk commented 1 month ago

> Is there any conceptual issue we should be aware of there?

Since uproot is pure Python, I guess you want to use KvikIO's Python API, which should look a lot like Python's regular file API: https://docs.rapids.ai/api/kvikio/nightly/quickstart/

CuFile.read/.write and their non-blocking counterparts, CuFile.pread/.pwrite, support any array-like data (in host or device memory).
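
A minimal sketch of that API, adapted from the quickstart linked above (the file path and array contents are placeholders):

```python
import cupy
import kvikio

a = cupy.arange(100)
f = kvikio.CuFile("/tmp/test-file", "w")
f.write(a)  # blocking write of device memory directly to the file
f.close()

b = cupy.empty_like(a)
f = kvikio.CuFile("/tmp/test-file", "r")
future = f.pread(b)  # non-blocking read; returns a future immediately
future.get()         # wait for the read to finish
f.close()

assert (a == b).all()
```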

lgray commented 1 month ago

OK, thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the Python bindings when I saw direct usage of cuFile and nvcomp in cuDF's C++/CUDA code. That is why I was asking.

Is there anything we need to do in order to use fsspec effectively with this software stack? We have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote S3, etc.?

madsbk commented 1 month ago

> OK, thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the Python bindings when I saw direct usage of cuFile and nvcomp in cuDF's C++/CUDA code. That is why I was asking.

Using the Python API in a Python application shouldn't introduce any extra overhead compared to writing your own bindings. cuDF uses the C++ API because cuDF is mostly written in C++.

> Is there anything we need to do in order to use fsspec effectively with this software stack? We have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote S3, etc.?

No, KvikIO only supports local file access at the moment :/ It is something we are considering, though; we have been seeing a lot of interest in remote access lately.

lgray commented 1 month ago

Thanks for the clarification, we'll go ahead and give it a shot in uproot. Should be an interesting ride.

Is there any particular care we need to take with respect to thread divergence and decompression? I imagine keeping chunks of data mostly the same size helps?

lgray commented 1 month ago

And, indeed, remote file access will be super important for our use case. I'll talk to the NVIDIA contacts we have through Fermilab about getting some attention in this direction.

martindurant commented 1 month ago

Obviously, I have my own opinions on how remote file access should work, at least from the user-API perspective, so if any group at RAPIDS/NVIDIA/etc. is working on this, I'd be happy to be part of the conversation.

bdice commented 1 month ago

cc: @rjzamora on the topic of remote filesystem access.

rjzamora commented 1 month ago

We are in the process of prioritizing remote-file performance in cudf, and it may make the most sense to leverage kvikio as a space to make the necessary improvements.

@martindurant - I'd love to get your thoughts in https://github.com/rapidsai/cudf/issues/15919

quasiben commented 1 month ago

If you are trying to do remote RDMA (GPU RDMA), you could use https://github.com/rapidsai/ucxx, but you will need accelerated network devices as well. These are more common at HPC labs, so I would also ask about InfiniBand or Slingshot. cc @pentschev

It looks like Wilson is being decommissioned soon, but it was architected with InfiniBand NICs.

lgray commented 1 month ago

Yes, we have an internal cluster that we're putting together to test this stuff on. We'll have to get accelerated NICs for it, which will come in time; we unfortunately don't have InfiniBand or Slingshot on that cluster. Wilson's GPUs are quite old in any case, K40s, IIRC.

The main thing I'd like to achieve with this is loading files from a remote storage system into GPU memory. While that's pretty straightforward if the filesystem is mounted, with S3 or similar it wasn't clear what the proper way forward is. We'll continue with kvikio and mounted filesystems in the meantime.

lgray commented 3 weeks ago

Is there any reason why I would start getting memory errors on a 40 GB slice of an A100 when trying to allocate more than 10 GiB or so?

madsbk commented 3 weeks ago

> Is there any reason why I would start getting memory errors on a 40 GB slice of an A100 when trying to allocate more than 10 GiB or so?

Are you using RMM? (If you are allocating through cuDF, you are.) If so, you can try the new memory profiler in the nightly package: https://docs.rapids.ai/api/rmm/nightly/guide/#memory-profiler . It might show where the peak memory allocation happens and how large it is.
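
Roughly, based on that guide (a sketch; the rmm.statistics API is new, so exact names may differ by RMM version):

```python
import rmm
import rmm.statistics

# Wrap the current device resource so RMM allocations are tracked
rmm.statistics.enable_statistics()

@rmm.statistics.profiler()
def load(size):
    return rmm.DeviceBuffer(size=size)

load(10_000_000)
# Per-function call counts plus total and peak bytes allocated
print(rmm.statistics.default_profiler_records.report())
```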

lgray commented 3 weeks ago

@madsbk Great, thank you. I am indeed still using cuDF, so I'll try this out. awkward-array uses CuPy as a backend, so in the fullness of time we'll have to understand whether we're using RMM or some other memory manager.

bdice commented 3 weeks ago

CuPy does not use RMM by default, but you can enable RMM as the memory manager for CuPy. Here are some references:

https://docs.rapids.ai/api/rmm/stable/guide/#using-rmm-with-third-party-libraries

https://docs.cupy.dev/en/stable/user_guide/interoperability.html#rmm

https://docs.cupy.dev/en/stable/user_guide/memory.html
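
A short sketch of wiring this up, following those docs (the pool option and array are placeholders):

```python
import cupy
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

# Optionally use an RMM pool to amortize allocation cost
rmm.reinitialize(pool_allocator=True)

# Route all subsequent CuPy allocations through RMM
cupy.cuda.set_allocator(rmm_cupy_allocator)

x = cupy.zeros(1_000_000)  # now served by RMM
```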

lgray commented 3 weeks ago

Yes! I saw that, thanks for the further docs. We'll have to figure out something in awkward that detects and properly uses RMM when it is present.
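
One possible heuristic for that detection (a sketch; it only recognizes the documented rmm_cupy_allocator hook, not custom wrappers around RMM):

```python
import cupy

def cupy_uses_rmm() -> bool:
    """Return True if CuPy's current allocator is RMM's CuPy adaptor."""
    try:
        from rmm.allocators.cupy import rmm_cupy_allocator
    except ImportError:
        return False  # RMM is not installed
    return cupy.cuda.get_allocator() is rmm_cupy_allocator
```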

lgray commented 3 weeks ago

Indeed, it seems about 10 GB of compressed input data corresponds to a 41 GB peak allocation (and the MIG slice is 42 GB).

[screenshot: memory profiler output]

I guess the peak allocation corresponds to intermediate decompression buffers and then the final array assembly buffer?

I imagine there's a way to do this where only pairs of source/destination buffers are allocated at a time? Or has that been found to be slower, and the correct usage pattern is to run the GPU on data that is a quarter of the available memory size? That does give us ample room to pack in calculations.

martindurant commented 3 weeks ago

For Parquet, whether or not you can decompress bytes into their final storage depends on quite a few different factors. You generally want V2 pages and PLAIN encoding; and even then, only some data types will work (not strings!). I wouldn't be surprised if cuDF doesn't even bother looking for this efficient path.

lgray commented 3 weeks ago

This is float32/int/bool data in either singly-nested ragged arrays or just flat arrays; no strings or anything. Where should I root around in the Parquet metadata for V2 pages and these other doodads, to make sure I'm doing something optimal?

martindurant commented 3 weeks ago

> Where should I root around in the Parquet metadata for V2 pages

The encoding stats - how many pages of each type - are contained in the main file metadata (row_groups[].columns[].meta_data.encoding_stats). I think strictly they are optional, but I expect them to be present.
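
With fastparquet, for example, you can walk that path directly (a sketch; encoding_stats is optional, and the thrift enums come back as integers):

```python
from fastparquet import ParquetFile

pf = ParquetFile("data.parquet")
for rg in pf.fmd.row_groups:
    for col in rg.columns:
        stats = col.meta_data.encoding_stats
        if stats:  # optional field; may be absent
            for s in stats:
                # page_type and encoding are Parquet thrift enum values
                print(col.meta_data.path_in_schema, s.page_type, s.encoding, s.count)
```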

The pages themselves can be either V1 or V2 as you come to them; a "version: 2" in the global metadata guarantees nothing. In https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html you'll see data_page_version as an option, V1 by default(!), so if you were writing your own files, this is what you would change.

For INT, delta encoding would not allow decompress-to-target; for FLOAT, byte_stream_split likewise. Dictionary or RLE encoding would spoil any data type, but you really shouldn't use those for numbers unless you have very few unique but large values.

martindurant commented 3 weeks ago

Oh, and I don't know if you can see any of those details via pyarrow. I use fastparquet...

lgray commented 3 weeks ago

It looks like I can enforce these options by setting data_page_version="2.0" and column_encoding="PLAIN" in pyarrow. We'll regenerate the file and try again.
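
For reference, a sketch of that writer call (pyarrow requires dictionary encoding to be disabled when forcing a column encoding):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array([1.0, 2.0, 3.0], type=pa.float32())})
pq.write_table(
    table,
    "data_v2.parquet",
    data_page_version="2.0",  # write V2 data pages
    use_dictionary=False,     # must be off when forcing an encoding
    column_encoding="PLAIN",  # plain encoding for all columns
)
```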

lgray commented 3 weeks ago

@fstrug you should follow this!

nsmith- commented 3 weeks ago

> For INT, delta encoding would not allow decompress-to-target; for FLOAT, byte_stream_split likewise.

These encodings are something we will want in our data, especially the split float (due to our use of mantissa truncation for reduced precision), so maybe it's not worth worrying too much about minimizing intermediate buffers, and we should just operate on smaller numbers of row groups at a time?
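
A sketch of that pattern, assuming cudf.read_parquet's row_groups argument (file name and batch size are placeholders):

```python
import cudf
import pyarrow.parquet as pq

path = "data.parquet"
n = pq.ParquetFile(path).metadata.num_row_groups
batch = 4  # tune so decompression buffers fit alongside the compute

for start in range(0, n, batch):
    df = cudf.read_parquet(path, row_groups=list(range(start, min(start + batch, n))))
    # ... run the analysis on this slice ...
    del df  # release GPU memory before the next batch
```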

lgray commented 3 weeks ago

@nsmith- We'll have to see how this balances against throughput performance. I'd be curious whether we can load the next dataset while finishing compute on the last. Probably some automated sizing can be done per analysis.