pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.54k stars 1.06k forks

NVIDIA DALI pipelines into Xarray to facilitate direct loading of data into GPU memory. #9238

Closed negin513 closed 1 month ago

negin513 commented 1 month ago

Is your feature request related to a problem?

I suggest integrating NVIDIA DALI (Data Loading Library) pipelines into Xarray to enable loading data directly into GPU memory. This would avoid CPU-GPU transfer bottlenecks and improve performance for AI/ML workflows running across multiple GPUs.

Describe the solution you'd like

Ideally, the xr.open_dataset() function could be extended to accept a new argument, for example dali_pipeline, which would let users pass a DALI pipeline object directly when loading an Xarray dataset.

dali_pipeline = dali.Pipeline(batch_size=N, num_threads=Y, device_id=Z)

ds = xr.open_mfdataset(my_files, dali_pipeline=dali_pipeline)

Here is an example of the NVIDIA DALI numpy reader that could be used with Xarray.

Describe alternatives you've considered

Currently, users manually load data into CPU memory, preprocess it with custom scripts, and then transfer it to GPU memory for ML workflows. This introduces significant latency and CPU-GPU memory-transfer overhead, especially in distributed multi-GPU ML workflows.

As more people train ML models on multiple GPUs, this integration would significantly streamline their workflows and reduce CPU-GPU memory-transfer overhead.

Additional context

I look forward to the community's feedback and am happy to assist with the implementation process.

TomNicholas commented 1 month ago

This sounds really cool and powerful!

Ideally we can extend the xr.open_dataset() function to accept a new argument, for example dali_pipeline.

I think it's very unlikely that we would add a new argument to open_dataset for this, but custom xarray backends can accept whatever keyword arguments they want, so couldn't you add this pipeline feature in a custom backend? Maybe in the kvikio backend?
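To illustrate the backend route, here is a minimal, hypothetical sketch using xarray's documented BackendEntrypoint API. The class name DaliBackendEntrypoint, the dali_pipeline keyword, and the placeholder dataset it returns are all invented for illustration; only the BackendEntrypoint subclassing mechanism and the engine= dispatch are real xarray features.

```python
# Hypothetical sketch: a custom backend can accept arbitrary keyword
# arguments, so a dali_pipeline argument can live in the backend rather
# than in xr.open_dataset itself.
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class DaliBackendEntrypoint(BackendEntrypoint):
    """Illustrative backend accepting a dali_pipeline keyword argument."""

    open_dataset_parameters = ("filename_or_obj", "drop_variables", "dali_pipeline")

    def open_dataset(self, filename_or_obj, *, drop_variables=None, dali_pipeline=None):
        # A real implementation would run the DALI pipeline here and wrap
        # the resulting GPU buffers; this sketch just returns a placeholder
        # dataset and records whether a pipeline was supplied.
        data = np.zeros((2, 3))
        ds = xr.Dataset({"placeholder": (("x", "y"), data)})
        ds.attrs["used_dali_pipeline"] = dali_pipeline is not None
        return ds

    def guess_can_open(self, filename_or_obj):
        # Opt-in only: users must pass engine=DaliBackendEntrypoint.
        return False
```

With this pattern, a user would write something like xr.open_dataset(path, engine=DaliBackendEntrypoint, dali_pipeline=pipe) and the extra keyword is delivered to the backend with no change to open_dataset's own signature.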

negin513 commented 1 month ago

Hey @TomNicholas, yes, this should definitely be an argument for a backend rather than for open_dataset itself. DALI's pipeline is a bit different from the kvikio backend (possibly an alternative to it), but I can see both being part of a single backend too. Mostly I opened this issue to gauge the community's interest and gather thoughts on the usefulness of this feature. Maybe @weiji14 has already explored DALI for streaming data to the GPU? @weiji14, please let me know.

One big constraint here is the setup requirements for GDS (GPUDirect Storage).