Need to find a way to scale across GPUs. Possible options:
Multiple GPUs can be used for inference through PyTorch data parallelism (this does not work for every model), which parallelizes batches across the available GPUs. One important consideration is to either use the single-threaded scheduler (not recommended) or to limit the number of Dask workers to the number of GPUs (`dask.config.set(num_workers=<#GPUs>)`) to avoid running into issues. Another alternative could be assigning GPUs to spawned processes (not tested yet).
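A minimal sketch of this option, assuming a local threaded scheduler and placeholder data paths and column names; `torch.nn.DataParallel` is used here as one way to get PyTorch data parallelism, and the tiny linear model is only a stand-in:

```python
import dask
import dask.dataframe as dd
import torch

# Limit concurrent Dask workers to the number of GPUs so that partitions
# running in parallel do not oversubscribe the devices.
num_gpus = torch.cuda.device_count()
dask.config.set(scheduler="threads", num_workers=num_gpus)

# DataParallel splits each batch across the visible GPUs
# (not supported by every model, as noted above).
model = torch.nn.DataParallel(torch.nn.Linear(128, 10).eval().cuda())

@torch.no_grad()
def infer_partition(df):
    # Hypothetical: turn a pandas partition into a batch and run inference.
    batch = torch.tensor(df["features"].tolist(), dtype=torch.float32).cuda()
    preds = model(batch).argmax(dim=1).cpu().numpy()
    return df.assign(prediction=preds)

ddf = dd.read_parquet("data/")  # placeholder input
ddf = ddf.map_partitions(
    infer_partition,
    # Explicit meta so Dask does not probe the function with dummy data.
    meta={**ddf.dtypes.to_dict(), "prediction": "int64"},
)
ddf.to_parquet("predictions/")  # triggers the computation
```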
Another option is to try out `LocalCUDACluster` (from the `dask_cuda` package), which seems to be the intended way to run GPU components with Dask. This still needs testing, in particular how it interacts with PyTorch.
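A rough sketch of what the `LocalCUDACluster` route could look like (assuming `dask_cuda` is installed; untested with PyTorch, as noted above):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# LocalCUDACluster starts one worker process per GPU and pins each worker to a
# single device via CUDA_VISIBLE_DEVICES, so every task sees exactly one GPU.
cluster = LocalCUDACluster()
client = Client(cluster)

# With the client active, a PyTorch task scheduled on any worker only sees its
# assigned GPU (torch.cuda.device_count() == 1 inside the task), so the model
# can simply be moved to "cuda" without manual device assignment.
```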
Whether to run a model with the processes or the threaded scheduler: so far, the threaded scheduler has proven faster, and most resources also seem to recommend threads (link).
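For reference, switching between the two local schedulers is a one-line configuration change (or a per-call keyword), which makes it easy to benchmark both; a sketch with a placeholder input:

```python
import dask
import dask.dataframe as dd

ddf = dd.read_parquet("data/")  # placeholder input

# Global default: the threaded scheduler (shared memory, no pickling of the
# model between workers), which has been the faster option in the tests so far.
dask.config.set(scheduler="threads")

# Alternative: the processes scheduler, which sidesteps the GIL but pays for
# serializing data (and models) between worker processes.
# dask.config.set(scheduler="processes")

# The scheduler can also be picked per call:
sizes = ddf.map_partitions(len).compute(scheduler="threads")
```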
Open questions:
How to parallelize GPU and CPU tasks efficiently: limiting the number of workers to the number of GPUs can leave CPU cores idle (whenever a machine has more CPU cores than GPUs). There is some room for optimization here.
Conclusions above by @PhilippeMoussalli, from https://github.com/ml6team/fondant/pull/489.