janezales opened 1 year ago
I think this is more of a Ray Data issue than a Ray Core one. Ray Data uses Ray Core's shared memory to store outputs (DataIterator/Trainer inputs).
Ray Data currently does not pin memory. It might not make sense to do so for shared memory right now, because we store Arrow blocks in shared memory, not the final tensor batches (although this may change in the future, see #41571). Also, shared memory blocks are usually read once (unless ds.materialize() is used).
But it should be easy enough to plug something in here. The way to do it would be to write a custom collate_fn that manages a pinned buffer and allocates the returned tensor batches there (see the sketch below). Are you interested in contributing this? We'd be interested in seeing the results!
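A minimal sketch of what such a collate_fn could look like, assuming PyTorch and Ray Data's iter_torch_batches (which passes each batch to collate_fn as a dict of NumPy arrays). The PinnedCollate name and the buffer-reuse strategy are illustrative, not part of Ray's API:

```python
import numpy as np
import torch


class PinnedCollate:
    """Illustrative collate_fn: copies each NumPy batch column into a
    reusable pinned (page-locked) CPU buffer so the later host-to-GPU
    copy can be a fast DMA transfer instead of pageable -> pinned -> GPU."""

    def __init__(self):
        # Column name -> pinned tensor, reallocated if shape/dtype changes.
        self._buffers = {}

    def __call__(self, batch):
        out = {}
        for name, array in batch.items():
            src = torch.from_numpy(np.ascontiguousarray(array))
            buf = self._buffers.get(name)
            if buf is None or buf.shape != src.shape or buf.dtype != src.dtype:
                buf = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
                self._buffers[name] = buf
            buf.copy_(src)  # host-side copy: pageable source -> pinned buffer
            out[name] = buf
        return out
```

Usage sketch (the pinned tensors are reused across iterations, so move them to the GPU before requesting the next batch):

```python
for batch in ds.iter_torch_batches(collate_fn=PinnedCollate()):
    gpu_batch = {k: v.to("cuda", non_blocking=True) for k, v in batch.items()}
```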
Description
SharedMemory: is it pinned (non-pageable)? If not, add a flag for pinning a tensor in shared memory in RAM.
When using Ray SharedMemory for pushing/pulling to/from a GPU (TPU, etc.), the user should have the ability to pin this memory in RAM. That means avoiding the pattern of storing data in pageable memory and then copying it into pinned (non-pageable) memory (which is what CUDA does under the hood), thereby speeding up copy operations between RAM and GPUs.
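To make the claim concrete, here is a rough PyTorch micro-benchmark sketch comparing host-to-device copies from pageable vs. pinned memory. This is not Ray code, and the actual speedup depends on hardware and transfer size:

```python
import time
import torch


def time_h2d(cpu_tensor, iters=100):
    """Average seconds per host-to-device copy of cpu_tensor."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        # non_blocking only takes effect for pinned memory; pageable
        # sources fall back to a synchronous copy, which is exactly
        # the difference being measured here.
        cpu_tensor.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


pageable = torch.randn(16 * 1024 * 1024)  # ~64 MB of float32
pinned = pageable.pin_memory()            # page-locked copy of the same data
print(f"pageable: {time_h2d(pageable):.6f} s/copy")
print(f"pinned:   {time_h2d(pinned):.6f} s/copy")
```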
Use case
If not done yet, this would speed up memory copies between shared memory and GPUs by a factor of about 2. In the case of large model tensors, this is worth doing because there are many elements; in the case of training data, this is worth doing because you are doing many repetitions.