rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Expand spill logging #860

Open pentschev opened 2 years ago

pentschev commented 2 years ago

Lately there has been growing interest from users in gathering information about data spilled by Dask-CUDA. Initially, https://github.com/rapidsai/dask-cuda/pull/442 added the ability to log spilling times, which the user can query at will to get information on all spilling operations that happened. However, this is limited to the "default" spilling and is not available for on-demand/JIT-unspill. There is also no information beyond the total time spent per operation, nor are there any examples of how to use it.

I believe it would be useful to have the following added:

cc @Matt711

pentschev commented 2 years ago

Also pinging @ayushdg @jnke2016 @randerzander who may have other feature requests in mind.

shwina commented 2 years ago

FYI: I'm looking into the related problem of visualizing GPU spilling.

pentschev commented 2 years ago

> FYI: I'm looking into the related problem of visualizing GPU spilling.

You mean you want to visualize it but there's no way to do that, or there's a problem with the current visualizer (assuming there's one, TBH I don't know if there is)?

Matt711 commented 2 years ago

Keeping the conversation going. Hey @shwina, I talked with @pentschev about this issue. If I can assist you with a similar issue, I'd love to.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Matt711 commented 2 years ago

The issue is still in progress. I will begin working actively on it next week.

Matt711 commented 2 years ago

Will start working on this issue next week. I was busy with getting the Dask Operator ready for release.

randerzander commented 2 years ago

We depend on JIT unspilling in most workflows now.

In trying to determine the right amount of GPU memory for a given workload, we'd like to know how often we spill, and how much time is spent spilling. There's not a good way to gather this information currently without manually looking at workflow profiles.

Since our profiles cover a great many jobs, that becomes an inordinately time-consuming process. It would be very useful for dask-cuda to log something like: timestamp, worker_id, memory request size, spilled object size, time elapsed during spill.

The above field names probably imply a misunderstanding about how spilling actually works, but I hope it conveys that with such information, we can programmatically find workloads that could be optimized to avoid spilling.
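As a rough illustration of what "programmatically find workloads" could look like, here is a minimal sketch that assumes spill events were available as records with the five fields listed above. Everything in it is hypothetical: `SpillRecord`, `summarize`, and `workers_over_budget` are made-up names for illustration, not part of dask-cuda or cuDF, and the sample values are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpillRecord:
    """One hypothetical spill event (not a dask-cuda type)."""
    timestamp: float       # seconds since the epoch
    worker_id: str
    requested_bytes: int   # memory request that triggered the spill
    spilled_bytes: int     # size of the object moved off the GPU
    elapsed_s: float       # time spent performing the spill


def summarize(records):
    """Return {worker_id: (spill_count, total_spilled_bytes, total_elapsed_s)}."""
    totals = defaultdict(lambda: [0, 0, 0.0])
    for r in records:
        t = totals[r.worker_id]
        t[0] += 1
        t[1] += r.spilled_bytes
        t[2] += r.elapsed_s
    return {worker: tuple(t) for worker, t in totals.items()}


def workers_over_budget(records, max_elapsed_s):
    """Sorted worker ids whose cumulative spill time exceeds max_elapsed_s."""
    summary = summarize(records)
    return sorted(w for w, (_, _, s) in summary.items() if s > max_elapsed_s)


# Invented sample data, purely for illustration.
records = [
    SpillRecord(0.0, "gpu-0", 1 << 30, 512 << 20, 0.25),
    SpillRecord(1.0, "gpu-0", 1 << 30, 256 << 20, 0.5),
    SpillRecord(2.0, "gpu-1", 1 << 29, 128 << 20, 0.125),
]

for worker, (count, nbytes, seconds) in sorted(summarize(records).items()):
    print(f"{worker}: {count} spills, {nbytes / 2**20:.0f} MiB, {seconds:.3f} s")
```

With records of this shape, a call such as `workers_over_budget(records, 0.5)` could flag the workers (and, by extension, the jobs) worth inspecting, without opening individual profiles.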

madsbk commented 2 years ago

I have been planning to implement this for JIT unspilling for some time, but now that we are introducing spilling in cuDF, it might be sufficient to include spill logging in cuDF?
