ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.9k stars 5.76k forks

[Dashboard] Support PyTorch memory usage visualizations #39878

Open cadedaniel opened 1 year ago

cadedaniel commented 1 year ago

Description

Support PyTorch memory usage visualizations in the dashboard. These show where memory is being consumed, which helps when optimizing memory usage or when tracking down the cause of OOMs in PyTorch allocations (for OOMs, a snapshot can be captured from an observer registered with torch._C._cuda_attach_out_of_memory_observer(oom_observer)).

See https://zdevito.github.io/2022/12/09/memory-traces.html for more info

profile view: [image: trace3]

snapshot view: [image: Screenshot 2023-09-26 at 1 53 18 PM]
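For reference, a minimal sketch of how such a snapshot might be produced from user code. It assumes a recent PyTorch (2.1 or later) with CUDA available; `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot` are private APIs and may change between releases. The function names, file paths, and toy workload are illustrative only and are not part of Ray.

```python
from __future__ import annotations


def dump_cuda_memory_snapshot(path: str = "snapshot.pickle",
                              max_entries: int = 100_000) -> bool:
    """Record allocator history around a toy workload and dump a snapshot.

    Returns True if a snapshot was written, False if PyTorch or CUDA
    is not available in this process.
    """
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False

    # Private API (leading underscore): assumed PyTorch >= 2.1 signature.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        # Toy workload standing in for a real training step.
        x = torch.randn(1024, 1024, device="cuda")
        y = x @ x
        del x, y
        # Write the recorded allocation history to a pickle file.
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Stop recording so later code does not pay the tracing overhead.
        torch.cuda.memory._record_memory_history(enabled=None)
    return True


def attach_oom_snapshot_observer(path: str = "oom_snapshot.pickle") -> bool:
    """Dump a snapshot when the CUDA caching allocator hits an OOM.

    Returns True if the observer was registered, False otherwise.
    """
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False

    # History must be recording for the snapshot to contain stack traces.
    torch.cuda.memory._record_memory_history(max_entries=100_000)

    def oom_observer(device, alloc, device_alloc, device_free):
        # Called by the allocator right after an OOM is detected.
        torch.cuda.memory._dump_snapshot(path)

    torch._C._cuda_attach_out_of_memory_observer(oom_observer)
    return True
```

The resulting pickle files are the kind of artifact the post linked above visualizes, so they are what the dashboard would need to collect and render.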

Use case

anyscalesam commented 1 year ago

@scottsun94 can you please evaluate this and set a priority?

scottsun94 commented 1 year ago

@jjyao @jonathan-anyscale @rkooo567

In 2.9, can we add a profiling tab to the job detail page and show a list of profiling files there for people to download, including this type of trace? At the very least, people could then download the files and visualize them themselves.

[image: Screenshot 2023-10-24 at 5 17 43 PM]

Visualizing those traces/files could be the next step.
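As a rough illustration of the "list of profiling files" idea, the backend side could be as small as the sketch below. Everything here is hypothetical: the directory layout, the `PROFILE_SUFFIXES` tuple, and the `list_profiling_files` helper are assumptions made for the example, not existing Ray dashboard code.

```python
from __future__ import annotations

from pathlib import Path

# Assumed set of extensions that count as profiling artifacts.
PROFILE_SUFFIXES = (".pickle", ".json", ".svg")


def list_profiling_files(profile_dir: str,
                         suffixes: tuple[str, ...] = PROFILE_SUFFIXES) -> list[dict]:
    """Return metadata for downloadable profiling files in a directory.

    `profile_dir` is a hypothetical per-job directory where profiling
    output (for example PyTorch memory snapshots) has been written.
    """
    root = Path(profile_dir)
    if not root.is_dir():
        return []

    entries = []
    for path in sorted(root.iterdir()):
        # Only plain files with a recognized extension are listed.
        if path.is_file() and path.suffix in suffixes:
            stat = path.stat()
            entries.append({
                "name": path.name,
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
            })
    return entries
```

A job detail page could render these entries as download links, leaving visualization to whatever tool the user prefers.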

rkooo567 commented 1 year ago

We need to see how we can achieve it with PyTorch (it is pattern 2 from this doc: https://docs.google.com/document/d/1MYM7ImPQmuEfcKMoK0hx_2h4rBesgMTAwalGrSWscgQ/edit). cc @jonathan-anyscale

rkooo567 commented 1 year ago

but the general direction sgtm