Open cadedaniel opened 1 year ago
@scottsun94 can you please evaluate this and set a priority?
@jjyao @jonathan-anyscale @rkooo567
In 2.9, can we add a profiling tab to the job detail page and show a list of profiling files there for people to download, including this type of trace? At the very least, people could download the files and visualize them on their own.
Visualizing those traces/files in the dashboard could be the next step.
We need to see how we can achieve it with PyTorch (it is pattern 2 from this doc https://docs.google.com/document/d/1MYM7ImPQmuEfcKMoK0hx_2h4rBesgMTAwalGrSWscgQ/edit). cc @jonathan-anyscale
but the general direction sgtm
Description
Support PyTorch memory usage visualizations. These show where memory is being consumed, and can be used to optimize memory usage or to find the cause of OOMs in PyTorch allocations (via torch._C._cuda_attach_out_of_memory_observer(oom_observer)). See https://zdevito.github.io/2022/12/09/memory-traces.html for more info.
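For reference, a minimal sketch of how a worker could capture such a snapshot on OOM, loosely following the linked blog post. The helper names `save_snapshot` and `install_oom_observer`, the output directory, and the file naming are hypothetical, not part of Ray or PyTorch. The `torch.cuda.memory._record_memory_history`, `torch.cuda.memory._snapshot`, and `torch._C._cuda_attach_out_of_memory_observer` calls are private PyTorch APIs whose signatures may differ between releases, so treat this as an illustration only.

```python
import pickle
import time
from pathlib import Path


def save_snapshot(snapshot, out_dir, prefix="oom_snapshot"):
    """Pickle a memory snapshot into out_dir and return the file path.

    The directory layout and file naming here are assumptions for illustration.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{prefix}_{time.time_ns()}.pickle"
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)
    return path


def install_oom_observer(out_dir):
    """Attach an OOM observer if PyTorch with CUDA is available.

    Returns True when the observer was installed, False otherwise.
    """
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False

    # Private API: start recording allocation history with stack traces.
    torch.cuda.memory._record_memory_history(True)

    def oom_observer(device, alloc, device_alloc, device_free):
        # Called by the CUDA caching allocator when an allocation fails.
        save_snapshot(torch.cuda.memory._snapshot(), out_dir)

    # Private API named in the issue description.
    torch._C._cuda_attach_out_of_memory_observer(oom_observer)
    return True
```

Calling `install_oom_observer("/tmp/profiles")` once at worker start-up would leave a pickle file per OOM event in that directory (the path is only an example). Those files are the kind of artifact a profiling tab could list for download.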
profile view:
snapshot view:
Use case