[Dashboard | Core?] Py-Spy and Python 3.12 - Lack of support

mtralka commented 3 months ago

What happened + What you expected to happen

Py-Spy does not yet support Python 3.12 - nor does it seem to have much momentum to (https://github.com/benfred/py-spy/issues/670, https://github.com/benfred/py-spy/issues/661, https://github.com/benfred/py-spy/issues/633). On Python 3.12 and Ray 2.32 the CPU flame graph and stack trace abilities of the Ray Dashboard are broken / non-functional due to upstream py-spy lack of support for 3.12.

Does Ray have plans to transition to another statistical profiler (ex. https://github.com/P403n1x87/austin) to support the 3.12 releases? What would this level of effort be? Is the Ray team open to PRs here?

I understand Ray is in python 3.12 alpha support per https://github.com/ray-project/ray/issues/45904. Is this issue a prerequisite for a non-alpha release?

Pyroscope seems to have the same issue: https://github.com/grafana/pyroscope-rs/issues/168 https://github.com/grafana/pyroscope/issues/3287

Versions / Dependencies

Python 3.12 Ray 2.32

Reproduction script

Following the documentation for Py-Spy in KubeRay - https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/pyspy.html#kuberay-pyspy-integration.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Superskyyy commented 3 months ago

We have tried to use Pyroscope continous profiler and it works pretty well, but in the long run the new elastic ebpf profiler would be the way to go, I imagine.

Superskyyy commented 3 months ago

https://github.com/open-telemetry/opentelemetry-ebpf-profiler

anyscalesam commented 2 months ago

TY for raising this; we'll review this week internally and see what solution we can do here. I think there is a non pyspy path for CPU flamegraphs IIRC

mtralka commented 2 months ago

Thank you @anyscalesam. Any update on this path internally?

From an external view it seems prudent to create another CpuProfilingManager but for a non-pyspy target. We would have to ensure it has flame graph and / or speedscope support. Plus then support plumbing ProfilingManager selection into the dashboard reporter agent. Thoughts? I'd be happy to implement something like this if the project is interested in such a path.

anyscalesam commented 2 months ago

cc @alanwguo here > we're looking at Scalene and PyInstrument as possible substitutes.

mtralka commented 2 months ago

Thanks for the quick reply. Both of those look promising. The AI features of Scalene is slightly concerning from a naive perspective (I assume phoning home is opt-in?). Selection of profiler does seem to be the hardest decision here.

A "bring your own" profiler approach seems potentially attractive. Should Ray lib make the choice of what profiler is shipped with it or should we enable the user to provide their own + register an adapter class (basically a ProfilingManager) as needed?

alanwguo commented 2 months ago

Hi @mtralka , thanks for offering to help out here. Your outline of what changes would be needed makes sense to me. I agree that the selection of the profiler is the hardest decision here.

I do like the idea of "bring your own" profiler, although we'd probably still in the future replace pyspy as the default. How would you imagine a user could register an adapter class?

mtralka commented 2 months ago

Of course, happy to help where possible. I agree pyspy shouldn't be the default if py3.12 support is mainstreamed.

Regarding implementation - this could certainly be a rabbit hole! IMO a relatively simple / straightforward approach here seems prudent since there doesn't seem to be much user desire for new features here other than fixing py3.12 (please correct me if I'm wrong!).

Some questions / assumptions:

input from ray is always constant across profiler adapters (pid, native, etc)
output should be delivered in a compatible format with existing ray dashboard features
- This format varies across profiling tasks
- cpu profile - likely the same as currently supported - flamegraph, "raw", speedscope.
  - I assume "raw" is a pyspy specific format. This might be an issue if raw is interpreted downstream in Ray. If it isn't, then no worries
- trace dump - I'm unaware of a "standard" format here. Would need to find / establish a standard here. I think it is safe to make the user adapt to this standard rather then try to handle every format variation internally. Do you have any domain expertise here?
- memory profiling - likely the same as currently supported - flamegraph, table

For profilers with "unique" features like Scalene's Web UI we'd have to consider how to present these features.

In its simplest form we can keep the existing class based approach. Pseudo ->

def register_profiler(profiler_manager: type[CpuProfilingManager], as_default: bool = False):
    """Register a CPU profiling manager.

    as_default: bool
        Whether to set the given profiler as the default one.
    """
    if not issubclass(profiler_manager, CpuProfilingManager):
        raise TypeError

    _PROFILERS[profiler_manager.PROFILER_NAME] = profiler_manager

def get_profiler(name: str) -> CpuProfilingManager:
  return _PROFILERS[profiler_manager.PROFILER_NAME]

async def GetTraceback(self, request, context):
        pid = request.pid
        native = request.native
        desired_profiler_name = ... # pull from ENV / config singleton / however ray handles config
        profiler_type = get_profiler(desired_profiler_name)
        p = profiler_type(self._log_dir)
        success, output = await p.trace_dump(pid, native=native)
        return reporter_pb2.GetTracebackReply(output=output, success=success)

Where we register subclasses of theCpuProfilingManager. The current CpuProfilingManager is PySpy specific but implementation seems to be refactorable without changing the API.

Note that using a class-based profiler system delineated by CPU / MEM category is a bit limiting / obtuse for combined profilers such as Scalene which handle CPU / GPU / MEM all in one.

Thoughts? Totally acceptable if I'm off base here

alanwguo commented 2 months ago

@mtralka, registering subclasses or functions can make sense.

We should see if the output of the profiling functions can be left to be pretty generic. With py-spy, the tracedumb is raw text, the flamegraph is an image file (renderable by the browser).

Memory profiling is actually handled separately using "Memray" instead of py-spy. We should consider making that customizable as well. I believe the out of this is also raw text.

It seems if the generic output format is any html-renderable document, that may be simple interface to conform to.

In terms of class-based override or not. I don't think we necessarily need to adhere to the current pattern of using classes. We could override individual functions per profiling functionality. I don't think there is very much code re-use between the functions within the same class.

Do you want to give that a shot?

kwongtn commented 3 weeks ago

pyspy just released 0.4.0 with Python 3.12 and 3.13 support 😁

https://github.com/benfred/py-spy/releases/tag/v0.4.0

ray-project / ray