python-trio / trio

Trio – a friendly Python library for async concurrency and I/O
https://trio.readthedocs.io

notes on profiling Trio programs #943

Open belm0 opened 5 years ago

belm0 commented 5 years ago

profiler categories: deterministic (cProfile, yappi), sampling (vmprof, pyflame)

deterministic:

sampling:

the bad news: none of these profilers appear to be coroutine aware. Neither type of profiler will correctly report "total time" in the context of cooperative multitasking. E.g. a function awaits trio.sleep, which yields to the scheduler, which resumes another task, and so on, all incorrectly attributed as children of the original function.

The profilers should somehow use yield, or else details of the particular event loop (asyncio, Trio, etc.), to stop attributing time to the parent function. They apparently already do this for select, etc.

It seems like a ripe area to improve tooling for async / await / Trio. I can't find any prior work for Python addressing this.

the good news: "own time" and function counts are still useful in the context of coroutines
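
To make the misattribution scenario concrete, here is a minimal sketch (hypothetical, not code from this thread) that runs a sleeping task alongside a busy task under cProfile; whether cumulative time really gets folded into the sleeping function is exactly what's worth checking per profiler:

```python
# Sketch for checking how a deterministic profiler attributes cumulative
# time across a scheduler yield: while parent() is suspended in
# trio.sleep(), busy_child() keeps burning CPU in small slices.
import cProfile

import trio

async def busy_child():
    # CPU work in small slices, yielding to the scheduler between slices
    for _ in range(200):
        sum(range(50_000))
        await trio.sleep(0)

async def parent():
    # spends ~1 s suspended; a non-coroutine-aware profiler might charge
    # busy_child's CPU time to parent's "total time"
    await trio.sleep(1)

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(parent)
        nursery.start_soon(busy_child)

cProfile.run("trio.run(main)", sort="cumulative")
```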

njsmith commented 5 years ago

I haven't done an extensive comparison, but last time I used vmprof it was definitely "coroutine aware" in the sense you mean. (Really, it just queries the interpreter for the current stack whenever the sampling interrupt fires, and the interpreter stack does not include suspended tasks.)

njsmith commented 5 years ago

Regarding your categories: I think of "deterministic profilers" as more like a kind of coverage tool. They can be handy in specific circumstances if you're curious which functions are called, and how many times, but if I want to know about actual time (CPU or wall-clock), then I don't use them. Too many times I've had cProfile tell me that a function was slow because cProfile was making it slow.

Anyway, I wrote a little test program that I thought might misattribute time to sleep, and ran it through vmprof, cProfile, and py-spy, and none of them show sleep as taking up any time at all.

Are you sure that your code isn't, like... accidentally calling await trio.sleep(0) in a tight loop or something?

Test program:

```python
import trio
import trio.testing

async def main():
    async with trio.open_nursery() as nursery:
        async def sleepy_child():
            await trio.sleep(5)
            nursery.cancel_scope.cancel()

        s, r = trio.open_memory_channel(1)

        async def noisy_sender():
            while True:
                await s.send(None)

        async def noisy_receiver():
            while True:
                await r.receive()

        nursery.start_soon(sleepy_child)
        await trio.testing.wait_all_tasks_blocked()
        nursery.start_soon(noisy_sender)
        nursery.start_soon(noisy_receiver)

trio.run(main)
```
belm0 commented 5 years ago

Thank you-- I think you're right and I'm revisiting vmprof.

A few things I'd like to make sense of:

  1. When the sampling profiler attributes 40% of runtime to handle_io, and there is negligible time on that node's children, what does it mean? No CPU use? Kernel CPU use?

[screenshot: vmprof call tree around handle_io]

  2. the call tree for HTTP get via asks package looks truly pathological. Following is only the head 10% of call tree:

[screenshot: head of the asks HTTP get call tree]

belm0 commented 5 years ago

Yes, if we're to believe vmprof, sleep and sleep_until are taking (8% + 7%) of my app time.

Likely due to many loops which sleep for a small time, wake up, and do negligible work. It may reflect things not being event-driven enough. Casual programmers tend to coordinate by periodic loops which share state, rather than deal with the intricacies of building and composing events.

[screenshot: vmprof call tree showing sleep and sleep_until]

generic_function detail:

[screenshot: generic_function subtree]
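
To illustrate the two coordination styles mentioned above (periodic polling of shared state vs. waiting on events), here is a small hypothetical sketch; the class and method names are placeholders:

```python
# Hypothetical contrast between the two coordination styles described above.
import trio

class Worker:
    def __init__(self):
        self.ready = False               # shared state for the polling style
        self.ready_event = trio.Event()  # event for the event-driven style

    async def poll_for_ready(self):
        # polling style: wakes up 20x per second even when nothing changed,
        # so a profiler sees time accumulating in sleep/checkpoint
        while not self.ready:
            await trio.sleep(0.05)

    async def wait_for_ready(self):
        # event-driven style: sleeps until set() is called, no periodic wakeups
        await self.ready_event.wait()

    def mark_ready(self):
        self.ready = True
        self.ready_event.set()
```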

njsmith commented 5 years ago

When the sampling profiler attributes 40% of runtime to handle_io, and there is negligible time on that node's children, what does it mean? No CPU use? Kernel CPU use?

Huh. I have no idea, actually. If it was counting time sleeping ("no CPU use"), then I would expect that to show up inside kqueue.control, which I guess is probably buried under the call_function node there? But there's not a lot going on in handle_io.

the call tree for HTTP get via asks package looks truly pathological. Following is only the head 10% of call tree:

The top part (above PyStack_AsDict) looks pretty reasonable to me. And below that is just a bunch of random details about the interpreter's internals. Notice that it's all 0%, meaning probably like, 1 sample happened to observe each of those call stacks. I'd just ignore the 0% stuff.

Yes, if we're to believe vmprof, sleep and sleep_until are taking (8% + 7%) of my app time. (It's a soft real-time app heavily using Trio.)

For the sleep image you showed, the fact that all the time is in checkpoint tells me that these are all calls to await sleep(0). Can you tell me more about how you're using await sleep(0)?

Right now checkpoint is super un-optimized, and I take this as evidence that we probably should optimize it more (https://github.com/python-trio/trio/issues/70). But I'd like to know more about what's going on :-)

njsmith commented 5 years ago

Oh, and I'm totally baffled by the generic_function thing. I have no idea what that is. We do have something called generic_function in trio, but literally the only thing it's used for is when defining open_memory_channel, and there's no way checkpoint should be defining open_memory_channel.

belm0 commented 5 years ago

Can you tell me more about how you're using await sleep(0)?

Two cases:

belm0 commented 5 years ago

With a bare bones implementation of HTTP get, I was able to reduce CPU due to these requests from 11.3% to 3.0% of my app total (per vmprof).

As mentioned, the original version used asks library. Requests are made at a rate of 30Hz, and response size is 220K; the connection is not SSL. My replacement uses Trio networking directly along with httptools for parsing.

```python
import httptools
import trio


class HttpResponseHandler:
    def __init__(self):
        self.message_complete = False
        self._body = []
        self.body = None

    def on_body(self, body: bytes):
        self._body.append(body)

    def on_message_complete(self):
        self.message_complete = True
        self.body = b''.join(self._body)


class TrioHttpConnection:
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.stream: trio.SocketStream = None

    async def get(self, path):
        if self.stream is None:
            self.stream = await trio.open_tcp_stream(self.host, self.port)

        RECEIVE_CHUNK = 1024 * 10
        request = '\r\n'.join([
            ' '.join(['GET', path, 'HTTP/1.1']),
            'Host: ' + self.host,
            '\r\n'
        ])
        await self.stream.send_all(bytes(request, encoding='utf-8'))
        response_handler = HttpResponseHandler()
        p = httptools.HttpResponseParser(response_handler)
        while not response_handler.message_complete:
            reply = await self.stream.receive_some(RECEIVE_CHUNK)
            if reply == b'':
                break
            p.feed_data(reply)
        class response:
            content = response_handler.body
            status_code = p.get_status_code()
        return response
```
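
For reference, a driver for the class above might look like this (hypothetical host and path; ~30 Hz request rate as described earlier):

```python
# Hypothetical usage of the TrioHttpConnection sketch above.
import trio

async def poll_endpoint():
    conn = TrioHttpConnection('example.com', 80)
    while True:
        response = await conn.get('/data')
        print(response.status_code, len(response.content))
        await trio.sleep(1 / 30)   # ~30 Hz request rate

trio.run(poll_endpoint)
```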

While implementing that I noticed that CPU use becomes high when receive_some chunk size is under some threshold. E.g. for 1K size, my performance was worse than asks, and at 10K it became significantly better than asks (diminishing returns beyond that). For asks, there is a receive_some size of 10,000 which appears sufficient-- the issue is more regarding h11 overhead (https://github.com/theelous3/asks/issues/108).

So at 1K size, clearly the 200+ Trio scheduling events are involved. It's an interesting and tangible case to look into as far as Trio library profiling. I'd like to try it against the cancellation refactoring PR #910 as well.

I have vmprof comparing 16K and 1K chunk size. Screenshots don't work well for comparison since vmprof UI is not great (segment lengths don't correspond to time). To summarize, in the 16K case, receive_some -> wrapper -> _nonblocking_helper is mostly balanced between aenter, aexit, wait_readable, and Python coroutine internals. In the 1K case, aenter, aexit dominate due to checkpoint_if_cancelled and cancel_shielded_checkpoint, respectively.

[screenshot: vmprof call tree, one chunk size]

[screenshot: vmprof call tree, the other chunk size]

belm0 commented 5 years ago

Like handle_io, _nonblocking_helper() is another case that vmprof marks as taking a considerable portion of the total run, yet whose child percentages are negligible.

[screenshot: _nonblocking_helper subtree]

These functions make it hard to see which areas of my code are actually problematic. E.g. one child of the root may show 10% vs. a sibling's 3%, but the first is mostly _nonblocking_helper(), and we haven't determined whether that time goes to idling while blocked, to something in the kernel, etc.

It may be an artifact or bug in this particular PyCharm call-tree display of the vmprof data. Things look fairly different in the vmprof UI; I don't see these unaccounted times there.

belm0 commented 5 years ago

I tried PR #910, no impact on my app.

oremanj commented 5 years ago

Maybe try #958? (supersedes #910, tries to do fewer things at once, is probably faster). It doesn't have _pending_cancel_stack at all, so it definitely won't produce exactly the same graph. :-)

belm0 commented 5 years ago

@oremanj thank you. I tried #958, and no impact as well (just looking at process total CPU). It may reflect that the app doesn't have much cancel scope depth.

belm0 commented 5 years ago

I'll try to describe a major weakness I've learned about sampling profilers.

Sampling profilers are good for looking at relative CPU use between parts of a program within one session. For example, assuming stack-based sampling, it will reveal that all stacks at function A and below take roughly 10% of runtime. However, that number depends heavily on how many times function A is called relative to other parts of the program, which will not be deterministic for any real program that employs concurrency and has nondeterministic input, so the measurement will be highly variable.

Sampling profilers are not good for comparing the effects of minor code changes between sessions. By "minor" I mean not substantially changing the call graph, and not making a gross performance change in some hot code path. Unfortunately many of the important optimizations we make will be "minor" under this definition. (E.g. speeding up some often-used function by 25%, or detecting a 15% CPU regression in a 3rd party websocket package.) The reason is that the output of sampling profilers cannot reasonably be normalized to "how much CPU does a call to this function take", because they lack information on call counts.

You may conclude: use a traditional timing approach for measuring minor optimizations. However, that's currently hard to do for async functions-- which was one of the motivations for exploring profilers. So I hope the introduction of task-friendly timing primitives will help the situation (e.g. task_perf_counter).
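
As one possible direction (a sketch of my own, not the task_perf_counter implementation referred to above), Trio's instrumentation hooks already expose enough information to approximate per-task running time:

```python
# Sketch only: approximate per-task wall time spent actually running, using
# Trio's real Instrument hooks (before_task_step / after_task_step). This is
# not task_perf_counter, just an illustration of what the scheduler exposes.
import time
from collections import defaultdict

import trio

class TaskRunTime(trio.abc.Instrument):
    def __init__(self):
        self.totals = defaultdict(float)   # task name -> seconds spent running
        self._started = {}

    def before_task_step(self, task):
        self._started[task] = time.perf_counter()

    def after_task_step(self, task):
        start = self._started.pop(task, None)
        if start is not None:
            self.totals[task.name] += time.perf_counter() - start

async def main():
    await trio.sleep(0.1)

instrument = TaskRunTime()
trio.run(main, instruments=[instrument])
for name, seconds in instrument.totals.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```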

njsmith commented 5 years ago

It's definitely true that sampling profilers can't be used on their own to measure speedups, because they don't measure speed. I feel like the most classic boring approach to optimization is:

  1. Make a benchmark that measures something you care about. (For the kinds of problems that sampling profilers are aimed at, the "thing you care about" is usually some kind of CPU-bound throughput.)
  2. Use a sampling profiler to get insight into what's impacting the number you measured in step (1).
  3. Use the insight to change the code.
  4. Run the benchmark again to measure the impact of your changes.

There are a lot of other approaches too of course – e.g. sampling profilers usually sample a random subset of CPU cycles, so they're most useful to discover where CPU cycles are going. Another popular approach for production distributed systems is to sample a random subset of requests, and take detailed traces of them, so that later on you can ask questions like "which % of our requests hit this cache?" or "for requests in the 99th latency percentile, which backend service are they getting stuck waiting for?"

I don't think I have enough information to understand why it's difficult to set up a reproducible benchmark for your system. But I agree that it's hard to do performance work without one!
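
As a concrete (if contrived) example of step 1 above, a repeatable CPU-bound Trio benchmark can be as simple as timing a fixed workload; the workload below is a placeholder for whatever you actually care about:

```python
# Placeholder benchmark: pass a fixed number of messages through a memory
# channel and time the whole run; run it before and after a change.
import time

import trio

async def workload():
    send, recv = trio.open_memory_channel(0)

    async def producer():
        for i in range(10_000):
            await send.send(i)
        await send.aclose()

    async with trio.open_nursery() as nursery:
        nursery.start_soon(producer)
        async for _ in recv:
            pass

start = time.perf_counter()
trio.run(workload)
print(f"workload took {time.perf_counter() - start:.3f} s")
```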

belm0 commented 5 years ago

Writing benchmarks is a tall order for many scenarios-- application-level code under rapid development, not-very-engineered domain-specific components, 3rd party code, ... Even for well-defined pieces like a networking library, it's hard to write a benchmark which represents reality, or which has the proper tooling for record and replay of real input.

I've only had task_perf_counter() for one day, and it became instantly useful.

case 1: the unneeded lock

While reviewing our websocket integration, I realized that we have a lock on send which is redundant, because trio-websocket is locking internally. Original method:

    async def send(self, cmd):
        send_metric.inc()
        send_bytes_metric.inc(len(cmd))
        async with self._send_lock:
            await self._ws.send_message(cmd)

So I'm curious-- what was the CPU cost of that lock? I use TrioPerfTimer, a context manager (applied here as a decorator) which uses task_perf_counter():

    @TrioPerfTimer('ws_send')
    async def send(self, cmd):
        send_metric.inc()
        send_bytes_metric.inc(len(cmd))
        async with self._send_lock:
            await self._ws.send_message(cmd)

and run the full application, as-is. On exit:

INFO perf timer "ws_send": avg 0.258, 50% 0.233, 90% 0.414, 95% 0.506 ms in 7819 runs

run again with the lock removed:

INFO perf timer "ws_send": avg 0.229, 50% 0.206, 90% 0.358, 95% 0.443 ms in 6004 runs

so the answer is ~30 usec per call.
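
For readers without TrioPerfTimer, a much-simplified hypothetical stand-in could look like the sketch below; unlike the real thing it uses plain time.perf_counter, so it also counts time the task spends suspended:

```python
# Hypothetical, simplified stand-in for a TrioPerfTimer-style decorator.
# A task_perf_counter-based timer would exclude time spent suspended.
import functools
import statistics
import time

def perf_timer(name):
    samples = []

    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return await fn(*args, **kwargs)
            finally:
                samples.append(time.perf_counter() - t0)

        def report():
            if samples:
                print(f'perf timer "{name}": avg '
                      f'{statistics.mean(samples) * 1e3:.3f} ms '
                      f'in {len(samples)} runs')

        wrapper.report = report
        return wrapper

    return decorator
```

Applied as @perf_timer('ws_send') on the method above, with report() called on exit, it produces output of roughly the shape shown.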

case 2: the not-so-regression

The sampling profiler appeared to highlight a regression in the receive-side code of trio-websocket and/or wsproto. I ran a number of sessions, and it seemed real. It's 3rd party code, and neither library has a benchmark suite.

So I pulled the trio-websocket source into my project, added TrioPerfTimer around the suspected code (body of reader_task loop), and ran the actual application with the old and new versions of trio-websocket and wsproto. Old:

INFO perf timer "reader_task_3": avg 0.668, 50% 0.354, 90% 0.610, 95% 0.738 ms in 6642 runs

New:

INFO perf timer "reader_task_3": avg 0.655, 50% 0.349, 90% 0.570, 95% 0.700 ms in 7326 runs

After several runs yielding similar results, the answer: no regression.

Being able to ask targeted performance questions of a real running application this easily is indispensable.

belm0 commented 5 years ago

For "case 2" above, I went back to look at how the sampling profiler output may have been misleading.

Perhaps it's counting some I/O blocking cases as using CPU. Specifically, I'm suspecting the "empty time" of Trio's _epoll_wait() and/or wait_task_rescheduled() (i.e. the periods where these calls are exactly at the bottom of the call stack). I don't understand what mysterious thing is consuming about 50% of _epoll_wait() execution time.

(I should take longer captures in the future-- _epoll_wait stacks only consist of 13 samples here.)

[screenshot: _epoll_wait call stacks]

njsmith commented 5 years ago

_epoll_wait does several things besides calling update_registrations and _wait_task_rescheduled... there are another ~10 lines of code in there. So I guess it's at least possible that it really does have 50% "self" time? And wait_task_rescheduled does have to allocate an object and might be getting dinged for the coroutine send/resume overhead? But this is wild guessing; I don't really know. To know more I think you'd have to dig into the details of how your profiler collects stack traces :-/

belm0 commented 5 years ago

Profiler is vmprof 0.4.15 on x86_64 GNU/Linux using default settings.

I might try instrumenting _epoll_wait() with TrioPerfTimer to get some opinion on actual timing.

belm0 commented 5 years ago

timing results from _epoll_wait:

"epoll_wait                           ": 50% ≤ 48.7 µs, 90% ≤ 74.7 µs, 95% ≤ 86.5 µs
"epoll_wait/wait_task_rescheduled     ": 50% ≤ 27.5 µs, 90% ≤ 44.4 µs, 95% ≤ 53.8 µs
"epoll_wait/sans_wait_task_rescheduled": 50% ≤ 20.8 µs, 90% ≤ 35.3 µs, 95% ≤ 42.0 µs
"epoll_wait/_update_registrations     ": 50% ≤ 13.5 µs, 90% ≤ 22.7 µs, 95% ≤ 28.2 µs

the BusyResourceError case was never hit

sumerc commented 4 years ago

Hey everyone,

I am the author of yappi and have just seen this thread by chance. I have been working on coroutine-aware profiler support (for asyncio) for wall-time profiling for a while now, and I would definitely like to integrate Trio, too. I have not been able to grasp all the details in this thread, so could you give me a small summary of what you would like to achieve? Do you still want to see coroutine-aware deterministic wall/CPU time profiling?

And for the current status: please see this: https://github.com/sumerc/yappi/issues/21
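
For context, basic yappi usage in wall-clock mode looks roughly like the sketch below; how the coroutine-aware support discussed in the linked issue will aggregate results for Trio is exactly the open question:

```python
# Rough sketch of plain yappi wall-clock profiling around a Trio program.
import trio
import yappi

async def main():
    await trio.sleep(0.1)

yappi.set_clock_type("wall")   # or "cpu"
yappi.start()
trio.run(main)
yappi.stop()
yappi.get_func_stats().print_all()
```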