prompt-toolkit / python-prompt-toolkit

Library for building powerful interactive command line applications in Python
https://python-prompt-toolkit.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

patch_stdout is surprisingly slow #682

Open vlovich opened 6 years ago

vlovich commented 6 years ago

patch_stdout slows down the program by ~5x when there's a lot of output, which seems like a lot of overhead given that what it's trying to do doesn't seem conceptually complicated.

from prompt_toolkit.completion import WordCompleter
from prompt_toolkit.eventloop.defaults import use_asyncio_event_loop
from prompt_toolkit.patch_stdout import patch_stdout
from prompt_toolkit.shortcuts import CompleteStyle
from prompt_toolkit.shortcuts.prompt import prompt
from threading import Thread
from prompt_toolkit.application import get_app

import sys

# Print ~1000 lines to stdout as fast as possible, then shut the prompt application down.
def spam_print():
  i = 1
  while True:
    print("This is spam", i)
    i += 1
    if i == 1000:
      try:
        get_app(raise_exception=True).exit(exception=KeyboardInterrupt())
      except:
        sys.exit(0)
      return

thread = None

# Keep re-running the prompt while the background thread spams stdout.
while True:
  with patch_stdout():
  # with open("/dev/null"):  # swap in for the baseline measurement without patch_stdout
    if not thread:
      thread = Thread(target=spam_print)
      thread.start()
    prompt("> ", rprompt="test", complete_style=CompleteStyle.MULTI_COLUMN,
           completer=WordCompleter(list(chr(c) * 20 for c in range(ord("a"), ord("z")))))

time python3 test.py

Without patch_stdout: ~0.2s. With patch_stdout: ~1s (including getting hit by #681).

I've attached the captured profile, but viewing it in snakeviz doesn't immediately point me to the source of the performance issue (aside from showing that there does appear to be a hotspot). Maybe run_until_complete in patch_stdout prints one line at a time instead of writing all the lines that have been buffered at once?

profile.bin.zip

profile_without_patch_stdout.bin.zip

jonathanslenders commented 6 years ago

This largely depends on the number of print statements, not so much on the length of the output text. For every individual write to stdout, the prompt is erased, the output is written to the real stdout, and the prompt is rendered again. Prompt rendering will probably take most of the time here.

The rendering is highly optimized, so I don't think that part can be improved. But what we could do is use timers, and only render to stdout when no new output has been written for something like 200ms, and then write everything at once. This will introduce a little more latency for the simple cases and make the capturing more complex, but once we have that, using patch_stdout shouldn't affect the performance anymore.
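
A minimal sketch of that debounce idea, independent of prompt_toolkit's internals: a writer that buffers incoming text and only hands it off for a single write-plus-redraw once no new output has arrived for a short interval. The class name, the flush_output callback, and the 0.2s default are illustrative assumptions, not the library's API.

import threading

class DebouncedWriter:
  """Buffer writes and flush them in one batch after a quiet period.

  flush_output is a hypothetical callback that would write the batched
  text to the real stdout and schedule a single prompt redraw.
  """
  def __init__(self, flush_output, delay=0.2):
    self._flush_output = flush_output
    self._delay = delay
    self._lock = threading.Lock()
    self._buffer = []
    self._timer = None

  def write(self, data):
    with self._lock:
      self._buffer.append(data)
      # Restart the quiet-period timer on every write.
      if self._timer is not None:
        self._timer.cancel()
      self._timer = threading.Timer(self._delay, self._flush)
      self._timer.daemon = True
      self._timer.start()

  def _flush(self):
    with self._lock:
      data, self._buffer = ''.join(self._buffer), []
      self._timer = None
    if data:
      self._flush_output(data)

With something like this, a burst of a thousand prints collapses into one erase/write/redraw cycle instead of a thousand, at the cost of up to delay seconds of extra latency on the last line.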

vlovich commented 6 years ago

I think even just batching it would help drastically, because right now it happens one line at a time (it finds a line in stdout, then schedules an append + repaint).

vlovich commented 6 years ago

NVM, you're right: a timeout timer is needed here. However, I tried adding that and it didn't seem to really help, so I think there's something else going on.

vlovich commented 6 years ago

Found a workaround, at least when printing through logging:

import logging
import sys

class PatchableStderrStream(object):
  def write(self, data):
    # Resolve sys.stderr at write time rather than capturing it up front.
    sys.stderr.write(data)

  def flush(self):
    # Swallow flushes; forwarding them (uncomment below) brings the
    # performance problem back:
    # sys.stderr.flush()
    pass

stdout_ch = logging.StreamHandler(PatchableStderrStream())
logging.getLogger().addHandler(stdout_ch)

This fixes the performance issue we see when using logging.

vlovich commented 5 years ago

Tracked this down a bit more. Here are the hotspots I've found:

HSplit._divide_heights. This is called very frequently when there are lots of things printing to the screen. The information it computes doesn't change frequently (ever?), so caching it would save quite a bit. All the CPU time appears to be spent calling preferred_height for the children; I'm guessing that preferred_height can be memoized within each object. As a coarse optimization I changed _divide_heights to only recompute once a second, and that provided a 2x gain (20% down to ~13%); a generic sketch of that kind of time-based caching is at the end of this comment.

In HSplit.write_to_screen, if I completely skip the "# Draw child panes." loop, everything still renders correctly AFAICT, and its share of the CPU time drops from 20% to 5%.

I suspect that the prompt session is creating a complex layout that's expensive to render and that it isn't memoizing as much as it could in various places.
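
For illustration, here's a generic sketch of the coarse time-based caching mentioned above. It touches no prompt_toolkit internals; the decorator name and the once-per-second interval are assumptions.

import time

def recompute_at_most_every(seconds):
  """Cache a method's result on the instance and recompute it at most once
  per `seconds`. Arguments are ignored, which is only reasonable for values
  that change rarely, as _divide_heights' output is suspected to."""
  def decorator(func):
    attr = '_cached_' + func.__name__

    def wrapper(self, *args, **kwargs):
      now = time.monotonic()
      cached = getattr(self, attr, None)
      if cached is None or now - cached[0] >= seconds:
        cached = (now, func(self, *args, **kwargs))
        setattr(self, attr, cached)
      return cached[1]
    return wrapper
  return decorator

Applied to something like _divide_heights, this trades up to a second of layout staleness for skipping the repeated preferred_height walks; a proper fix would instead invalidate the cache whenever the layout or terminal size actually changes.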

vlovich commented 5 years ago

One easy top-level fix is to buffer output in between frames in patch_stdout. Here's a stderr handler that I direct my stdout logging StreamHandler to: https://gist.github.com/vlovich/9c5aecda35a287276455e0a6323e9ba8

It requires running prompt_toolkit on asyncio, but I'm sure there's a way to adopt something like this natively within patch_stdout (or lower).
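
The gist itself isn't reproduced here, but the rough shape of such a handler could look like the following: a file-like object that accumulates writes and lets a periodic asyncio task forward them to the real stream in one batch. The class name, the 0.1s interval, and the sys.__stderr__ target are illustrative assumptions, not the gist's actual code.

import asyncio
import sys

class BufferedAsyncStream:
  """File-like object that batches writes and flushes them periodically
  from the asyncio event loop, so each log record doesn't trigger its
  own erase/redraw cycle."""

  def __init__(self, target=sys.__stderr__, interval=0.1):
    self._target = target
    self._interval = interval
    self._buffer = []

  def write(self, data):
    self._buffer.append(data)

  def flush(self):
    # Per-record flushes from logging are ignored; flushing happens
    # in the periodic task instead.
    pass

  async def run(self):
    # Periodically drain the buffer and write everything in one go.
    while True:
      await asyncio.sleep(self._interval)
      pending, self._buffer = self._buffer, []
      if pending:
        self._target.write(''.join(pending))
        self._target.flush()

Wiring it up would amount to pointing a logging.StreamHandler at an instance and scheduling stream.run() with asyncio.ensure_future on the same event loop that runs the prompt.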