Idea: Produce intermediate representation from parsing

abingham commented 3 years ago

The core of this proposal is to introduce an intermediate form of parsed data between the stream and the screen. Rather than the stream feeding its parsed results directly to the screen, it would generate a stream of objects representing the parsed data, and these could be forwarded to the Screen API and potentially to other clients. This IR could also be stored, analyzed, replayed, etc.
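To make the idea concrete, here is a minimal sketch of what such an IR could look like. The `Event` class, the `replay` helper, and `FakeScreen` are hypothetical names made up for illustration; nothing here is part of pyte, and the shape of a real IR would likely differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Event:
    """One parsed unit: a handler name plus its arguments (hypothetical IR shape)."""
    name: str          # e.g. "draw" or "set_title"
    args: Tuple = ()


def replay(events: Iterable[Event], listener) -> None:
    """Forward stored events to any object exposing matching methods."""
    for event in events:
        getattr(listener, event.name)(*event.args)


class FakeScreen:
    """Stand-in for a screen; it only remembers which methods were called."""

    def __init__(self):
        self.calls = []

    def draw(self, text):
        self.calls.append(("draw", text))

    def set_title(self, title):
        self.calls.append(("set_title", title))


# A parser would emit events like these instead of calling the screen directly.
ir = [Event("set_title", ("new title",)), Event("draw", ("red",))]

listener = FakeScreen()
replay(ir, listener)
print(listener.calls)
```

Because the events are plain immutable objects, the same list could be logged, filtered, or replayed against several listeners.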

This idea came out of some work I was doing to learn more about control codes. In particular, I borrowed heavily (stole) from pyte's Stream class in my Parser implementation. I think this kind of thing could be introduced to pyte with full backwards compatibility, and it would mean I wouldn't need to duplicate Stream. I know this by itself isn't a very compelling argument for modifying pyte, but it might be useful in pyte as well (e.g. I saw some issues related to improving debugging).

In any event, I thought I'd float the idea and see what you thought. I should be able to do most of the coding, though of course I'd appreciate any guidance you've got.

huangyunict commented 2 years ago

+1 to this idea. In the current pyte implementation, the stream and the screen are tightly coupled, so it is difficult to inject a customized processor between them.

milahu commented 2 years ago

factory functions can help. this should have the same performance as the current impl

concept for a "reusable parser"

import re
import types

class Stream:
    def __init__(
        self,
        handle_text=None,
        parse_controls=None,
    ):
        class mock_listener:
            def draw(self, string):
                print("listener.draw", string)
        self.listener = mock_listener()
        self._text_pattern = re.compile("[a-z]+")
        self._taking_plain_text = True
        feed = self.make_feed(handle_text, parse_controls)
        self.feed = types.MethodType(feed, self) # bind method to instance

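    def _send_to_parser(self, char):
        print("_send_to_parser", char)
        # feed() stores the return value as taking_plain_text, so the mock
        # must return a bool; True lets feed() go back to matching plain text.
        return True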

    def make_feed(self, handle_text, parse_controls):
        """make a feed function for stream.feed(data)"""
        if not handle_text:
            handle_text = self.listener.draw
        if not parse_controls:
            parse_controls = self._send_to_parser
        def feed(self, data):
            """Consume some data and advance the state as necessary.

            :param str data: a blob of data to feed from.
            """
            send = parse_controls #send = self._send_to_parser
            draw = handle_text #draw = self.listener.draw
            match_text = self._text_pattern.match
            taking_plain_text = self._taking_plain_text

            length = len(data)
            offset = 0
            while offset < length:
                if taking_plain_text:
                    match = match_text(data, offset)
                    if match:
                        start, offset = match.span()
                        draw(data[start:offset])
                    else:
                        taking_plain_text = False
                else:
                    taking_plain_text = send(data[offset:offset + 1])
                    offset += 1

            self._taking_plain_text = taking_plain_text
        return feed

stream = Stream()

stream.feed("asdf 0123")

the parser is already generated by

self._parser = self._parser_fsm()

so we just need to modify _parser_fsm to accept custom handler functions

milahu commented 2 years ago

already possible: use a custom screen object

# pyte/test_parser.py
# python3 -m pyte.test_parser

# ansi color codes https://gist.github.com/Prakasaka/219fe5695beeb4d6311583e79933a009

#from pyte.screens import Screen, DiffScreen, HistoryScreen, DebugScreen
from .screens import Screen, DiffScreen, HistoryScreen, DebugScreen

#from pyte.streams import Stream, ByteStream
from .streams import Stream, ByteStream

terminal_width = 40
terminal_height = 4

class CustomScreen(Screen):
    def draw(self, *args):
        print("custom listener: draw", repr(args))
        super().draw(*args)
    def set_title(self, *args):
        print("custom listener: set_title", repr(args))
    def select_graphic_rendition(self, *args):
        print("custom listener: select_graphic_rendition", repr(args))

screen = CustomScreen(terminal_width, terminal_height)
stream = ByteStream(screen)

stream.feed(b"".join([
    b"\x1b", # esc = \e
    b"]", # osc
    b"2;new title", # params: 2, "new title"
    b"\x07", # bel = \a -> end of string

    b"\x1b", # esc
    b"[", # csi
    b"0;31", # params: 0, 31 -> red
    b"m", # select_graphic_rendition

    b"red", # text

    b"\x1b[0;32m", # esc csi green

    b"green", # text

    b"\x1b[0m", # reset style

    b"default", # text
]))

term_lines = screen.display[:] # copy array
for line_idx, line in enumerate(term_lines):
    print(f"{line_idx:4d} {line} ¶")

output

custom listener: set_title ('new title',)
custom listener: select_graphic_rendition (0, 31)
custom listener: draw ('red',)
custom listener: select_graphic_rendition (0, 32)
custom listener: draw ('green',)
custom listener: select_graphic_rendition (0,)
custom listener: draw ('default',)
   0 redgreendefault                          ¶
   1                                          ¶
   2                                          ¶
   3                                          ¶

superbobry commented 2 years ago

As @milahu points out, this should be doable without any changes to pyte.

The coupling between Stream and Screen is tight in the sense that the names of the event handlers are fixed, but Stream does not assume anything about the implementation of Screen. So, you could have a custom Screen class which emits IR instructions instead of doing buffer manipulations. pyte.DebugScreen already does something like that, except that it logs the intercepted events to stderr.
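A rough sketch of that approach follows. `IRScreen` is a hypothetical class, not part of pyte, and it assumes only what is described above: that the stream looks up handler methods by name on its listener. The stream is simulated here by calling the handlers directly, so the example runs without pyte installed.

```python
class IRScreen:
    """Records every handler call as a (name, args, kwargs) tuple.

    Hypothetical stand-in for a custom Screen: it does no buffer
    manipulation and simply collects events.
    """

    def __init__(self):
        self.events = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. handler names.
        if name.startswith("_"):
            raise AttributeError(name)

        def handler(*args, **kwargs):
            self.events.append((name, args, kwargs))

        return handler


screen = IRScreen()

# Simulate the calls a stream would make for "\x1b[0;31mred".
screen.select_graphic_rendition(0, 31)
screen.draw("red")

for event in screen.events:
    print(event)
```

Using `__getattr__` avoids listing every handler name by hand, at the cost of silently accepting misspelled names.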

abingham commented 2 years ago

> So, you could have a custom Screen class

This was exactly the approach I took at first. It turned out that didn’t give me everything I needed, though. In particular, the information about precisely which bytes were parsed for each call to a Screen method was lost. I suspect that pyte itself wouldn’t benefit greatly from providing this kind of information, though, so there may not be a compelling argument for making it here.

milahu commented 2 years ago

> precisely which bytes were parsed for each call to a Screen method

doable with near-zero overhead

https://github.com/milahu/pyte/tree/parser-pass-token-source

edit: fixed an edge case where a token spans two data buffers

$ git checkout master
$ BENCHMARK=tests/captured/htop.input python benchmark.py
htop.input->Screen: Mean +- std dev: 144 ms +- 5 ms
htop.input->DiffScreen: Mean +- std dev: 145 ms +- 5 ms
htop.input->HistoryScreen: Mean +- std dev: 378 ms +- 9 ms

$ git checkout parser-pass-token-source
$ BENCHMARK=tests/captured/htop.input python benchmark.py
htop.input->Screen: Mean +- std dev: 144 ms +- 5 ms
htop.input->DiffScreen: Mean +- std dev: 145 ms +- 4 ms
htop.input->HistoryScreen: Mean +- std dev: 379 ms +- 11 ms

example use

```py
class CustomScreen(Screen):
    last_offset = 0
    def draw(self, *args, source=""):
        print("custom listener: draw", repr(args))
        # source == args[0]
        super().draw(*args)
    def set_title(self, *args, source=""):
        print("custom listener: set_title", repr(args), "source", repr(source))
    def select_graphic_rendition(self, *args, source=""):
        print("custom listener: select_graphic_rendition", repr(args), "source", repr(source))

screen = CustomScreen(terminal_width, terminal_height)
stream = ByteStream(screen)

# ...
# same code as above
```

output

```
custom listener: set_title ('new title',) source '\x1b]2;new title\x07'
custom listener: select_graphic_rendition (0, 31) source '\x1b[0;31m'
custom listener: draw ('red',)
custom listener: select_graphic_rendition (0, 32) source '\x1b[0;32m'
custom listener: draw ('green',)
custom listener: select_graphic_rendition (0,) source '\x1b[0m'
custom listener: draw ('default',)
   0 redgreendefault                          ¶
   1                                          ¶
   2                                          ¶
   3                                          ¶
```
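The edge case mentioned in the edit note above (a token split across two data buffers) can be illustrated with a small standalone sketch. This is not the code from the linked branch: `SourceTrackingSplitter` and `TOKEN` are hypothetical names, and the pattern only understands simple CSI sequences and plain text.

```python
import re

# One complete CSI sequence (ESC [ params final-letter) or a run of plain text.
# Deliberately simplified: real terminals have many more sequence types.
TOKEN = re.compile(r"\x1b\[[0-9;]*[A-Za-z]|[^\x1b]+")


class SourceTrackingSplitter:
    """Splits input into (kind, source) pairs, keeping the exact source text."""

    def __init__(self):
        self._pending = ""  # incomplete escape sequence carried to the next feed()

    def feed(self, data):
        """Return (kind, source) pairs; kind is "csi" or "text"."""
        data = self._pending + data
        self._pending = ""
        tokens = []
        offset = 0
        while offset < len(data):
            match = TOKEN.match(data, offset)
            if match is None:
                # No complete token here: assume the sequence was cut off by
                # the buffer boundary and keep the rest for the next call.
                # (No recovery from malformed input in this sketch.)
                self._pending = data[offset:]
                break
            source = match.group()
            kind = "csi" if source.startswith("\x1b") else "text"
            tokens.append((kind, source))
            offset = match.end()
        return tokens


splitter = SourceTrackingSplitter()
first = splitter.feed("red\x1b[0;3")   # the CSI sequence is cut off here
second = splitter.feed("2mgreen")      # ... and completed here
print(first)
print(second)
```

The second call reports the full `'\x1b[0;32m'` as the source of the CSI token, even though its bytes arrived in two separate buffers.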

> I suspect that pyte itself wouldn’t benefit greatly from providing this kind of information, though, so there may not be a compelling argument for making it here.

yep, for pyte this is just wasted cpu time, but it would be nice to use the pyte source to compile such a parser: https://stackoverflow.com/questions/56487216/how-can-i-convert-python-code-into-a-parse-tree-and-back-into-the-original-code

selectel / pyte

Idea: Produce intermediate representation from parsing #147