prompt-toolkit / pyvim

Pure Python Vim clone.

PyVim Slows Down on Large Files #131

Open · SqrtRyan opened this issue 4 years ago

SqrtRyan commented 4 years ago

I'm considering building a new text editor with prompt_toolkit, and I used PyVim to see how practical that would be. Unfortunately, although PyVim is wonderful, it chokes on files of 10,000 lines and up, but only when syntax highlighting is enabled. Native Vim doesn't slow down on the same file, or even on files of 100,000 lines and up (I'm talking about the time it takes to insert or delete a character).

I think it has to do with the way prompt_toolkit handles syntax highlighting via Pygments, and that it isn't caching results as well as it could: it slows to unbearable speeds once you give it enough code, whereas Vim seems to maintain a constant speed regardless of how many lines of code are in the file you're editing. Because of this, I've decided to hold off on building a text editor with prompt_toolkit until I'm confident that its time complexity for syntax highlighting is not a function of file size, or at least is so fast that it doesn't feel laggy.

Is there any way to get around this? (I want to edit large files in a prompt_toolkit buffer with syntax highlighting.)

Thank you, Ryan

SqrtRyan commented 4 years ago

I also wanted to add that there are other Python-based text editors, such as Suplemon, that don't slow down when editing large files with syntax highlighting (Suplemon isn't written with prompt_toolkit, though). I really want to use prompt_toolkit, though, because it's compatible with everything else I want, such as PtPython, autocomplete prompts, etc.

TheFern2 commented 4 years ago

One file with 10,000 LOC, that's a problem in itself lol.

alexzanderr commented 2 years ago

One file with 10,000 LOC, that's a problem in itself lol.

Totally disagree. What if you want to inspect a JavaScript library like jQuery or Ajax, just to analyze the code?

alexzanderr commented 2 years ago

Anyway, the performance is crap even with a 300 LOC file. Have you tried running this with PyPy3, compiling the code with PyInstaller or Nuitka, or using Numba's JIT to improve performance?

jonathanslenders commented 2 years ago

Hi all,

The main reason the performance suffers on big files is the way the editor buffer is stored in prompt_toolkit. We use a simple Python string to represent the buffer content. But Python strings are immutable, so every modification (like typing a single character) involves copying the whole string into a new one. That doesn't scale to big files.

To work around this, prompt_toolkit should use a "rope" or similar data structure, but this is far from trivial. Almost all code (like regex search, etc...) operates on Python strings.
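
As a rough sketch of what such a structure buys (hypothetical code, not prompt_toolkit's API; the class and its methods are invented for this example): even keeping one Python string per line makes a single-character edit independent of total file size, and a real rope generalizes this with a balanced tree of chunks.

class LineBuffer:
    """Toy line-based buffer: an edit copies one line, not the whole document."""

    def __init__(self, text):
        self.lines = text.split('\n')  # one Python string per line

    def insert(self, row, col, text):
        # Only the edited line is rebuilt; every other line is untouched.
        line = self.lines[row]
        self.lines[row] = line[:col] + text + line[col:]

    def delete(self, row, col, count=1):
        line = self.lines[row]
        self.lines[row] = line[:col] + line[col + count:]

    def text(self):
        # Materializing the full document (e.g. for a regex search) is
        # still O(n) -- this is the hard part mentioned above.
        return '\n'.join(self.lines)

buf = LineBuffer(('b' * 1000 + '\n') * 10000)
buf.insert(0, 0, 'a')  # copies ~1,000 characters, not ~10,000,000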

Syntax highlighting is not an issue. Depending on the file type, we have synchronization points; prompt_toolkit looks for a starting point close to the cursor position from which to begin highlighting: https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/prompt_toolkit/lexers/pygments.py#L113
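
To illustrate the idea (a simplified, hypothetical sketch, not the implementation linked above): for Python source, a line that starts a class or def is a safe place to restart the lexer, so highlighting the visible region never requires tokenizing everything above it.

import re
from pygments import lex
from pygments.lexers import PythonLexer

SYNC = re.compile(r'^\s*(class|def)\s')  # safe restart points for Python

def highlight_viewport(lines, first_visible, last_visible, max_scan=500):
    # Walk backwards from the viewport (bounded) looking for a sync point.
    for row in range(first_visible, max(-1, first_visible - max_scan), -1):
        if SYNC.match(lines[row]):
            start = row
            break
    else:
        start = max(0, first_visible - max_scan)

    # Tokenize only from the sync point through the visible region.
    visible = '\n'.join(lines[start:last_visible + 1])
    return list(lex(visible, PythonLexer()))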

Now, I think it's important to note that prompt_toolkit was not designed from the ground up to be a real text editor. It was mainly a Readline replacement. It just happened to include all the capabilities needed to build a text editor on top of it, as long as the size of the text files was reasonable.

Now, if somebody is willing to implement a "rope" or similar data structure (I don't know the trade-offs), ideally without third-party dependencies, in a clean way, unit-tested, and backward-compatible with the rest of the code, I would consider adopting it. But it's certainly not trivial, especially since Python's regex engine expects plain strings.
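
To make that regex constraint concrete (a small illustration; the chunk list below is just a stand-in for whatever rope structure would be adopted): Python's re module only accepts plain str or bytes-like objects, so a rope-backed buffer would either have to join its chunks into one string before every search, which is O(n) again, or ship its own search implementation.

import re

# Stand-in for rope leaves: re cannot search across them directly,
# so the chunks must first be joined into one plain string.
chunks = ['def foo():\n', '    pass\n']
match = re.search(r'def\s+(\w+)', ''.join(chunks))  # O(n) materialization
print(match.group(1))  # prints: foo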

alexzanderr commented 2 years ago

You didn't answer my questions...

RyannDaGreat commented 2 years ago

@jonathanslenders Hi, thank you for responding!

I don't think the current bottleneck is caused by the immutability of strings; I still think it's because we're retokenizing the buffer each time an edit is made.

The main reason I think syntax highlighting is the culprit is that with syntax highlighting off, it gets a lot faster; edits only seem to lag badly when syntax highlighting is turned on. This hypothesis was supported when I used py-spy to profile PyVim while it was running: it showed Pygments code taking up a lot of time.

I understand the reasoning, though: the time complexity of making any tiny edit to an immutable string is O(n) with respect to the length of the string, and using a rope data structure could turn that into O(length of the edited line + log(number of lines)). However, Python's string editing is very fast. And though I agree that to make this a truly fast text editor we'll probably need some data structure like that (perhaps using the blist module), I don't think it's the current bottleneck.

For example, try running this code:

import time

# 10,000 lines of 1,000 characters each: roughly 10 MB of text.
s = ('b' * 1000 + '\n') * 10000

start = time.time()
for _ in range(1000):
    # Worst case for an immutable string: prepending forces a full copy.
    s = 'a' + s
end = time.time()
print(end - start)

This takes a string with 10,000 lines, each 1,000 characters long, and makes 1,000 worst-case modifications to it (prepending a character to the string, which Python handles much more slowly than appending one). On my computer this takes almost exactly one second (it prints 1.0464394), so each edit took about a millisecond. Surely these 1,000 edits are far more work than what happens on a single keystroke in a 10,000-line file? Yet the lag when editing large files in PyVim with syntax highlighting turned on can exceed 0.25 seconds on my computer.
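
For comparison, here is a rough sketch of what retokenizing the same amount of text with Pygments costs (pygments.lex is the real API; exact numbers will vary by machine):

import time
from pygments import lex
from pygments.lexers import PythonLexer

# Same shape as the benchmark above: 10,000 lines of 1,000 characters.
s = ('b' * 1000 + '\n') * 10000

start = time.time()
tokens = list(lex(s, PythonLexer()))  # one full retokenization of the buffer
print(time.time() - start, len(tokens))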

I'm not sure how this performance would generalize to regex searches, but if I recall correctly, basic edits in the Buffer class, such as inserting or deleting characters, don't use the regex module. These basic edits (inserting characters, using the arrow keys, pressing backspace) are what I was testing when I experienced the lag.

I don't know how to fix this problem, but I'm hoping this will point us in the right direction.

You did mention synchronization points, though. I'm not quite sure what that means, but does it mean the entire string doesn't need to be retokenized on every modification?

jonathanslenders commented 2 years ago

@alexzanderr: Actually, I have tried PyPy, with success. I don't recall any numbers, but it was definitely a bit faster than CPython back then.

@RyannDaGreat: Yes, this benchmark looks pretty fast indeed. I have to admit that I don't recall the details. My conclusion was also that it was in general not worth the effort to choose another data structure, because Python strings are actually really fast. But there is also memory: for big files, memory usage grows very quickly while editing. Pygments is indeed slow at tokenizing, which is why I added those synchronization points to the lexer, so that we only have to highlight the visible region. What file type are you testing with?

RyannDaGreat commented 2 years ago

@jonathanslenders I was testing it with Python files, in particular a few large ones such as

https://github.com/RyannDaGreat/rp/blob/master/r.py