prompt-toolkit / pypager

A $PAGER in pure Python, similar to "less".
BSD 3-Clause "New" or "Revised" License
85 stars, 19 forks

Viewing Large Files #7

Open haydenflinner opened 4 years ago

haydenflinner commented 4 years ago

Hi,

First, I just wanted to say great job on this app. The code here as well as in prompt-toolkit is exactly the sort of idiomatic code I thought must be out there somewhere, and I'm glad I put off my less rewrite until I found this. There's only one thing missing from your pager that I need from less: viewing large files.

Less, when you press G, does a seek to the end of the file. Then it tries to calculate line numbers for you if enabled, but you can CTRL+C to stop that and just show the last screenful of the file. Reading the code, it seems that it keeps a sort of linked list of loaded portions of the file for quick jumping around, e.g. G gg is near instant no matter the size of the file.

In pypager, I've found that pressing G takes me to some point deep in the file (I assume wherever a read timeout expired), and pressing G again takes me deeper still, but not yet to the end. From reading the code I haven't seen any special approach for handling large files; it seems the file is just treated as one big text buffer.

I'd like to implement low-resource reading of large files (Unix-only for me), and I was wondering if you had any thoughts on where to get started, or on gotchas that make it exceedingly difficult. I'm thinking I will start by trying a simple mmapped file, so that G does in fact go to the end of the file, then see how searching performs. I think mmap's caching will be enough to get a 90% improvement. If that's not good enough, I can look at the more advanced semantic caching that less does, plus a scheme of my own that I think less should use but doesn't.
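As a rough sketch of the "G goes to the end of the file" idea with a plain mmap (not pypager or less code; the function name `last_lines` is made up for illustration), one could scan backwards from the end of the file for newlines and decode only the final screenful:

```python
import mmap
import os


def last_lines(path, count):
    """Return the last `count` lines of `path` by scanning backwards from EOF."""
    if count <= 0 or os.path.getsize(path) == 0:
        # mmap cannot map an empty file, so handle that case up front.
        return []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        end = len(mm)
        # Ignore a single trailing newline so it does not count as an empty line.
        if mm[end - 1:end] == b"\n":
            end -= 1
        start = end
        for _ in range(count):
            # rfind searches bytes [0, start) and returns -1 when no newline is left.
            newline = mm.rfind(b"\n", 0, start)
            if newline == -1:
                start = 0  # fewer than `count` lines: take everything
                break
            start = newline
        else:
            start += 1  # step past the newline that precedes the first wanted line
        # Only this final slice is copied out of the mapping and decoded.
        return mm[start:end].decode("utf-8", errors="replace").split("\n")
```

Only the pages near the end of the file are touched, so the cost should depend on the screen height rather than the file size. Line numbers are not known at this point, which matches the less behaviour of computing them separately.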

jonathanslenders commented 4 years ago

Supporting large files should be possible with several changes. We probably can't use BufferControl anymore, and should use FormattedTextControl instead.

There are indeed tricks with mmap: we can build an index of line endings by running a regex over an mmapped file, then read only the lines that need to be displayed. This is pretty efficient and supports multi-gigabyte files. Right now, pypager's limit is around a few thousand lines.
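A minimal sketch of that line-index idea, using only the standard library. The helper names `build_line_index` and `read_lines` are hypothetical and not part of pypager or prompt_toolkit:

```python
import mmap
import re


def build_line_index(mm):
    """Return the byte offset at which each line of the mapped file starts."""
    offsets = [0]
    # re can scan an mmap directly because it accepts bytes-like objects.
    offsets.extend(m.end() for m in re.finditer(b"\n", mm))
    if offsets[-1] == len(mm):
        offsets.pop()  # the file ends with a newline, so no line starts at EOF
    return offsets


def read_lines(mm, offsets, first, count):
    """Decode only lines [first, first + count) from the mapped file."""
    lines = []
    for i in range(first, min(first + count, len(offsets))):
        end = offsets[i + 1] if i + 1 < len(offsets) else len(mm)
        raw = mm[offsets[i]:end].rstrip(b"\r\n")
        lines.append(raw.decode("utf-8", errors="replace"))
    return lines
```

Usage would look roughly like `mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)` on a file opened in binary mode, then `read_lines(mm, build_line_index(mm), top_line, screen_height)`. A plain Python list of offsets is assumed here for simplicity; for files with a very large number of lines, a more compact container such as `array.array("Q")` would likely be a better fit, and the index could be built incrementally rather than in one pass.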

Syntax highlighting on big files is a challenge, but probably can be done without too much effort. I'm not sure yet how much work it will take to use prompt_toolkit's Lexer or a FormattedTextControl.

I'm not sure about reading large files from stdin.

haydenflinner commented 4 years ago

Awesome, thanks for the pointers. Will take another look at this today. I'm not worried about large files from stdin; if they're too large for memory you probably shouldn't be piping them over stdin 😅