ronisbr / TerminalPager.jl

Pure Julia implementation of the command less

pager seems not to cache things properly for large tables #10

Open bkamins opened 3 years ago

bkamins commented 3 years ago

Steps to reproduce:

pager(rand(10^6, 10))

and it takes 30 seconds to display.

I am not sure if something can be done to fix it, but ideally large tables should get a lazy loading feature.

ronisbr commented 3 years ago

Yes, there is a lot of room for improvement here.

  1. I process everything at the beginning. This step splits the text into lines and stores them. The good thing is that when searching, I can find all the matches. The bad thing is that it can be very slow. I need to think about a good way to do this and process the rows on-the-fly, only when necessary (see the sketch after this list).
  2. For inputs with very, very long lines, it will always be slow (all text editors suffer from this). However, we can improve things if we define a Julian format for text, where the output is created with the rows already split, avoiding this initial processing. It could be an internal format that can be passed to pager.
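
A minimal sketch of what item 1 could look like, assuming the input is already a (possibly huge) string; the names are hypothetical and this is not how TerminalPager currently works:

function screen_lines(str::AbstractString, first_line::Int, nlines::Int)
    # Iterate the lines lazily instead of splitting the whole text up front.
    it = eachline(IOBuffer(str))
    window = Iterators.take(Iterators.drop(it, first_line - 1), nlines)
    return collect(window)
end

# For some large string str, only ~40 lines are ever materialized:
screen_lines(str, 10_000, 40)

This still scans past the skipped lines, so a real implementation would probably also cache the byte offsets of the line starts it has already discovered.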
ronisbr commented 3 years ago

Hum, for this specific example rand(10^6, 10), what is taking so long is the string generation by sprint. Take a look:

julia> str = sprint(show, MIME"text/plain"(), rand(10^6,10));

julia> pager(str)

Hence, this kind of thing should be very difficult to fix. Any ideas?

bkamins commented 3 years ago

My thinking was that the pager should do lazy processing. So if you have a table that has e.g. 10^6 rows and 1000 columns, and you know your terminal size, then you load only part of it (e.g. 10^4 rows and 100 columns, depending on the terminal size) and dynamically load another chunk when the view is moved. I would make the chunk size large enough so that "normal" small tables are still loaded in one shot; then you only get a small hiccup when moving to a page outside the current chunk (a rough sketch of this idea follows the list below).

The only issues are:

  1. I am not sure if it is possible at all with your current design (as I feel - I have not checked the sources - that the pager might not be aware of the row/column structure of the input and has to ingest it in one shot).
  2. if this were done, there would be a slight problem with the freeze feature for DataFrames.jl that I have requested (as it is possible that the width of the "Rows" column would then change).
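
A rough sketch of this chunked idea, with hypothetical names and thresholds (not the package's actual API); the chunk is sized from the terminal, but kept large enough that small tables are still rendered in one shot:

function chunk_limits(term_rows::Int, term_cols::Int;
                      min_rows = 10_000, min_cols = 100)
    # Load many screens worth of data at once, so paging inside the chunk is
    # instantaneous and a new chunk is fetched only when the view leaves it.
    return (max(min_rows, 100 * term_rows), max(min_cols, 10 * term_cols))
end

function render_chunk(A::AbstractMatrix, top::Int, left::Int)
    nrows, ncols = chunk_limits(displaysize(stdout)...)
    rows = top:min(top + nrows - 1, size(A, 1))
    cols = left:min(left + ncols - 1, size(A, 2))
    # Only this window is converted to text.
    return sprint(show, MIME"text/plain"(), view(A, rows, cols))
end

render_chunk(rand(10^6, 10), 1, 1)

Each chunk is formatted independently, which is where the column-width "discontinuity" discussed below comes from.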
ronisbr commented 3 years ago

Hum, your idea is to convert the data to a string on-the-fly, right? I think this can work for some cases, but not for others. For example, for documentation like @doc(pretty_table), I just get a string and there is no information about the size. What I can do is create an internal structure to wrap the types for which I can infer the size and render (convert to string) part by part. If I cannot infer the size, I fall back to the current behavior.

This will be challenging. For Julia matrices, this can work since I can render just a view (a[500:1000, 500:1000]), although the column widths can change depending on the part we are converting, and I am not sure how to handle that. However, for DataFrames, there is no way to render the middle of the table.

I am wondering if we can do this for the rows only, instead of the columns as well. That seems easier.

The ultimate solution is to create an API more or less like Tables.jl for outputs that want to be rendered inside a pager. Something like a file structure that carries important meta information and can possibly be rendered on demand.
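
One possible shape for such an API, purely as a sketch (none of these names exist in TerminalPager today): a source declares its size and knows how to render an arbitrary block of itself on demand.

abstract type AbstractPagerSource end

# Hypothetical interface:
#   pager_size(src)              -> (nrows, ncols), or nothing if unknown
#   render_block(io, src, rows)  -> print the requested rows of src to io

struct MatrixSource{T} <: AbstractPagerSource
    data::Matrix{T}
end

pager_size(src::MatrixSource) = size(src.data)

function render_block(io::IO, src::MatrixSource, rows::UnitRange{Int})
    # Only the requested rows are converted to text; the pager asks for more
    # blocks as the user scrolls.
    show(io, MIME"text/plain"(), view(src.data, rows, :))
end

Sources whose size cannot be inferred (e.g. a plain docstring) would simply not implement the interface and would fall back to the current behavior.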

bkamins commented 3 years ago

although the column widths can change depending on the part we are converting, and I am not sure how to handle that.

You could have some "discontinuity" in the drawing (i.e. a jump because of the changing column width), but I guess it is better than not being able to see the table at all.

However, for DataFrames, there is no way to render the middle of the table.

You can do exactly the same - use view(df, 500:1000, 500:1000).

ronisbr commented 3 years ago

The problem with this view is that the row numbers always start at 1. Hence, I just can't see how to reproduce a true "view" of a DataFrame without knowing that it is a DataFrame :D

bkamins commented 3 years ago

Ah - right :(. Anyway - if you have some ideas about what could be done here (I am sorry for not being more helpful, but I did not have time to dig into TerminalPager.jl internals), please let me know and I will gladly review them (e.g. we could provide an internal function that would properly produce the row numbers).
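
A hedged sketch of that last idea: render a window of the DataFrame but label the rows with their original indices. It assumes PrettyTables' row_labels keyword (the exact keyword name may differ between PrettyTables versions):

using DataFrames, PrettyTables

function show_window(df::AbstractDataFrame, rows::UnitRange{Int})
    # The view renumbers rows from 1, so pass the original indices explicitly.
    pretty_table(view(df, rows, :); row_labels = collect(rows))
end

show_window(DataFrame(rand(10^6, 10), :auto), 500:520)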

rafaqz commented 1 year ago

+1 for this

Currently, trying pager on really large tables hangs, and pressing Ctrl-C to get out kills the session.

ronisbr commented 1 year ago

Yes, I have to think of a clever way to solve this problem...

One very "simple" solution lies outside TerminalPager. We can make PrettyTables render the table with constant column widths, so that it can feed "chunks" of the table to TerminalPager. We will lose the ability to search, but I think it is a good tradeoff. Today, TerminalPager just renders the entire table, which can take a very long time and consume all your memory :(
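
A minimal sketch of the constant-width idea, assuming PrettyTables' columns_width and crop keywords (check the exact names for your PrettyTables version). With fixed widths, each chunk of rows can be rendered independently and the columns still line up across chunks:

using PrettyTables

A = rand(10^6, 10)

# Render only the first chunk; a pager would request further chunks on demand.
chunk = 1:1_000
pretty_table(view(A, chunk, :); columns_width = 12, crop = :none)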

Unfortunately, I did not have time yet to think about a good API for the general case.

bkamins commented 1 year ago

A simple approach is to add a kwarg allowing the user to switch between these two behaviors, and the default value could depend dynamically on the size of the passed table.
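
A sketch of that suggestion with a hypothetical keyword: render = :auto picks the chunked path for large inputs and keeps the current eager path otherwise.

function choose_render_mode(A::AbstractMatrix; render::Symbol = :auto,
                            threshold::Int = 10^5)
    render === :auto || return render
    return length(A) > threshold ? :chunked : :eager
end

choose_render_mode(rand(10^6, 10))   # => :chunked
choose_render_mode(rand(100, 10))    # => :eager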

rafaqz commented 1 year ago

Having the default check the table length somehow sounds good.

I'm happy to lose search, but I am wondering if it could also happen lazily over the original table columns?
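
A sketch of what that lazy search could look like, with hypothetical names: match against the cell values themselves instead of the rendered text, so the full-table string never has to exist.

using DataFrames

function find_matches(df::AbstractDataFrame, needle::AbstractString)
    hits = Tuple{Int, Int}[]   # (row, column) pairs
    for (j, col) in enumerate(eachcol(df)), (i, x) in enumerate(col)
        occursin(needle, string(x)) && push!(hits, (i, j))
    end
    return hits
end

find_matches(DataFrame(rand(10^6, 10), :auto), "0.1234")

This still touches every cell, but it converts one cell at a time and never stores the rendered table.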