ronisbr / PrettyTables.jl

Print data in formatted tables.
MIT License

max_num_of_rows not taking effect in html backend #217

Closed juliohm closed 11 months ago

juliohm commented 1 year ago

Hi @ronisbr , hope you are doing well,

We are facing an issue with the HTML backend because it materializes entire columns even when we pass the option max_num_of_rows. Is there a quick fix or workaround to avoid building the entire columns in this backend?

The text backend is working just fine and we can create very large tables and display just a few rows quickly. The problem is in the creation of the objects, not the display itself.

ronisbr commented 1 year ago

Hi @juliohm !

Everything is fine here, how about you?

Can you provide a MWE? Because this option is working as expected here and also in DataFrames.jl. Take a look at this example:

julia> a = rand(10_000, 10);

julia> @time pretty_table(IOBuffer(), a; tf = tf_html_default, max_num_of_rows = 10)
  0.078409 seconds (350.23 k allocations: 24.639 MiB, 99.56% compilation time)

julia> @time pretty_table(IOBuffer(), a; tf = tf_html_default, max_num_of_rows = -1)
  0.178180 seconds (4.59 M allocations: 228.235 MiB)

PrettyTables.jl uses the same internal handling (a type I call ProcessedTable) in all backends to limit the amount of data that will be rendered.

Could it be something specific to your type that somehow triggers the entire rendering when a cell is rendered with MIME("text/html")?
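One way to narrow this down (a rough sketch; my_table is just a placeholder for your custom type) is to time plain Tables.jl row iteration by itself, independently of any MIME rendering:

using Tables

t = my_table                              # placeholder for the custom type
rows = Tables.rows(t)

@time first(rows)                         # should be cheap if iteration is lazy
@time collect(Iterators.take(rows, 10))   # fetching a handful of rows should also be cheap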

ronisbr commented 1 year ago

The last time I saw something like this, it was caused by the Tables.jl implementation of the type, which led to rendering entire columns. Please take a look here:

https://github.com/ronisbr/PrettyTables.jl/issues/162

juliohm commented 1 year ago

I am doing fine, thanks! :)

Here is a MWE that you can try in Jupyter (I tried it directly in VS Code, which supports notebooks):

using GeoTables

georef((a=rand(1000,1000),))

It takes 43s for me.
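To time only the notebook display path (what Jupyter/VS Code calls), rather than the REPL text path, one can call show with the HTML MIME explicitly; a sketch:

using GeoTables

gt = georef((a = rand(1000, 1000),))

io = IOBuffer()
@time show(io, MIME"text/html"(), gt)   # times only the HTML rendering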

juliohm commented 1 year ago

Here is the precise location of the call:

https://github.com/JuliaEarth/GeoTables.jl/blob/7bc56e5135e48900861d99f4b1a5599bed1eeeb6/src/abstractgeotable.jl#L361-L363

juliohm commented 1 year ago

I think it is our fault! Splatting the _common_kwargs is causing the problem 🤦🏽‍♂️

juliohm commented 1 year ago

But I don't know why. The _common_kwargs function returns very small objects, so it shouldn't be a bottleneck.
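For what it is worth, splatting a small NamedTuple of keyword arguments is essentially free on its own; a minimal check with a made-up function f:

f(; kwargs...) = length(kwargs)

kw = (max_num_of_rows = 10, vcrop_mode = :middle)   # illustrative values only

@time f(; kw...)   # negligible after compilation, so the cost must come from
                   # what the callee does with the keywords, not from the splat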

juliohm commented 1 year ago

@ronisbr can you please confirm the semantics of the option max_num_of_rows? We understood that this option can be used to omit rows in very tall tables to save screen space and avoid materialization of all related objects.

Apparently, if we set the maximum number to 10 and the table has 1000000 rows, no information is shown to indicate the omitted rows:

[screenshot: a table printed with max_num_of_rows = 10, with no indication of the omitted rows]

ronisbr commented 1 year ago

Hi @juliohm !

Yes, this is the expected behavior. There is some room for improvement when using max_num_of_* in the text backend. The biggest issue is that this backend has two ways of limiting the amount of data to be printed: max_num_of_* and the display size. The internal algorithm would have to change significantly to solve those inconsistencies.

The advice is to use max_num_of_* only in the LaTeX and HTML backends, and let the text backend limit the data using the display size.

I will check why it is taking too much time in that function.
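To illustrate the advice, a sketch (assuming the text-backend keywords crop and display_size behave as documented):

using PrettyTables

a = rand(10_000, 10)

# HTML backend: limit the rendered rows explicitly.
pretty_table(IOBuffer(), a; tf = tf_html_default, max_num_of_rows = 10)

# Text backend: let the cropping be driven by the display size instead.
pretty_table(IOBuffer(), a; crop = :both, display_size = (25, 80))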

ronisbr commented 1 year ago

Hi @juliohm !

I understood the problem!

When you use vcrop_mode = :middle, we render the initial and the last rows. To obtain the last rows using the Tables.jl API, we need to iterate over the rows in each column:

https://github.com/ronisbr/PrettyTables.jl/blob/b72c0852289dac7b3bc100aa117de74acdda0fcf/src/tables.jl#L106

This process takes too long for GeoTables. If you change to vcrop_mode = :bottom, everything is very fast.

The vast majority of the time is spent in this function:

https://github.com/JuliaEarth/GeoTables.jl/blob/7bc56e5135e48900861d99f4b1a5599bed1eeeb6/src/abstractgeotable.jl#L129
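In the meantime, a sketch of the workaround, calling pretty_table directly on the geotable from your MWE (assuming vcrop_mode is accepted together with the HTML table format, as in the GeoTables call):

using GeoTables, PrettyTables

gt = georef((a = rand(1000, 1000),))

# Crop at the bottom so only the first rows are ever iterated.
pretty_table(IOBuffer(), gt; tf = tf_html_default, max_num_of_rows = 10, vcrop_mode = :bottom)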

juliohm commented 1 year ago

Thank you @ronisbr ! Implementing the suggestions now. Will report back if the issue is still present.

Feel free to change the title of the issue to a more appropriate name :)

ronisbr commented 1 year ago

Thanks for the report! I think I can greatly improve everything if I keep a cache of the iteration state. I will try to make some modifications here and hope they will not introduce a lot of type instabilities.

My idea is:

  1. Every time we access a row through the Tables.jl row-access API, I will store the iterator state.
  2. If we have a cached state close to the row we want, we resume from it instead of iterating everything.

This algorithm will greatly improve things for this case. However, my concern is that it will allocate a lot more for the other scenarios in which the iteration is much faster.
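A rough sketch of the idea (not the actual PrettyTables internals; all names are illustrative):

# Cache the iterator state every `stride` rows so that a later request for a
# row can resume from the nearest checkpoint instead of from the beginning.
struct RowCache{R}
    rows::R                 # a Tables.jl row iterator
    states::Dict{Int,Any}   # row index => state after producing that row
    stride::Int
end

RowCache(rows; stride = 1_000) = RowCache(rows, Dict{Int,Any}(), stride)

function getrow(c::RowCache, i::Int)
    # Nearest checkpoint strictly before row `i` (0 means "start from scratch").
    start = maximum(filter(<(i), collect(keys(c.states))); init = 0)
    row, state = start == 0 ? iterate(c.rows) : iterate(c.rows, c.states[start])
    pos = start + 1
    # Advance to row `i`, storing checkpoints along the way (empty tables are
    # not handled in this sketch).
    while pos < i
        pos % c.stride == 0 && (c.states[pos] = state)
        row, state = iterate(c.rows, state)
        pos += 1
    end
    return row
end

Storing the states as Any is exactly where the type-instability concern comes in.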

ronisbr commented 1 year ago

Hi @juliohm!

I could reduce the time from several seconds to 0.001 s by assuming that the GeoTables iteration state is the row number. I think we should open an issue in Tables.jl to add a way to declare that a row table's iteration state is the row number, which would drastically reduce the iterations, leading to a substantial gain.
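To illustrate why the hint helps, a sketch with a made-up row iterator FakeRows whose state is simply the index of the next row to produce:

struct FakeRows
    n::Int
end

# The state is the 1-based index of the next row to produce.
Base.iterate(r::FakeRows, i::Int = 1) = i > r.n ? nothing : ((row = i,), i + 1)
Base.length(r::FakeRows) = r.n

rows = FakeRows(1_000_000)

# Knowing the state semantics, the printer can resume directly at the last 10
# rows instead of iterating through the first 999,990.
last10 = collect(Iterators.rest(rows, rows.n - 9))
@assert length(last10) == 10 && last10[end].row == rows.n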

ronisbr commented 1 year ago

By the way, if Tables.jl does not add this hint, I will add a keyword, probably called row_tables_jl_use_row_id_as_state, which will solve the problem for GeoTables!

juliohm commented 1 year ago

Thank you Ronan! That is very helpful! This package is truly amazing and only gets better!


ronisbr commented 11 months ago

I think this issue is solved with the new code that uses Tables.subset, leading to a huge gain when printing with middle cropping.
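For reference, a sketch of the idea with a plain row table (a Vector of NamedTuples), assuming the default Tables.subset method for that representation; sources like GeoTables provide their own:

using Tables

rowtab = [(a = i, b = 2i) for i in 1:1_000_000]
n = length(rowtab)

# Middle cropping only needs the first and last few rows, not the whole table.
head = Tables.subset(rowtab, 1:5)
tail = Tables.subset(rowtab, (n - 4):n)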