hughjonesd / huxtable

An R package to create styled tables in multiple output formats, with a friendly, modern interface.
http://hughjonesd.github.io/huxtable

as_Workbook is slow #77

Closed md0u80c9 closed 4 years ago

md0u80c9 commented 6 years ago

Hi,

Sorry - another really simple suggestion for huxtable.

If the output is going to take more than, say, 2 seconds, would it be possible to introduce a progress indicator (even one which can optionally be disabled)?

Jim Hester's readr is a nice example of how to make one kick in only when the process will take a long time.
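
To illustrate the kind of thing I mean, here's a rough sketch of an optional progress bar around a per-cell loop, using base R's txtProgressBar (cells and write_cell are placeholders, not huxtable's real internals):

# Rough sketch only: an optional progress bar around a per-cell writing loop.
# 'cells' and 'write_cell' are placeholders, not real huxtable internals.
write_all_cells <- function(cells, write_cell, progress = interactive()) {
  pb <- if (progress) utils::txtProgressBar(min = 0, max = length(cells), style = 3)
  for (i in seq_along(cells)) {
    write_cell(cells[[i]])
    if (progress) utils::setTxtProgressBar(pb, i)
  }
  if (progress) close(pb)
  invisible(NULL)
}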

hughjonesd commented 6 years ago

I think a better answer would be to make it work faster. What computer is it taking a long time on?

md0u80c9 commented 6 years ago

Well, that sounds even better! I've not done anything complex, and I'm using a 2015 MacBook Pro, so the hardware is reasonable. But converting a 28x133 huxtable seems to take a surprisingly long time. I can send you the sample table if you like (just let me know where, and I can do that later today) in case it is something funny with the table (it shouldn't be, as I don't think I'm doing anything unusual).

hughjonesd commented 6 years ago

Can you post a MWE?

md0u80c9 commented 6 years ago

I can send you the table - it is fairly minimal (though clearly size itself is a factor affecting processing speed). Not sure I can 'minimise' it more than it is and still understand what is slow compared to what is expected.

hughjonesd commented 6 years ago

The end product isn't useful, but the code producing it would be. Better still would be a simple example of a huxtable that takes a long time to turn into a workbook.

md0u80c9 commented 6 years ago

So the huxtable itself is actually simple - there is no formatting or anything.

It's being created from a tibble (which is built by a fairly complex package, as each single result is an aggregated statistic), but the output should be a 28x133 huxtable consisting almost entirely of floating-point values. We do transpose the huxtable itself using t() before rendering (not sure how much that adds).

I'll try in the next couple of days to test whether I can demonstrate this by just faking the tibble.

hughjonesd commented 6 years ago

No worries, I can reproduce:

library(huxtable)
tmp <- as_hux(matrix(rnorm(20 * 100), 20, 100))
system.time(as_Workbook(tmp))

md0u80c9 commented 6 years ago

Fab - does that run slowly for you too?

hughjonesd commented 6 years ago

Yup, takes about 20 seconds on my laptop. Clearly not good enough. I'll see what I can do, but it may not be till the holidays. The basic problem is that each cell is added individually. That's fair enough, because huxtables can contain arbitrary data.

Surely for you the obvious question is, why are you using huxtable if you aren't using any of the formatting features? Why not just use openxlsx and get a 1000x speedup?

library(openxlsx)

microbenchmark::microbenchmark({
  tmp <- createWorkbook()
  addWorksheet(tmp, 'sheet1')
  writeData(tmp, 'sheet1', matrix(rnorm(28 * 133), 28, 133))
}, times = 100)

Unit: milliseconds
 expr: ...
      min       lq     mean   median       uq      max neval
 23.18982 24.38674 28.83979 25.17363 26.09416 70.91882   100

md0u80c9 commented 6 years ago

Oh, that's an easy answer: because I will use the formatting - just not yet in this project. I have used openxlsx in another part of the project, and it works well - until someone asks you to render the same table in another format as well, and you have to write two formatters.

I did wonder whether at some point there would be benefit in looking at the writexl package (https://github.com/ropensci/writexl), but it doesn't look like it has seen active development for a while, and my main project is absorbing enough time that I doubt I'll be able to enhance any of its dependencies for a little while (though I'm building up a nice list of things that 'could' be done!).

hughjonesd commented 6 years ago

I get it.

hughjonesd commented 6 years ago

Best strategy is probably to write data by columns. Columns are more likely to share the same type (beyond a header row) and to be long.
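
Roughly this kind of thing - an untested sketch using openxlsx directly (write_by_columns is just for illustration, not the real as_Workbook internals):

library(openxlsx)

# Untested sketch: one writeData() call per column instead of one per cell.
# Styling would still have to be applied separately.
write_by_columns <- function(wb, sheet, df) {
  for (j in seq_along(df)) {
    writeData(wb, sheet, df[[j]], startCol = j, startRow = 1, colNames = FALSE)
  }
  invisible(wb)
}

wb <- createWorkbook()
addWorksheet(wb, "sheet1")
write_by_columns(wb, "sheet1", as.data.frame(matrix(rnorm(28 * 133), 28, 133)))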

md0u80c9 commented 6 years ago

We should know what the type definition of a column is (or of a row, if transposition just sets a flag to write columns rather than actually transposing the data).

Another option may be to group columns of similar data types together, write them all in a single operation, and then rearrange them afterwards: the rearranging should be cheap.

hughjonesd commented 6 years ago

The basic issue is how to call writeData en masse when huxtables can contain arbitrary data (e.g. a column of numbers with a character heading will be character). I'm on holiday now, but patches welcome if you want to have a go at it. At the moment it is very crude, just a for loop.
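
To make that concrete (a toy example; assuming add_colnames behaves the way I remember):

library(huxtable)

# A numeric column plus a character header row gets stored as character,
# so a naive column-wise writeData() would write text, not numbers.
ht <- as_hux(data.frame(x = 1:3), add_colnames = TRUE)
class(ht$x)   # "character": the header row forces the whole column to character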

md0u80c9 commented 6 years ago

I’m on holiday too! Did a bit of work on my project to avoid the midday sun! If I get time to play with the code, I will do. Can’t promise though as the project I’m working on is behind schedule itself 🙁

hughjonesd commented 6 years ago

Try github master and report any speedup/issues.

md0u80c9 commented 6 years ago

Just tried it now - doesn't seem to be any better speed-wise. Just checking I got the correct repo:

Downloading GitHub repo hughjonesd/huxtable@master
from URL https://api.github.com/repos/hughjonesd/huxtable/zipball/master
hughjonesd commented 6 years ago

That's right. Check with packageVersion, though. It should end in 9000. Could you report the results of:

microbenchmark::microbenchmark(as_Workbook(as_hux(matrix(rnorm(1000), 10, 100))), times = 10)

with github master and with 4.1.0?

hughjonesd commented 6 years ago

Code is churning a bit, so wait.

hughjonesd commented 6 years ago

Try now.

md0u80c9 commented 6 years ago

Before:

Unit: seconds
                                               expr      min       lq    mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 10, 100))) 1.619138 1.688668 1.75137 1.719496 1.789264 2.068814    10

After:

Unit: seconds
                                               expr      min       lq     mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 10, 100))) 1.579386 1.588832 1.656722 1.621283 1.673724 1.934853    10

hughjonesd commented 6 years ago

Odd. Mine are:

Before:

                                               expr      min       lq     mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 100, 10))) 4.473533 4.707627 5.235132 4.940469 5.200724 7.803099    10

After:

                                               expr      min       lq     mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 100, 10))) 1.566315 1.678421 1.882421 1.725201 2.092803 2.378984    10

On that basis I'm going to keep the changes. You may find they help you with your many-columns case also.

md0u80c9 commented 6 years ago

Had a good look at the code this morning, having returned from holiday. The main reason for the slowness will be the for loops going over each cell - so the result is code which is difficult to scale.

If we vectorise a few of the functions, we should be able to get the code to operate on the entire huxtable at once, which should offer significant speedups.

Will have a look at some code chunks and see if I can help with that, and I'll push some changes to you as PRs.

hughjonesd commented 6 years ago

You're right in general - this is what I have been doing with to_latex and to_html. But bear in mind that huxtables can have formatting etc. in arbitrary cells, while openxlsx::addStyle takes only a vector of rows and columns to specify a rectangle. The ideal strategy would be to identify rectangles with similar styles, but this could be quite messy. My current change just memoises createStyle.

A quicker win might be to figure out the difference between openxlsx::writeData with numeric and character data. The ideal would be to call writeData once with the whole huxtable, then apply different styling to cells where is_a_number returns TRUE. This would be simpler than the current rather ugly kludge, and probably faster too.
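
Something like this rough sketch, perhaps (the "looks numeric" test below just stands in for is_a_number, and none of this is what as_Workbook currently does):

library(openxlsx)

# Rough sketch: write all the (character) cell contents in one call, then
# apply number styling only to the cells that look numeric.
contents <- matrix(as.character(rnorm(28 * 133)), 28, 133)

wb <- createWorkbook()
addWorksheet(wb, "sheet1")
writeData(wb, "sheet1", contents, colNames = FALSE)

looks_numeric <- matrix(suppressWarnings(!is.na(as.numeric(contents))), nrow(contents))
numeric_cells <- which(looks_numeric, arr.ind = TRUE)
num_style <- createStyle(halign = "right", numFmt = "0.000")
addStyle(wb, "sheet1", num_style,
         rows = numeric_cells[, "row"], cols = numeric_cells[, "col"])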

md0u80c9 commented 6 years ago

That's a good idea.

clean_contents could be vectorised more easily, which would save a fair chunk of time. format_numbers isn't far off that - it should eventually be callable with purrr::map2 or something like that rather than the double for loop.

Can't quite follow what na_string is doing in this context - and whether it can be merged into format_numbers (i.e. we already detect NAs, so act on them differently there). I thought I'd got rid of it, but then one of the unit tests failed, so I haven't been able to do so stably :-(.

md0u80c9 commented 6 years ago

I've uploaded a PR to change the nested if in format_numbers() to use a set of small generic functions, which looks like it helps a bit (and is also neater). I suspect we could fold all of format_numbers into the generic functions and do away with the function altogether - but I haven't managed to do that in a way which passes all the unit tests yet (a stupid error, or failing to factor something in, I'm sure).

hughjonesd commented 6 years ago

na_string is simply a property, see ?na_string. It allows users to replace NAs in the data with something appropriate: "--" or "N/A".
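
For example:

library(huxtable)

ht <- as_hux(data.frame(x = c(1, NA, 3)))
na_string(ht) <- "--"   # NA cells now display as "--" in every output format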

md0u80c9 commented 6 years ago

So could we use the replace function over the whole lot rather than checking each one individually?

hughjonesd commented 6 years ago

Yes! na_string(ht) will give a matrix of strings for the whole huxtable.
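
So inside clean_contents it could be roughly (sketch only - "contents" here means the character matrix of cell values that clean_contents builds):

# Sketch: replace all NA cells in one vectorised step instead of per cell.
nas <- is.na(contents)
contents[nas] <- na_string(ht)[nas]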

hughjonesd commented 6 years ago

By the way, you might want to take a look at TODO.md. It has most plans and/or ideas - some should probably be issues.

md0u80c9 commented 6 years ago

Just tried vectorising the first half of clean_contents, but I'm triggering a number of unit test failures - not quite sure what is failing.

The code insert is as follows (with the intent of losing the double for loop):

  contents <- as.matrix(as.data.frame(ht))

  contents <- mapply(format_numbers,
                     string = contents,
                     num_fmt = number_format(ht))
  for (col in seq_len(ncol(contents))) {
    for (row in seq_len(nrow(contents))) {
      cell <- contents[row, col]
#      num_fmt <- number_format(ht)[[row, col]] # a list element, double brackets
#      if (! is.na(cell)) cell <- format_numbers(cell, num_fmt)

The test failures mainly look like:

test-attributes.R:11: warning: Cell property attr top_border examples unchanged
first element used of 'length.out' argument

Any thoughts?

md0u80c9 commented 6 years ago

Made a further PR: I haven't managed to completely remove those loops (I am sure it's doable but I keep breaking things when I change the remaining lines).

Haven't written down the test results as I'm using my iMac this evening but it has definitely had a non-trivial impact on the benchmark test.

hughjonesd commented 4 years ago

Closing.