md0u80c9 closed this issue 4 years ago.
I think a better answer would be to make it work faster. Come on, how long is it taking? -- Sent from Gmail Mobile
Well that sounds even better! I've not done anything complex, and am using a 2015 MacBook Pro so reasonable equipment. But converting a 28x133 huxtable seems to take a surprisingly long time. I can send you the sample table if you like (just let me know where and I can do that later today) in case it is something funny with the table (shouldn't be as I don't think I'm doing anything unusual).
Can you post an MWE?
I can send you the table - it is fairly minimal (though clearly size itself is a factor affecting processing speed). Not sure I can 'minimise' it more than it is and still show what is slow compared to what is expected.
The end product isn't useful, but the code producing it would be. Better still would be a simple example of a huxtable that takes a long time to turn into a workbook.
So the huxtable itself is actually simple - there is no formatting or anything.
It's being created from a tibble (which is formed by a complex package, as each single result is an aggregated statistic), but the output should be a 28x133 huxtable consisting of almost all floating-point values. We do transpose the huxtable itself using t() before rendering (not sure how much that adds).
I'll try in the next couple of days to test whether I can demonstrate this by just faking the tibble.
No worries, I can reproduce:
tmp <- as_hux(matrix(rnorm(20*100), 20, 100))
system.time(as_Workbook(tmp))
Fab - does that seem to go slow?
Yup, takes about 20 seconds on my laptop. Clearly not good enough. I'll see what I can do but it may not be till holidays. The basic problem is that each cell is added individually. That's fair, because huxtables can contain arbitrary data.
Surely for you the obvious question is, why are you using huxtable if you aren't using any of the formatting features? Why not just use openxlsx and get a 1000x speedup?
microbenchmark::microbenchmark({
tmp <- createWorkbook()
addWorksheet(tmp, 'sheet1')
writeData(tmp, 'sheet1', matrix(rnorm(28*133),28,133))
}, times = 100)
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
  ... 23.18982 24.38674 28.83979 25.17363 26.09416 70.91882   100
Oh that's an easy answer: because I will use the formatting - just not yet in the project. I have used openxlsx in another part of the project - and it works well. Until someone asks you to then also render the same table in another format and you have to write two formatters.
I did wonder whether at some point there would be benefit in looking at the writexl package (https://github.com/ropensci/writexl), but it doesn't look like it's seen active development for a while. My main project is absorbing enough time that I doubt I'll be able to enhance any of its dependencies for a little while (though I'm building up a nice list of things that 'could' be done!).
I get it.
Best strategy is probably to write data by columns. Columns are more likely to share the same type (beyond a header row) and to be long.
We should know what the type definition of a column is (or of a row, if transposition just sets a flag to write columns rather than actually transposing the data).
Another option may be to group columns of similar data types together: write all the columns in a single operation and then rearrange them afterwards; the rearrange should be a cheap operation.
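The grouping idea above can be sketched in plain R. This is illustrative only: `column_type_runs()` is a made-up helper name, not part of huxtable, and it just finds runs of adjacent columns that share a type, so each run could in principle go to openxlsx in one bulk write.

```r
# Sketch only: find runs of adjacent columns that share a type, so each
# run could be written with a single bulk write. column_type_runs() is an
# invented helper, not huxtable code.
column_type_runs <- function(df) {
  types <- vapply(df, function(col) class(col)[1], character(1))
  runs  <- rle(unname(types))   # collapse adjacent equal types into runs
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(type = runs$values, from = starts, to = ends,
             stringsAsFactors = FALSE)
}

df <- data.frame(a = 1:3, b = rnorm(3), c = letters[1:3], d = LETTERS[1:3],
                 stringsAsFactors = FALSE)
column_type_runs(df)
# one run per type change: integer, numeric, then a two-column character run
```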
The basic issue is how to call writeData en masse when huxtables can contain arbitrary data (e.g. a column of numbers with a character heading will be character). I'm on holiday now, but patches welcome if you want to have a go at it. At the moment it is very crude, just a for loop.
I’m on holiday too! Did a bit of work on my project to avoid the midday sun! If I get time to play with the code, I will do. Can’t promise though as the project I’m working on is behind schedule itself 🙁
Try github master and report any speedup/issues.
Just tried it now - doesn't seem to be any better speed wise. Just checking I got the correct repo:
Downloading GitHub repo hughjonesd/huxtable@master
from URL https://api.github.com/repos/hughjonesd/huxtable/zipball/master
That's right. Check with packageVersion(), though. It should end in 9000. Could you report the results of:
microbenchmark::microbenchmark(as_Workbook(as_hux(matrix(rnorm(1000), 10, 100))), times = 10)
with github master and with 4.1.0?
Code is churning a bit so wait
Try now.
Before:

Unit: seconds
                                              expr      min       lq    mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 10, 100))) 1.619138 1.688668 1.75137 1.719496 1.789264 2.068814    10

After:

Unit: seconds
                                              expr      min       lq     mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 10, 100))) 1.579386 1.588832 1.656722 1.621283 1.673724 1.934853    10
Odd. Mine are:

Before:

                                              expr      min       lq     mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 100, 10))) 4.473533 4.707627 5.235132 4.940469 5.200724 7.803099    10

After:

                                              expr      min       lq     mean   median       uq      max neval
 as_Workbook(as_hux(matrix(rnorm(1000), 100, 10))) 1.566315 1.678421 1.882421 1.725201 2.092803 2.378984    10
On that basis I'm going to keep the changes. You may find they help you with your many-columns case also.
Had a good look at the code this morning, having returned from holiday. The main reason for the slowness will be the for loops iterating over each cell - the result is code which scales poorly.
If we vectorise a few of the functions, we should be able to get the code just to operate on the entire huxtable which should offer significant speedups.
Will have a look at some code chunks and see if I can help with that and I'll push some changes as PRs to you.
You're right in general - this is what I have been doing with to_latex and to_html. But bear in mind that huxtables can have formatting etc. in arbitrary cells, while openxlsx::createStyle takes only a vector of rows and columns to specify a rectangle. The ideal strategy would be to identify rectangles with similar styles, but this could be quite messy. My current change just memoises createStyle.
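The memoisation just described can be sketched in base R. Note the names are stand-ins: `make_style` represents an expensive constructor like openxlsx::createStyle, and `memoise1` is a hand-rolled cache (the memoise package offers the same idea off the shelf); this is not huxtable's actual code.

```r
# Sketch: cache results of an expensive constructor, keyed by its arguments.
# memoise1() and make_style() are illustrative stand-ins, not huxtable code.
memoise1 <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = " ")
    if (!exists(key, envir = cache)) assign(key, fn(...), envir = cache)
    get(key, envir = cache)
  }
}

n_calls <- 0
make_style <- function(bold, num_fmt) {   # stand-in for openxlsx::createStyle
  n_calls <<- n_calls + 1
  list(bold = bold, num_fmt = num_fmt)
}

make_style_m <- memoise1(make_style)
s1 <- make_style_m(TRUE, "0.00")
s2 <- make_style_m(TRUE, "0.00")          # identical arguments: cache hit
```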
A quicker win might be to figure out the difference between openxlsx::writeData with numeric and character data. The ideal would be to call writeData once with the whole huxtable, then apply different styling to cells where is_a_number returns TRUE. This would be simpler than the current rather ugly kludge, and probably faster too.
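As a rough sketch of that check (my own guess at an implementation, not huxtable's actual is_a_number), a vectorised test over a character matrix could look like:

```r
# Sketch: vectorised "does this cell parse as a number?" over a character
# matrix, keeping the matrix shape so results can be indexed by [row, col].
is_a_number <- function(x) {
  res <- suppressWarnings(!is.na(as.numeric(x)))
  dim(res) <- dim(x)   # as.numeric() drops dims; restore them
  res
}

cells <- matrix(c("1.5", "N/A", "-3", "header"), nrow = 2)
is_a_number(cells)
# TRUE for "1.5" and "-3", FALSE for "N/A" and "header"
```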
That's a good idea.
clean_contents could be vectorised more easily, which would save a fair chunk of time. format_numbers isn't far off that - it should eventually be callable with purrr::map2 or something like that, rather than the double for loop.
Can't quite follow what na_string is doing in this context - and whether it can be merged into format_numbers (i.e. we already detect NAs, so act on them differently). I thought I'd got rid of it, but then one of the unit tests failed, so I haven't been able to do so stably :-(.
I've uploaded a PR to change the nested if in format_numbers() to use a set of small generic functions which looks like it helps a bit (also is neater). I suspect we could fold all of format_numbers into the generic functions and do away with the function altogether - but haven't managed to do that in a way which passes all the unit tests yet (stupid error or failing to factor something in I'm sure).
na_string is simply a property, see ?na_string. It allows users to replace NAs in the data with something appropriate: "--" or "N/A".
So could we use the replace function over the whole lot rather than checking each one individually?
Yes! na_string(ht) will give a matrix of strings for the whole huxtable.
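That replacement can indeed be done in one vectorised step. A minimal sketch, with a hard-coded "--" matrix standing in for what na_string(ht) would return:

```r
# Sketch: replace every NA cell at once, using a matrix of replacement
# strings (hard-coded here; in huxtable this would come from na_string(ht)).
contents   <- matrix(c("1.2", NA, "3.4", NA), nrow = 2)
na_strings <- matrix("--", nrow = nrow(contents), ncol = ncol(contents))

nas <- is.na(contents)
contents[nas] <- na_strings[nas]
contents
# NAs are now "--"; no per-cell loop needed
```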
By the way, you might want to take a look at TODO.md. It has most plans and/or ideas - some should probably be issues.
Just tried vectorising the first half of clean_contents, but I'm triggering a number of unit test failures - not quite sure what is wrong.
The code insert is as follows (with the intent of losing the double for loop):

contents <- as.matrix(as.data.frame(ht))
contents <- mapply(format_numbers,
                   string  = contents,
                   num_fmt = number_format(ht))
# replaces the old double loop:
# for (col in seq_len(ncol(contents))) {
#   for (row in seq_len(nrow(contents))) {
#     cell <- contents[row, col]
#     num_fmt <- number_format(ht)[[row, col]] # a list element, double brackets
#     if (! is.na(cell)) cell <- format_numbers(cell, num_fmt)
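One plausible cause of failures with this pattern (a guess, shown here with formatC() standing in for huxtable's format_numbers()): mapply() simplifies its result to a plain vector, so the matrix dimensions are lost and later [row, col] indexing misbehaves. Restoring the dims afterwards is cheap:

```r
# Sketch: mapply() returns a plain vector, dropping the matrix shape.
# formatC() stands in for huxtable's format_numbers() here.
contents <- matrix(c(1.234, 5.678, 9.1, 2.3), nrow = 2)
out <- mapply(function(x, digits) formatC(x, digits = digits, format = "f"),
              contents, 2)
dim(out) <- dim(contents)   # put the dims back so out[row, col] still works
out
```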
The test fails mainly look like:
test-attributes.R:11: warning: Cell property attr top_border examples unchanged
first element used of 'length.out' argument
Any thoughts?
Made a further PR: I haven't managed to completely remove those loops (I am sure it's doable but I keep breaking things when I change the remaining lines).
Haven't written down the test results as I'm using my iMac this evening but it has definitely had a non-trivial impact on the benchmark test.
Closing.
Hi,
Sorry - another really simple suggestion for huxtable.
If the output is going to take more than, say, 2 seconds, would it be possible to introduce a progress indicator (even one which can be optionally disabled)?
Jim Hester's readr is a nice example of how to make one kick in only when the process will take a long time.
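For what it's worth, that kick-in behaviour could be sketched like this in base R. write_cells() and its arguments are invented for illustration (not huxtable's API); utils::txtProgressBar is real.

```r
# Sketch: show a progress bar only once the projected total time passes a
# threshold. write_cells() is an invented example, not huxtable's API.
write_cells <- function(n_cells, write_one, threshold_secs = 2) {
  t0 <- Sys.time()
  pb <- NULL
  for (i in seq_len(n_cells)) {
    write_one(i)
    elapsed   <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    projected <- elapsed / i * n_cells
    if (is.null(pb) && projected > threshold_secs) {
      pb <- utils::txtProgressBar(min = 0, max = n_cells, style = 3)
    }
    if (!is.null(pb)) utils::setTxtProgressBar(pb, i)
  }
  if (!is.null(pb)) close(pb)
  invisible(!is.null(pb))   # TRUE if a bar was shown
}

shown <- write_cells(10, function(i) NULL)   # fast job: no bar appears
```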