Heatmap raster output options

jokergoo / ComplexHeatmap

Make Complex Heatmaps

https://jokergoo.github.io/ComplexHeatmap-reference/book/

Other

1.31k stars 230 forks source link

Heatmap raster output options #688

Closed jmw86069 closed 3 years ago

jmw86069 commented 3 years ago

I have been using EnrichedHeatmap and ComplexHeatmap quite a bit -- love them both, A+ fantastic!

I have been hitting similar issues as reported for rasterization (for example the tiny white lines; very long render time with large data/panels). Imo this is by far the best heatmap package in R, so I hope this issue does not sound like a complaint at all! If anything, I'm hoping there is some setting or update that might improve my use cases. And I'm willing to do some testing/coding if it helps.

I suspect you already have experience here and may have better ideas, or comments about why certain options are not viable.

My use case has around 30+ heatmaps with 500 columns and 20,000 rows each. To save one PDF file takes about 45 minutes, produces a file whose size depends upon raster settings, between about 100Mb and 900Mb. I have tested the rasterization without magick, with magick, with various settings for raster quality (non-magick) and different magick settings. (I have not tested extensively since #682, but after a few tests the output seemed same as before.)

It feels like it shouldn't take 45 minutes - but to be fair this is still faster than a lot of other options, so... :)

It also seems that output file size is a bit large - maybe the saved resolution is much higher than needed.

I also tested grid.raster(..., interpolate=TRUE) as a drop-in replacement for grid.rect() in Heatmap-draw_component.R and found the results visually quite good (see image below). Unfortunately this test produces the largest file size (900 Mb), but also the highest quality heatmap. I'm investigating what is happening, but think PDF output causes the raster x/y dimensions to be abnormally high -- essentially storing the full raster inside each heatmap panel.

Anyway the image on the right shows substantially more detail than the left -- which motivates me to prefer any option that preserves the details on the right.

My questions:

Did you decide against grid.raster(..., interpolate=TRUE) for similar reasons as I observed?
Do you think interpolate=TRUE may be beneficial for your "magick" workflow?

Below on left is with "magick" rasterization, right is grid.raster(..., interpolate=TRUE)`, showing two of the 30+ heatmap panels.

jokergoo commented 3 years ago

Thanks for the post! To be honest, I haven't tested on such huge matrices. All my decision on the rasterization parameters are based on my "smaller matrices". Also, 100M file size sounds too big...

Also for grid.raster(..., interpolate=TRUE), sorry I haven't tested that, so it is not because I am against it :)

I will make more tests and let you known once I have some results.

Do you still see white lines with recent versions of ComplexHeatmap? These also come from some rasterization methods. I thought I have fixed it...

And, would mind to share one of your matrix so that I can directly test on that?

jokergoo commented 3 years ago

And of course, the right figure is better than the left one!

jmw86069 commented 3 years ago

Thanks much for the interest! I am putting together a test set that reproduces the images above, and am making sure I know what steps produce which result. ;)

jmw86069 commented 3 years ago

Okay I'll try to attach the file and what code I'm using to test.

load("testmatrices_17feb2021.RData")
ehm <- EnrichedHeatmap::EnrichedHeatmap(nmatrix,
   name="diff_1",
   col=col_hm,
   pos_line=FALSE,
   use_raster=TRUE,
   raster_by_magick=TRUE,
   split=isplit)

ehm2 <- EnrichedHeatmap::EnrichedHeatmap(nmatrix2,
   name="diff_2",
   col=col_hm,
   pos_line=FALSE,
   use_raster=TRUE,
   raster_by_magick=TRUE,
   split=isplit)

cairo_pdf("test_ehm_raster_magick_17feb2021.pdf",
   height=10, width=3.5, onefile=TRUE)
system.time(draw(ehm + ehm2)) # 66 seconds for two panels
system.time(dev.off()) # 0.03 seconds (no time)
# 229 Kb file

Then another with raster_by_magick=FALSE. Then another with use_raster=FALSE - the file is 120 Mb. Then another with my grid.raster(..., interpolate=TRUE) variation... I coded it to run in the use_raster=FALSE workflow. I think it is rendering every pixel as a raster, then the Adobe viewer is making it look sharp.

Hmmm... Github will only accept a file 10 Mb or less... Let me find somewhere to park it and post a link. Sorry first time posting big file to an issue!

jmw86069 commented 3 years ago

Okay let's see if this works! https://www.dropbox.com/s/v569pg0s7h3mt2r/testmatrices_17feb2021.RData?dl=0

This should link to one file "testmatrices_17feb2021.RData" which contains a few objects referenced above: "nmatrix", "nmatrix2", "col_hm", "isplit", "maroon_gold"

Let me know if this does not work and I'll try another method. And of course no rush! :)

jmw86069 commented 3 years ago

I have been doing a bit more testing of methods, outputs, details, etc.

The grid.raster(..., interpolate=TRUE) when using PDF output appears to store every pixel in the raster file. It is unclear to me how it determines the output raster resolution, buried inside some C code in grid.

When using PNG output, the resolution seems to honor the res argument. So with PNG output, I can set a sufficiently high res=600 and it produces a much larger and much more detailed PNG output file; the default res=72 (for most systems) produces a smaller and therefore less detailed output.

So using grid.raster(..., interpolate=TRUE) does not seem to let us define what output resolution should be.

One alternative I am testing now is raster::resample() which does bilinear interpolation, and appears to be really fast. My guess is that this is an efficient alternative, then calling grid.raster(). At that point I don't know if it helps or hurts to use interpolate=TRUE - it should not matter.

Here is the working interpolation wrapper function I was testing:

#' Resize numeric matrix using raster resample
#' 
#' @param x `numeric` matrix input
#' @param ncol,nrow `integer` values to define output columns and rows
#' @param method `character` string with the `raster::resample()` method,
#'    where "bilinear" uses bilinear interpolation, and "ngb" uses
#'    nearest neighbor.
#' @param ... additional arguments are ignored
#' 
resize_by_raster <- function
(x,
 ncol,
 nrow,
 method=c("bilinear", "ngb"),
 ...)
{
   method <- match.arg(method)

   if (ncol(x) == ncol && nrow(x) == nrow) {
      return(x)
   }
   # extent boundary - required but has no effect on output
   ex <- extent(c(0, 20, 0, 20))
   r_target <- raster::raster(ncol=ncol, nrow=nrow, ext=ex)

   r_small <- raster::raster(ex,
      vals=x,
      nrow=nrow(x),
      ncol=ncol(x))
   r_out <- raster::resample(x=r_small,
      y=r_target,
      method=method)
   r_out_matrix <- as.array(r_out)[,,1];
   return(r_out_matrix)
}

The next challenge is determining the appropriate values for ncol and nrow - but this is similar to some of your existing code. Anyway, in my limited testing, this output seems to perform better than ImageMagick options, maybe because I can adjust the level of detail.

jokergoo commented 3 years ago

Hi, I am having a new laptop so now I can work on this issue :)

First, regarding to interpolate in grid.raster(), to my understanding from the paper "Raster Images in R Graphics" (https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Murrell.pdf), interpolate only works when the raster matrix is smaller than the area where the image is to be put. Let's denote the RGB matrix for the raster image as M1 and the area where the raster image is to be put as M2, if dimension of M1 is smaller than M2, every pixel in M1 will occupy multiple pixels in M2, in this case, interpolation can be applied so that rasterization to M2 can be more smooth.

Following is the example from the paper I mentioned. In the example, the raster matrix is 10x10, and the area for putting the raster image is far larger than 10x10. On the left is with interpolate = FALSE and on the right is with interpolate = TRUE, so you can see the difference here:

The second example is also from that paper. Here the R logo has size of 76x100, as you can see, interpolation makes the raster image more smooth for the edges of the image. On the left is with interpolate = FALSE and on the right is with interpolate = TRUE.

The case is reversed in Heatmap() where M1 has larger dimension than M2, and in this case, rasterization actually does not work. See the test from your data. Left: interpolate = FALSE, right: interpolate = TRUE. There are no difference between the two figures.

Also as I have tested, the plot you marked as "before" is more like setting raster_by_magick = FALSE and the plot you marked as "after" is more like setting raster_by_magick = TRUE. See the test I've made:

For the package raster, I don't whether it outperforms package magick for some of the resizing methods. magick already has many resizing methods available. The resizing methods can be obtained via magick::filter_types() (explanation for the algorithm is here: https://legacy.imagemagick.org/Usage/filter/) and the value can be set with raster_magick_filter argument in Heatmap().

Code for testing:

cairo_pdf("test.pdf", height=10, width=4, onefile = TRUE)
Heatmap(nmatrix, row_order = order(enriched_score(nmatrix)),
    cluster_columns = FALSE, col = col_hm,
    show_row_names = FALSE, show_column_names = FALSE,
    use_raster = TRUE, raster_by_magick = FALSE,
    column_title = "raster_by_magick = FALSE")

Heatmap(nmatrix, row_order = order(enriched_score(nmatrix)),
    cluster_columns = FALSE, col = col_hm,
    show_row_names = FALSE, show_column_names = FALSE,
    use_raster = TRUE, raster_by_magick = TRUE,
    column_title = "raster_by_magick = TRUE")
dev.off()

jmw86069 commented 3 years ago

Thank you much for investigating! The post below is some of my findings -- mostly to say "I will post examples soon" and thank you for your patience and help!

I also have dug deeper, I think my initial test had some artifact as you pointed out. Best I could tell, my first image was rasterized using the non-Magick method, the second image was in fact not rasterized at all. That explains why the file size was so enormous - it was literally storing every cell. The improved appearance seemed to be the result of Adobe Acrobat rendering with its own internal smoothness.

Aside: I did not know grid.raster() only enlarges and never shrinks! Wow that's so wild, I appreciate you finding that! (I could never track down the specific code being called. Maybe somewhere deep in the source code of R itself.)

I have been using graphics::rasterImage(..., interpolate=TRUE) for years with base R plotting, specifically because does both shrink and enlarge. That was my whole motivation - for heatmap data with more rows than screen pixels (or print pixels). I had test sets showing before/after, the reduction in PDF file size, revealing hidden signal with interpolation that was lost otherwise, the whole bit. Before posting this issue here, I thought I compared rasterImage() with grid.raster(..., interpolate=TRUE) and thought at the time they were working the same way... maybe I was deceived by Adobe Acrobat? :) I need to revisit my tests. Since rasterImage() and grid.raster() got the interpolate=TRUE update at the same time, and in the same accompanying blog post, I really though they worked the same way. My mistake, no worries. It's a shame though, haha.

So... my newer testing: It took much effort/time to get ImageMagick with dev libraries installed on our server. (Surprisingly hard, not sure why) Only then could I install the magick R package and test your newer rasterization properly.

Good news, quality is much improved - it looks similar to your post. File size much lower, 2 Mb pdf files compared to 120 Mb previously. These results make sense now. I will post proper images for comparison, sorry bout the delay!

The one remaining "issue" is that it takes 40 minutes to produce one figure. Granted... it is a lot of data! Ha ha. Yes it has 35 coverage heatmaps (!), 20k rows, 500 columns each... so maybe this is just the way it is for time being. I don't mind 40 minutes for "known settings". But iterating changes... 40 minutes at a time... adjusting signal ranges, y-axis settings, color scales... then wait another 40 minutes... is not my favorite. (It's still far better than anything else. Much respect and thanks!)

So I subset to 10% rows for testing, takes about 4 minutes - this isn't too rough. I get close to ideal settings. Then using every row, I still need some adjustments, but honestly not too bad. I'm curious about the rate-limiting steps.

For kicks, I did a few tests with rasterImage() and for just direct plotting a 20k row 1000 column matrix, it renders in about 1 second. I will post reproducible test case (to confirm I remember correctly.) Thanks for your patience! When I thought rasterImage() was equivalent to grid.raster(), I got excited that maybe this step could have a big speed increase, and maybe the round-trip through magick was somehow the slow step? My other theory is that since I was testing with R-3.6.1 (for project reasons) - the grid units step is much slower than R 4.0.

jmw86069 commented 3 years ago

Okay I have been making coverage heatmaps off and on over the months, I can't believe my last comment was February! Sorry for taking so long! Ha ha.

My most recent workflow has reasonable speed, with the following data:

15,000 rows
400 columns
20 EnrichedHeatmaps
5 row splits defined upfront
row order defined upfront (manual call to EnrichedHeatmap::enriched_score())
use_raster=TRUE, raster_quality=1, raster_by_magick=TRUE - I don't think raster_quality is active here.

It takes about 9 minutes to create a PDF file width=36, height=12, and it seems to scale about linearly with number of rows, plotting 25,000 rows took about 20 minutes, file size was 2MB as expected. I can work with 9 minutes - I usually run a couple tests with 10% rows to get settings correct, then run the full set later.

Oh, and image quality is really beautiful. It's so configurable, and looks amazing.

The most important step was to install the "magick" R package - which was not easy on our managed linux servers for some reason. I accidentally ran the same script on a different server that didn't share the library path, magick silently failed to load, and the output took forEVER. I don't think I let it finish, I killed the process and changed servers. ;)

Thanks so much for all the updates! I'll watch the other issue you opened about making Heatmap() faster. I'm sure it's possible during the rasterization step - but probably not that easy! Plug it into Unreal Engine. ;)

I'll close this issue, and thank you again for your patience!

jokergoo commented 3 years ago

Thanks also for your comments! They are useful!