waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
69 stars 32 forks source link

optimization of .harmonize #297

Closed cvanderaa closed 3 years ago

cvanderaa commented 3 years ago

Hello Marcel, @LiNk-NY

I'm working with a MultiAssayExperiment object that contains many assays and I realized that any call to experiments<- always takes a lot of time, while working on ExperimentList object is very fast. I saw that a lot of the computational time is eaten up by .harmonize. I here have a mock example that illustrates this, where an assay is replaced by the same assay on the data set I am working on:

## Get the data set
BiocManager::install("scpdata")
library(scpdata)
mae <- specht2019v3()
mae <- as(mae, "MultiAssayExperiment")

## Test
library(microbenchmark)
el <- experiments(mae)
microbenchmark(experiments(mae)[[1]] <- experiments(mae)[[1]],
               el[[1]] <- experiments(mae)[[1]],
               MultiAssayExperiment:::.harmonize(el,
                                                 colData(mae),
                                                 sampleMap(mae)),
               times = 1)
Unit: milliseconds
                                                                expr         min          lq        mean      median          uq         max neval
                      experiments(mae)[[1]] <- experiments(mae)[[1]] 18337.06081 18337.06081 18337.06081 18337.06081 18337.06081 18337.06081     1
                                    el[[1]] <- experiments(mae)[[1]]    65.29469    65.29469    65.29469    65.29469    65.29469    65.29469     1
 MultiAssayExperiment:::.harmonize(el, colData(mae), sampleMap(mae)) 18547.99283 18547.99283 18547.99283 18547.99283 18547.99283 18547.99283     1

This is probably not the best way to assess this, but the point is that .harmonize is taking a dramatic proportion of the time. After some profiling of the code, I saw that the mendoapply chunk is taking around 85% of the workload.

In this PR I suggest a small modification that improves the speed of .harmonize by 3 or 4x. I however can't run your tests because I can't figure out how to get the example data within the tests... I could successfully run the harmonization tests manually.

Would this be useful to you? Do you see another way to optimize the speed even more, without loss of consistency in the data? I think some steps of the harmonization could be skipped if .harmonize is not called from within a subsetting function, but I'm not sure about that. I would love to have your input on this.

cvanderaa commented 3 years ago

The PR does not pass CI, but it seems the errors are not related to my changes...

LiNk-NY commented 3 years ago

Hi Chris, @cvanderaa

Thanks for looking into this. I will take a closer look later today. The first and last operations in the microbenchmark test are both using .harmonize and the middle one is faster because it does no checking and could lead to the wrong answer if you replace the list element with an arbritary value, e.g.,

el[[1]] <- matrix()

Best, Marcel

cvanderaa commented 3 years ago

The first and last operations in the microbenchmark test are both using .harmonize and the middle one is faster because it does no checking

Exactly, that's what I meant! You can see that when using experiments<-, over 90% of the time is spent in .harmonize, and therefore I was willing to reduce that time as much as possible. I think I found a shortcut in the algorithm to spare some time without losing data validity, and this is what I suggest in this PR.

Thanks a lot for looking into this!

LiNk-NY commented 3 years ago

Hi Chris, @cvanderaa I had a closer look at the PR and unfortunately, it is giving the wrong answer. subset_assay will return only the "unchanged" experiments rather than the complete ExperimentList. I did add a small performance increase on the master branch: 1309679 I will close it for now. Thanks! Best, Marcel

cvanderaa commented 3 years ago

Oh sorry for the bug, but thanks for the commit!