statdivlab / radEmu

Analysis 'stalls' - running for 7 hours #69

Closed: sminot closed this issue 1 month ago

sminot commented 1 month ago

This may not be a bug report as much as a request for guidance on how to best run the tool.

I've attached the dataset and analysis so that you can reproduce this on your end. The analysis was run inside a Docker image built from the most recent commit of radEmu (https://quay.io/repository/hdc-workflows/rademu?tab=tags&tag=8235cf0); the repo with the Dockerfile used to build it is https://github.com/FredHutch/docker-radEmu.

Attachments: log.txt, metadata.csv, taxonomy.csv, counts.csv

The script with all of the commands that I ran is in [run.txt](https://github.com/user-attachments/files/16241324/run.txt). It references the input files counts.csv, metadata.csv, and taxonomy.csv. The logs produced so far are in log.txt.
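For context, here's a minimal sketch of the kind of call run.txt makes (the covariate name `group` is a placeholder here, not the real column in metadata.csv; the actual script is attached above):

```r
library(radEmu)

# Read the attached inputs; the first column is assumed to hold row names
counts <- read.csv("counts.csv", row.names = 1)      # samples x features
metadata <- read.csv("metadata.csv", row.names = 1)

# Fit the radEmu model and run a score test for every feature.
# `group` is a hypothetical covariate name standing in for the real one.
fit <- emuFit(formula = ~ group,
              data = metadata,
              Y = as.matrix(counts),
              run_score_tests = TRUE)
```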

One aspect of this dataset which may be the issue is that there is a large number of features (species) in the analysis (n=683), which outnumbers the observations (n=385). If you think that this is the core issue, I could try more aggressive filtering of the features.

sminot commented 1 month ago

Quick update on this ticket -- when I checked this morning I saw that it was still stalled.

Since the extremely large number of features relative to the number of observations seemed like it might be the issue, I'm trying the same run again with an upstream filter that excludes any feature with >90% zeros.
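Concretely, the filter is along these lines (a sketch; `counts` here stands for the samples x features matrix from counts.csv):

```r
# Proportion of zero counts per feature
prop_zero <- colMeans(counts == 0)

# Keep only features that are nonzero in at least 10% of samples
counts_filtered <- counts[, prop_zero <= 0.9]
```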

svteichman commented 1 month ago

Hi @sminot,

I've started to take a look at this example. I added the verbose = TRUE flag, which lets me see some of the information about the estimation and score test algorithms as they run. I've had it running on my laptop, and while estimation doesn't take too long (I think ~10-20 minutes), it took over 5 hours for the first score test to finish running. (I updated some of the convergence criteria to be slightly less conservative than the defaults, so it's possible this first test would've taken even longer with the default tolerances.)
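For reference, the shape of the call I'm running is below; the `tolerance` argument and its value are illustrative rather than the exact set of settings I adjusted:

```r
# verbose = TRUE prints progress for estimation and for each score test
fit <- emuFit(formula = ~ group,    # placeholder covariate name
              data = metadata,
              Y = as.matrix(counts),
              run_score_tests = TRUE,
              verbose = TRUE,
              tolerance = 1e-3)     # loosened convergence tolerance (illustrative)
```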

It makes sense to me that with a dataset this size, each score test takes a long time to run. I don't think the code is necessarily stalled, but I agree that it is infeasible to run when it is this slow, especially with ~600 tests. I think that the high level of sparsity in the dataset could be slowing it down, so I imagine that the upstream filter will speed it up both because there will be fewer features each time and the dataset will be less sparse.

@adw96 While I can't fully rule out some kind of bug in one of the algorithms, I think this could just be an example in which estimation under the null for the score test is particularly slow to converge, which I also saw in the HDW analysis with a high level of sparsity (in that analysis, we had ~400 taxa and ~60 samples and the mean time for a score test to run was ~2 hours).

@sminot this is not immediately helpful, but we have a project in the works to approximate the score tests, which will make an analysis like this much faster to run.

svteichman commented 1 month ago

One thing that I can do is tinker with the hyperparameters for the estimation under the null algorithm to see if I can find a combination that will make the score tests run faster, although I suspect that this will only have a moderate effect on the overall speed of the tests.

sminot commented 1 month ago

The impression I get from your inspection of this dataset is that the sparsity of the measurements plays a strong role, as well as the number of features involved. For my part, I'm thinking it might be worth starting my tests from the other end of the stringency spectrum: begin with a very stringent filter that, e.g., excludes anything with >75% zeros, and then relax the filter from there after seeing how things run.
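To pick a starting point, a quick scan like this shows how many features would survive each threshold (a sketch, with `counts` again as the samples x features matrix):

```r
prop_zero <- colMeans(counts == 0)

# Number of features retained at each zero-proportion cutoff
thresholds <- c(0.50, 0.75, 0.90)
sapply(thresholds, function(t) sum(prop_zero <= t))
```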

As for the project to approximate the score tests: if you think it will be available before publication, I'd be more than happy to repeat the analysis at that point.

svteichman commented 1 month ago

The number of features has the largest effect on run time, because each score test fits the full radEmu model under the null hypothesis, and the radEmu model has $p \times J$ parameters, where $p$ is the number of columns in the design matrix (in your case 2) and $J$ is the number of features. In general, the more parameters to fit, the longer the test takes to run. My intuition on how sparsity affects the runtime is a little shakier: I've noticed that datasets with higher sparsity take longer to run, but I don't have a great understanding of the mechanism behind that, or of how true it is in general beyond the few cases where I've observed it.
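To make that concrete for your dataset: with $p = 2$ and $J = 683$, each fit under the null estimates $p \times J = 1366$ parameters, and a null fit is run for every one of the ~683 score tests. Filtering to, say, 200 features would cut each null fit to 400 parameters and also reduce the number of tests.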

Our faster approximation of the radEmu score tests can be found here, and I'm happy to set up a script that replicates your run.txt file but with fastEmu instead of radEmu. The major difference between the methods is that while radEmu by default compares log fold change parameters to the approximate median log fold change over all categories, fastEmu needs a small subset of categories (in your case, features) to compare to, in order to fit a smaller and faster model.

For this we can either use a random subset of features (with the hope that the median differential abundance over that subset approximates the median differential abundance over the whole set of features), or choose a subset that you expect to vary less in differential abundance. For example, when running a differential abundance analysis for genes, we compare to the set of genes that encode ribosomal proteins, with the idea that this set will have differential abundances that vary less than those of genes with arbitrary functions.
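As a sketch of the first option, picking a random comparison subset could look like the following (indices only; I'm not showing how the set is passed to fastEmu, since that's covered in its documentation):

```r
set.seed(1)

# Sample 50 feature indices to serve as the comparison set, with the hope
# that the median log fold change over this subset approximates the median
# over all features. The subset size (50) is arbitrary here.
reference_set <- sample(ncol(counts), 50)
```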

adw96 commented 1 month ago

Hi @sminot -- thanks again for letting us know about this, and I agree with @svteichman's great insights here. Did we fully address your issue/question? I'd love to close this if so.

sminot commented 1 month ago

Yes, I think so! Thank you for following up.