nvpatin / MAMBO

A program to analyze eDNA metabarcoding sequence data from multiple trophic levels to identify drivers and patterns of biodiversity. This program treats read counts as draws from a beta distribution and uses a Bayesian regression model to link principal components of distinct data sets. Inspired by work at the 2023 NOAA/NCAR Hackathon.
MIT License

Run time very long for large data set #2

Open nvpatin opened 6 months ago

nvpatin commented 6 months ago

I tested a larger 16S/18S data set and it was taking a very long time to run even the first PCA replicate (hadn't finished after about an hour). The original 16S data set was 7K observations of 131 variables and 18S was 10K observations, while this new set was about 67K observations with 473 variables (25K observations for 18S).

Is there any way to add a multithreading option at least for the PCA calculations?

EricArcher commented 6 months ago

For clarity: the PCA itself took over an hour to run? The base PCA function (prcomp()) doesn't have multithreading capabilities. I'm not certain there are parallelizable components of a PCA. If this is really an issue, we may have to find another solution like subsampling for the PCA. How was the data set larger? More ASVs, samples, or both?
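To make the subsampling idea concrete, here is a minimal sketch, not MAMBO's code: it estimates the first principal component from a random subset of rows and then projects every row onto it. The function names (`first_pc`, `subsampled_pc_scores`) and the parameter `n_sub` are hypothetical, and power iteration stands in for the decomposition that `prcomp()` performs. Pure Python is used only to keep the example self-contained; a real data set would need a linear algebra library.

```python
import math
import random


def first_pc(rows, rng, n_iter=200):
    """Return (column means, unit vector of the first principal component).

    Uses power iteration on X^T X, where X is the column-centered data.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Random start so we are not accidentally orthogonal to the answer.
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return means, v


def subsampled_pc_scores(rows, n_sub, seed=0):
    """Fit the first PC on a random subset of rows, then score all rows."""
    rng = random.Random(seed)
    subset = rng.sample(rows, min(n_sub, len(rows)))
    means, v = first_pc(subset, rng)
    scores = [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]
    return scores, v


# Toy data lying exactly on the line y = 2x (hypothetical, for illustration).
data = [[float(x), 2.0 * x] for x in range(100)]
scores, direction = subsampled_pc_scores(data, n_sub=20)
```

The fitting cost then scales with `n_sub` rather than with the full number of rows, at the price of a noisier estimate of the component. The sign of the returned direction is arbitrary, as with any PCA.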

nvpatin commented 6 months ago

I tried again and took a closer look; it seems like it's the Bayesian model that is actually taking a long time, not the PCA itself (see attached screenshot). [Screenshot 2024-03-29 at 1 40 43 PM]

The larger data sets had more of both: 67K ASVs in 473 samples for 16S (vs 7K ASVs in 131 samples) and 25K ASVs vs 10K ASVs for 18S.

EricArcher commented 6 months ago

Phew! That's expected. I'm working on a different project, parts of which have Bayesian model components similar to this one. Both use Bernoulli (0/1) switches, which I couldn't find a way to code in Stan, even though Stan is considerably faster. Luckily, I've found someone who knows how to do it, and he'll be walking me through it soon. Once I've learned how, I'll recode this in Stan as well, which should make a big difference.
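For context on why the 0/1 switches are awkward: Stan's samplers work on continuous parameters only, so a common workaround is to sum each discrete switch out of the likelihood. The sketch below illustrates that general technique in Python. It is not necessarily the approach planned here, and the Normal likelihood and the names `mu_on`, `mu_off`, and `theta` are assumptions chosen for illustration.

```python
import math


def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    if m == -math.inf:
        return -math.inf
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def marginal_loglik(y, theta, mu_on, mu_off, sigma):
    """log p(y) with the switch z ~ Bernoulli(theta) summed out.

    p(y) = theta * N(y | mu_on, sigma) + (1 - theta) * N(y | mu_off, sigma)
    Requires 0 < theta < 1.
    """
    return log_sum_exp(
        math.log(theta) + normal_logpdf(y, mu_on, sigma),
        math.log1p(-theta) + normal_logpdf(y, mu_off, sigma),
    )


def switch_posterior(y, theta, mu_on, mu_off, sigma):
    """P(z = 1 | y), recovered afterwards from the same two terms."""
    log_on = math.log(theta) + normal_logpdf(y, mu_on, sigma)
    return math.exp(log_on - marginal_loglik(y, theta, mu_on, mu_off, sigma))


# Hypothetical values: an observation exactly halfway between the two means.
p_on = switch_posterior(y=0.5, theta=0.5, mu_on=1.0, mu_off=0.0, sigma=1.0)
```

In a Stan program the same two-term sum is usually written with the built-in `log_sum_exp` or `log_mix` functions, and the posterior probability of each switch can be computed afterwards, as `switch_posterior` does here.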

nvpatin commented 6 months ago

Sounds good! I'll leave this issue open for now.