This is quite the extreme application. Personally, I think the solution should be based on a slight re-write of the logic instead. Rather than reading in all the files at once as the current version does, which is of course memory intensive, only one or two files need to be read in at a time for the purpose of aggregation, since rolling averages and summations of the information can be used instead. This would incur a slight performance hit compared to simply reading all the files in at once, but at least RAM would no longer be a factor.
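A minimal sketch of that rolling-summation idea, assuming each saved file holds a data frame of per-replication results with identical columns (the directory name and the columns `est1`/`est2` are placeholders, not SimDesign API):

```r
# Illustrative sketch: build overall means by reading one results file at a
# time instead of loading everything into memory at once.
files <- list.files("sim-results", pattern = "\\.rds$", full.names = TRUE)

total_n   <- 0
total_sum <- NULL
for (f in files) {
  res  <- readRDS(f)                                   # one file in memory at a time
  sums <- colSums(res[, c("est1", "est2")], na.rm = TRUE)
  total_sum <- if (is.null(total_sum)) sums else total_sum + sums
  total_n   <- total_n + nrow(res)
  rm(res); gc()                                        # free memory before the next file
}
overall_means <- total_sum / total_n
```

More elaborate summaries (e.g. variances) can be accumulated the same way with running sums of squares.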
What you are proposing as a solution is what I suggested in #35. However, summarizing before aggregating is not always ideal. For example, if outlier estimates are a concern, it is useful to have the raw estimates at hand when analyzing the results. I think both selectively reading designs or statistics and summarizing before aggregation would be useful features.
Yes, my use case is extreme. But it is something that can happen when answering reviewer comments: let's say that the reviewer wants to see two additional conditions in the simulation. Running this as a fractional factorial is not a desirable option, because the reviewer might be used to seeing full factorial designs and be skeptical of fractional factorial ones. Accommodating a reviewer request thus means that the size of the simulation increases at least 9-fold (3 * 3). Also, because one might not be sure what statistics the reviewer wants to see, it is sometimes better to err on the side of caution and save a lot of things from each simulation run. 300GB is extreme, but tens of GB have occurred multiple times in my work.
> For example, if outlier estimates are a concern, it is useful to have the raw estimates at hand when analyzing the results. I think both selectively reading designs or statistics and summarizing before aggregation would be useful features.
As a design principle, if outliers are a concern then they should be addressed much earlier in the workflow than at the end of a simulation experiment. Either increase the number of replications to remove the effect of outlying observations, or provide better flags to throw out such outliers early on in the `Analyse()` step. Easier said than done, I know, but having to sift through exorbitant numbers of replicates in search of unusual cases post-hoc is inherently cumbersome and highly error prone.
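A minimal sketch of what such a flag could look like, assuming SimDesign's usual behaviour of treating an error thrown in `Analyse()` as a failed replication that triggers a fresh data draw; the analysis model and the cut-off of 10 are purely illustrative:

```r
# Sketch: flag implausible estimates inside Analyse() so that the replication
# is discarded early rather than sifted out post-hoc.
Analyse <- function(condition, dat, fixed_objects = NULL) {
  fit <- lm(y ~ x, data = dat)          # assumed analysis model
  est <- coef(fit)["x"]
  if (abs(est) > 10)                    # crude, illustrative outlier flag
    stop("estimate outside plausible range; discard this replication")
  c(est = est)
}
```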
> Accommodating a reviewer request thus means that the size of the simulation increases at least 9-fold (3 * 3). Also, because one might not be sure what statistics the reviewer wants to see, it is sometimes better to err on the side of caution and save a lot of things from each simulation run. 300GB is extreme, but tens of GB have occurred multiple times in my work.
An 8-fold increase would be problematic from a practical perspective (I assume the initial simulation would be a subset of the added condition combinations), though I'm still looking at this problem as something the simulation designer will have to deal with on a per-simulation basis. I was initially thinking about this in terms of a collapsing-across-distributed-replications perspective (a la issue #33), in which case even here the problem is almost entirely avoidable: point to a subset of files that collapse to only one design condition, then repeat this across the n sets of files until all design conditions have been collapsed and saved to their own files. After that, collapsing these into a complete simulation design object would be the next logical step for post-analysis.
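A rough sketch of that per-condition collapsing, under the assumption that filenames encode the design condition and that the aggregation function can be pointed at an explicit vector of files (the `files` and `filename` argument names here are assumptions, not the documented API):

```r
# Collapse one design condition at a time to keep memory use bounded.
all_files <- list.files("sim-results", pattern = "\\.rds$", full.names = TRUE)

# assume filenames encode the condition, e.g. "cond03-node12.rds"
conds <- unique(sub("-node\\d+\\.rds$", "", basename(all_files)))

for (cond in conds) {
  cond_files <- grep(cond, all_files, value = TRUE, fixed = TRUE)
  aggregate_simulations(files = cond_files,
                        filename = paste0(cond, "-collapsed"))
}
# afterwards, the per-condition files can be combined into one complete object
```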
I see what you are saying about storing the simulation results becoming a problem, but this is largely a hardware stress concern that comes with the decision to store complete sets of simulation results in the first place. Storing the raw, pre-summary estimates should generally be a desirable feature of simulations, but not necessarily a strict requirement, so if storing the full results becomes unwieldy then summaries of the results should become the pragmatic target (lest terabytes of storage be used that are rarely if ever inspected). This could indeed be a problem if something is missed later on, of course, but at some point one has to pick one's poison when projects grow too large.
> As a design principle, if outliers are a concern then they should be addressed much earlier in the workflow than at the end of a simulation experiment. Either increase the number of replications to remove the effect of outlying observations, or provide better flags to throw out such outliers early on in the `Analyse()` step. Easier said than done, I know, but having to sift through exorbitant numbers of replicates in search of unusual cases post-hoc is inherently cumbersome and highly error prone.
Ideally outliers would be identified in the `Analyse()` function, but as you say it is easier said than done. In practice, this boils down to spending one's time thinking about a good exclusion rule before running the simulation or spending computer time after running the simulation.
In my work, I have followed Boomsma's recommendation:
"Q-Q plots, along with Tukey’s box plots, can also efficaciously exhibit deviating tails and outliers in sampling distributions. Researchers should always testify their attention for the potential presence of outlying observations and nonsymmetric distributions; for instance, when means, variances or standard deviations, and correlations are used as descriptive performance statistics. When needed, robust statistical procedures or estimators should be applied (see, e.g., Wilcox, 2003), and reported accordingly. Gold and Bentler (2000) analyzed generated data with and without outliers. Fan, Thompson, and Wang (1999) checked for disturbing effects of outliers, and exemplary reported: “we did not notice any conspicuous outliers, so all sample estimates were included in the calculation except nonconvergent samples and samples with improper solutions” (p. 68)."
Boomsma, A. (2013). Reporting Monte Carlo studies in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 20(3), 518–540. https://doi.org/10.1080/10705511.2013.797839
The box plot criterion is easy to automate. For example, here is what we did in one simulation study "Then, following Boomsma (2013), we used the box-plot method to identify outliers in the interaction effect estimates. Proceeding with one estimator-design combination at a time, we flagged all estimates that were more than three interquartile ranges below the first quartile or above the third quartile as outliers. We then deleted all replications that contained at least one outlier estimate, as in the case of non-convergent replications."
Lonati, S., Rönkkö, M., & Antonakis, J. (2024). Specification tests for distributional misspecifications in latent interaction models. Psychological Methods, forthcoming.
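A minimal sketch of that box-plot rule; the data frame `res` and its estimate column are placeholders for one estimator-design combination:

```r
# Flag estimates more than three interquartile ranges below the first quartile
# or above the third quartile, then drop every replication that contains at
# least one flagged estimate.
flag_outliers <- function(x, k = 3) {
  q   <- quantile(x, c(.25, .75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

est_cols <- c("b_interaction")                       # assumed estimate column(s)
flags <- vapply(res[est_cols], flag_outliers, logical(nrow(res)))
res_clean <- res[rowSums(flags) == 0, ]              # keep outlier-free replications
```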
Whether to store summaries, a subset of estimates, or everything possible is also a question of money. The electricity costs for a large simulation on a cluster can be substantial. For example, the largest simulation that I have run took 6 core-years. Assuming 10W per core, this translates to more than 500 kWh of electricity just for the cores. When you include cooling and other costs, a single run of the simulation can easily exceed the cost of a hard drive with a few terabytes of capacity. In this case, it might make sense to simply store everything that you might possibly need in the future, to avoid rerunning the expensive simulation.
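For reference, a quick back-of-the-envelope check of that figure:

```r
# 6 core-years at roughly 10 W per core
core_years     <- 6
watts_per_core <- 10
kwh <- core_years * 365.25 * 24 * watts_per_core / 1000
kwh   # about 526 kWh, before cooling and other overhead
```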
Added `select` to `aggregate_simulations()`, though this is currently only allowed to be a character vector (the lazy evaluation syntax of `dplyr::select()` is convoluted to work with in secondary functions, though this isn't really a problem given that `grep()` functions on the `SimExtract(, what = 'results')` column names). For the subsetting portions, pointing to the relevant filenames in isolation should suffice.
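A sketch of how the character-vector `select` could be combined with `grep()`; the objects `sim` and `my_files` and the `bias.` column prefix are assumptions for illustration:

```r
# Pick result columns by name, then aggregate only those columns.
res_names <- colnames(SimExtract(sim, what = 'results'))
keep <- grep("^bias\\.", res_names, value = TRUE)     # e.g. keep only bias columns

combined <- aggregate_simulations(files = my_files, select = keep)
```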
`aggregate_simulations` currently works by aggregating all results from many results files. This may not work well with very large simulations. For example, I have a simulation that produced around 300GB of replication data. (In this case, computation was expensive and therefore it made sense to save a lot of different statistics from each replication.) Loading these results with `aggregate_simulations` would not work because the computer runs out of memory.

I propose adding three more arguments to `aggregate_simulations` to handle this kind of scenario. The `filter` and `select` arguments, modeled after the dplyr package, would allow the users to choose which conditions and which result variables to load. This would be useful, for example, when producing figures or tables that each need just a subset of the full simulation results.
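A purely hypothetical call illustrating what the proposed interface might look like (this is not an existing API; the condition and column names are made up):

```r
# Load only some design conditions and only some result columns.
results <- aggregate_simulations(
  files  = list.files("sim-results", full.names = TRUE),
  filter = quote(N == 200 & method == "ML"),   # keep only matching conditions
  select = c("bias.b1", "RMSE.b1")             # keep only these result columns
)
```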