nipraxis-summer-2023 / diagnostics-bold_but_better

Project evaluation #32

effigies opened 11 months ago

effigies commented 11 months ago

Summary

Overall, this project was well-done. There is evidence of research into existing methods, as well as experimentation with novel approaches. The code is parameterized to make further experimentation and reuse possible.

The process for selecting the best outlier detection method seems interesting in principle but flawed in execution. I outline the problems and some approaches to deal with them below.

I have made various other suggestions and comments that I hope will be helpful for future projects.

Documentation

The README is well-organized. The main things missing are:

1) Instructions for installation. Attempting to follow the instructions from a clean Python environment fails, and users need to figure out how to install the package and/or dependencies. (pip install . works, so it's not hard. Just don't assume that your users know that.)

2) A clear description of what the outlier detection methods are. DVARS is well-understood, but IQR and Z-score are fairly general concepts, so a bit more description of how they are applied would be helpful.

The description of the method selection is a bit spare. It simply says:

The General Linear Method (GLM) is then applied with convolved hemodynamic response function as activation model. From the GLM model, Mean Root Sum of Squares (MRSS) is calculated, before and after removing outliers detected by each method. The method that shows the biggest reduction in MRSS is then selected as the method of choice.

This is one of the most substantial pieces of the project, but we are not told what minimizing MRSS is supposed to buy us and at what cost. The mean residual sum of squares is used to calculate the denominator of a t-statistic, but this is a per-voxel measure. Aggregating across voxels does not have an obvious interpretation, at least to me. Below, I have some further discussion about the effects I believe this will have.
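For context, the standard per-voxel GLM t-statistic for a contrast vector $\mathbf c$ has MRSS under the square root of its denominator:

$$ t = \frac{\mathbf c^\top \hat{\boldsymbol\beta}}{\sqrt{\mathrm{MRSS} \cdot \mathbf c^\top (\mathbf X^\top \mathbf X)^{-1} \mathbf c}} $$

so shrinking MRSS inflates the magnitude of $t$ at every voxel; it is the aggregation across voxels that lacks a clear interpretation.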

It would also be nice to have a summary of the overall results. Is one detector consistently best, or does it seem essentially random? Do they largely agree on outliers, or are they pretty much non-overlapping?

Methods review

The outlier_detection_methods.md is a nice review. It would help to link to it from the README.

Script execution

The script runs as expected, taking several minutes. Verbosity flags are helpful for tracking progress, and the --show flag is a nice idea for getting a sense of the findings for each BOLD file.

I was unable to detect any visual difference between the pre- and post-removal slices that were plotted, so it's not entirely clear what I was supposed to be getting out of this visualization. Plots that show numeric differences, appropriately scaled, might serve better than side-by-side plots. Some artifact detection algorithms have shown the per-volume measure of interest as a line graph, with outlier volumes marked with vertical lines or shaded regions (example), or with a threshold as a horizontal bar (example).
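As a rough illustration of that style of plot (the metric values, spike locations and threshold below are simulated, not taken from the project):

import numpy as np
import matplotlib.pyplot as plt

# Simulated per-volume metric with a few artificial spikes
rng = np.random.default_rng(0)
metric = rng.normal(1.2, 0.1, size=160)
metric[[40, 41, 97]] += 0.6
threshold = metric.mean() + 1.96 * metric.std()
outliers = np.flatnonzero(metric > threshold)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(metric, label="per-volume metric")
ax.axhline(threshold, color="red", linestyle="--", label="threshold")
for vol in outliers:
    # Shade each flagged volume so it stands out against the time course
    ax.axvspan(vol - 0.5, vol + 0.5, color="orange", alpha=0.3)
ax.set_xlabel("volume")
ax.set_ylabel("metric value")
ax.legend()
plt.show()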

Detector implementations

DVARS: DVARS is implemented correctly, and when the DVARS value between two consecutive volumes exceeds 1.96 standard deviations above the mean DVARS, the detector flags both volumes. Bounds checks are used to ensure that indices do not hit fencepost errors.
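For readers unfamiliar with the metric, a minimal sketch of that logic (the function name and the bare 4D-array interface are assumptions, not the project's code):

import numpy as np

def dvars_outliers(data, n_std=1.96):
    # data: 4D array with time as the last axis
    diffs = np.diff(data, axis=-1)
    # Root-mean-square of voxelwise change between each pair of consecutive volumes
    dvars = np.sqrt(np.mean(diffs.reshape(-1, diffs.shape[-1]) ** 2, axis=0))
    threshold = dvars.mean() + n_std * dvars.std()
    outliers = set()
    for i in np.flatnonzero(dvars > threshold):
        # dvars[i] compares volumes i and i + 1; flag both sides of the jump
        outliers.update((i, i + 1))
    return sorted(outliers)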

Z-Score: The Z-score detector is a straightforward test for whole-volume mean intensities. Volumes with intensities more than two standard deviations away from the global mean are identified as outliers.
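A comparable sketch of the volume-mean z-score rule, under the same assumptions about the input and approximating the described behaviour rather than reproducing the project's code:

import numpy as np

def zscore_outliers(data, n_std=2.0):
    # One mean intensity per volume
    vol_means = data.reshape(-1, data.shape[-1]).mean(axis=0)
    z = (vol_means - vol_means.mean()) / vol_means.std()
    # Volumes more than n_std standard deviations from the global mean
    return np.flatnonzero(np.abs(z) > n_std)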

IQR detector: The IQR implementation is a little unusual. Unlike the Z-score, it is applied to individual voxels instead of whole volumes. A volume is identified as an outlier if 5% or more of its voxels are outliers. The properties of such a detector are not obvious to this reviewer, and no explanation is given for choosing this approach over taking the IQR of volume mean intensities.
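My reading of that description, as a sketch; only the 5% voxel cutoff is stated, so the 1.5 × IQR fences here are an assumption:

import numpy as np

def iqr_outliers(data, iqr_scale=1.5, voxel_fraction=0.05):
    # Work per voxel: each voxel's time course gets its own IQR fences
    voxels = data.reshape(-1, data.shape[-1])  # (n_voxels, n_volumes)
    q1, q3 = np.percentile(voxels, [25, 75], axis=-1, keepdims=True)
    iqr = q3 - q1
    outside = (voxels < q1 - iqr_scale * iqr) | (voxels > q3 + iqr_scale * iqr)
    # Flag a volume when at least voxel_fraction of its voxels fall outside their fences
    return np.flatnonzero(outside.mean(axis=0) >= voxel_fraction)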

Evaluation/selection of detectors

This project attempted a quantitative evaluation of possible outliers. The approach was to select the method that most reduced the mean residual sum of squares when detected outliers were removed before fitting a GLM with a predicted hemodynamic time course. This was implemented in this code section:

# compare MRSS between volumes, original and filtered
MRSS = []
for vol, resid in [(X, E), (X_filtered, E_filt)]:
    # Residual sum of squares
    RSS = np.sum(resid ** 2)
    # Degrees of freedom: n - no independent columns in X
    df = vol.shape[0] - npl.matrix_rank(vol)
    # Mean residual sum of squares
    MRSS.append(RSS / df)

and corresponds to:

$$ \mathrm{MRSS} = \frac{1}{n - \operatorname{rank}(\mathbf X)} \sum (\mathbf Y - \hat{\mathbf Y})^2 $$

This is the variance estimate that enters the denominator of the t-statistic, here summed over all voxels.

Using this approach to determine the "best" detector will create two tendencies. On net, minimizing this metric should prefer measures that increase the magnitude of t-statistics by suppressing the denominator. However, if most voxels do not correlate with the predicted time course, it may simply prefer the detector that removes the most volumes. It is not obvious which of these tendencies will dominate (it seems likely to depend on the signal-to-noise ratio in the data), but neither seems good for detecting true outliers. Inflating t-statistics will increase the probability of false positives, while removing more volumes than necessary will reduce statistical power, increasing the probability of false negatives. (If there is a theoretical reason that minimizing MRSS balances these two tendencies, it is not presented here.)

Leaving aside the question of whether this model selection method will achieve the desired goal, selecting the detector on a per-run basis is prone to overfitting. The appropriate method is to split the data into a training and a validation set. The training set may be used to determine the detector that best reduces MRSS; once selected, the validation set may be used to measure the reduction of MRSS on unseen data. This avoids the double-dipping problem and allows you to estimate how well you expect your method to do on further unseen data.
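A minimal sketch of such a split, assuming a list of runs and a scoring function that returns the MRSS reduction of a given detector on a given run (runs, detectors and mrss_reduction are placeholders, not names from the project):

import numpy as np

def split_and_select(runs, detectors, mrss_reduction, train_fraction=0.8, seed=0):
    # Shuffle the runs once and split into training and validation sets
    order = np.random.default_rng(seed).permutation(len(runs))
    n_train = int(train_fraction * len(runs))
    train = [runs[i] for i in order[:n_train]]
    valid = [runs[i] for i in order[n_train:]]

    # Pick the detector with the best mean MRSS reduction on the training runs
    train_scores = {name: np.mean([mrss_reduction(det, run) for run in train])
                    for name, det in detectors.items()}
    best = max(train_scores, key=train_scores.get)

    # Report its reduction on unseen runs to estimate generalization
    held_out = np.mean([mrss_reduction(detectors[best], run) for run in valid])
    return best, held_out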

Cross-validation is a more systematic way of using the training set to estimate the expected performance on the validation set before selecting the final method. With cross-validation, models (here, detectors) and parameterizations of those models (here, thresholds) can be included in the loop and evaluated.
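And a rough cross-validated version of the same selection, with detectors and thresholds both in the loop (built on the same placeholder names; in practice scikit-learn's KFold could handle the fold bookkeeping):

import itertools
import numpy as np

def cv_select(runs, detector_factories, thresholds, mrss_reduction, n_folds=5):
    # detector_factories: {name: callable(threshold) -> detector}; all placeholders
    # assumes len(runs) >= n_folds so no fold is empty
    indices = np.arange(len(runs))
    held_out_scores = []

    def mean_score(make_detector, thresh, idx):
        det = make_detector(thresh)
        return np.mean([mrss_reduction(det, runs[i]) for i in idx])

    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        # Select the (detector, threshold) pair on the training folds only
        best = max(
            ((name, make, thresh)
             for (name, make), thresh in itertools.product(
                 detector_factories.items(), thresholds)),
            key=lambda c: mean_score(c[1], c[2], train_idx))
        # Score that choice on the held-out fold
        held_out_scores.append(mean_score(best[1], best[2], test_idx))

    # The average held-out score estimates how well the selection procedure generalizes
    return float(np.mean(held_out_scores))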

Code style

The code is well-documented, with docstrings for all methods.

There are at times excessive comments, and dead code is commented out rather than removed. While a well-placed comment around non-obvious code or before a large block can increase clarity or help guide reading, frequent comments around straightforward code can be distracting. For code you want to remove, it is often better to rely on version control to keep its history than to leave it commented out.

In some cases an auto-formatter like black or cercis might help improve readability with little effort. Typos can be caught with a tool like codespell.

hveravallaskreppur commented 11 months ago

@effigies many thanks for the really good, constructive and useful feedback, much appreciated. I had missed it but Matt pointed it out to me.

Really loved the course. Thanks for your guidance.

Best, David