mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

I find the "Statistical inference" tutorial extremely hard to follow #10744

Open hoechenberger opened 2 years ago

hoechenberger commented 2 years ago

https://mne.tools/stable/auto_tutorials/stats-sensor-space/10_background_stats.html

Don't get me wrong, the content itself is great, but to me, the "important bits" are totally hidden behind a wall of code blocks that do complex visualizations and fake-data generation. Any time I look at this thing, I struggle to find the actually relevant parts – the ones that show me how to do statistics! So at the end of the day, the tutorial isn't very helpful at all, and this is a pity!

And I cannot relate to the data being used at all.

For example, this is the data:

import numpy as np

width = 40
n_subjects = 10
signal_mean = 100
signal_sd = 100
noise_sd = 0.01
gaussian_sd = 5
sigma = 1e-3  # sigma for the "hat" method
n_permutations = 'all'  # run an exact test
n_src = width * width

# For each "subject", make a smoothed noisy signal with a centered peak
rng = np.random.RandomState(2)
X = noise_sd * rng.randn(n_subjects, width, width)
# Add a signal at the center
X[:, width // 2, width // 2] = signal_mean + rng.randn(n_subjects) * signal_sd
# Spatially smooth with a 2D Gaussian kernel
size = width // 2 - 1
gaussian = np.exp(-(np.arange(-size, size + 1) ** 2 / float(gaussian_sd ** 2)))
for si in range(X.shape[0]):
    for ri in range(X.shape[1]):
        X[si, ri, :] = np.convolve(X[si, ri, :], gaussian, 'same')
    for ci in range(X.shape[2]):
        X[si, :, ci] = np.convolve(X[si, :, ci], gaussian, 'same')

What does that even mean? I can't remember the last time I used np.convolve manually, and how is this blurry thing we're creating related in any way to the neurophysiological recordings I want to analyze?
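For what it's worth, the two convolution loops are just separable Gaussian smoothing; the same effect can be sketched with `scipy.ndimage.gaussian_filter` (a hypothetical simplification for illustration, not the tutorial's actual code, and not numerically identical to the truncated kernel above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

width, n_subjects = 40, 10
rng = np.random.RandomState(2)
# Noise plus a single peak at the center, as in the tutorial snippet
X = 0.01 * rng.randn(n_subjects, width, width)
X[:, width // 2, width // 2] += 100
# One call replaces the two nested np.convolve loops
X_smooth = np.array([gaussian_filter(x, sigma=5) for x in X])
```

This at least makes the intent easier to see at a glance: blur a point-like peak so it becomes a spatial "cluster".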

If anybody has any ideas on how to make this tutorial more approachable to ordinary users, it would be greatly appreciated!

cc @sappelhoff

sappelhoff commented 2 years ago

and how is this blurry thing we're creating related in any way to the neurophysiological recordings I want to analyze?

True, I have tripped over that as well.

cbrnr commented 2 years ago

Could we collapse code like the one you show by default? That way, non-essential code (which is needed to create toy data etc. but not to do statistics) would be hidden and people could focus on the essential things.

hoechenberger commented 2 years ago

Could we collapse code like the one you show by default? That way, non-essential code (which is needed to create toy data etc. but not to do statistics) would be hidden and people could focus on the essential things.

I had thought about this too, but I'm not sure it would really help in this particular case: the toy data, to me, bears no relation to electrophysiological data, so hiding its generation might make things even worse … I wonder if a first step could be to pick a different set of example data, or a different approach to generating it.

Couldn't we simply load sample and add some random offsets or something to generate "participants"? Something like that – electrophysiological time series data. And ideally no data with a square shape, because I always get confused about which dimension is which, especially once we extract the data as a NumPy array. (We should offer xarray support so dimensions are properly labeled, but that's another discussion.)
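To illustrate the idea with plain NumPy (a hypothetical sketch only – the real thing would load sample via MNE): take one template "evoked" time series and derive fake participants from it with random amplitude offsets plus noise:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_times = 10, 200
times = np.linspace(0.0, 0.5, n_times)
# Template "ERP": a Gaussian bump peaking at 0.2 s
template = 5e-6 * np.exp(-((times - 0.2) ** 2) / (2 * 0.02**2))
# Each fake participant = randomly scaled template + sensor noise
amplitudes = 1 + 0.2 * rng.standard_normal(n_subjects)
X = amplitudes[:, np.newaxis] * template \
    + 1e-6 * rng.standard_normal((n_subjects, n_times))
# X has an unambiguous (n_subjects, n_times) shape -- no square arrays
```

The resulting array at least looks like the evoked data people extract from their own recordings.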

cbrnr commented 2 years ago

I actually really like this tutorial. It uses abstract data to make its points, sure, but that allows us to visually show what we're talking about; with real data this would not be nearly as straightforward (if visible at all). Maybe we could add to the main text how our toy data corresponds to real (EEG) data? That way, people could make the connection from the abstract toy example to their own data.

hoechenberger commented 2 years ago

Maybe we could add to the main text how our toy data corresponds to real (EEG) data? That way, people could make the connection from the abstract toy example to their own data.

I'm not sure I understand what you mean exactly, could you please give an example?

cbrnr commented 2 years ago

Maybe something like: the blob visualized in 2D corresponds to EEG channels? I don't even know if that's true TBH, but something along those lines.

larsoner commented 2 years ago

And I cannot relate to the data being used at all.

To me this is a narrative problem: we should describe why the data are created the way they are -- what that accomplishes, and how it relates to real data. I think it can be done in a couple of sentences.

I wonder if a first step could be to pick a different set of example data, or a different approach to generating it.

I would rather not. To me this is actually the simplest example that actually demonstrates the core ideas. It's also taken (almost?) directly from a paper IIRC, which should be cited in the example already (we should add it if it's not).

Couldn't we simply load sample and add some random offsets or something to generate "participants"? Something like that – electrophysiological time series data.

Anything using real data will end up comparatively more complicated and less clear, I think.

And ideally no data with a square shape, because I always get confused about which dimension is which, especially once we extract the data as a NumPy array. (We should offer xarray support so dimensions are properly labeled, but that's another discussion.)

The idea of the example is to provide an abstraction -- the X and Y dimensions could really be anything (time, space, frequency, etc.). The principles generalize to real data along any dimensions once you understand the idea. I think we need to convey this part more clearly.
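One way to convey that might be a bare-bones sign-flip permutation test that works identically no matter what the dimension represents (a generic sketch of the core idea, not MNE's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 10
# One effect value per subject -- could come from time, space, or frequency
x = 0.5 + rng.standard_normal(n_subjects)

def t_stat(v):
    # One-sample t statistic against a population mean of zero
    return v.mean() / (v.std(ddof=1) / np.sqrt(len(v)))

t_obs = t_stat(x)
# Under H0 (true mean zero), each subject's sign is exchangeable
n_perm = 1000
t_perm = np.array([t_stat(rng.choice([-1, 1], n_subjects) * x)
                   for _ in range(n_perm)])
p = (np.sum(np.abs(t_perm) >= abs(t_obs)) + 1) / (n_perm + 1)
```

The same few lines apply unchanged whether x came from a time point, a sensor, or a frequency bin, which is exactly the abstraction the toy 2D example is trying to teach.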