vferat / pycrostates

https://pycrostates.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
37 stars 11 forks source link

Segmentation statistics #106

Open Mattlab-hub opened 1 year ago

Mattlab-hub commented 1 year ago

Hello, I have been playing around with this toolbox and really appreciate its segmentation metrics to aid in finding the right number of states to choose. However, one enhancement that would be truly helpful would be something that tells you what your longest and shortest states are after segmentation. The current metrics do not account for segmentation, only the ModK. Thus the different choices on smoothing variables still need to be accounted for along with the number of states chosen. Given the literature we should be aiming for 60ms - 150ms state segmentation lengths or at least something that follows along the suggested literature from Lehmann et al., (1987) to Michel & Koenig (2018). For my use I threw together a quick pipeline that reads in the segmentation output from pycrostates and outputs the longest state, shortest state, mean state, and standard deviation of state lengths--all in milliseconds. I have it here if you felt like integrating it: https://github.com/Winter-LabHub/Microstate-min-and-max-length

I also have a permutation pipeline that may be helpful in determining some of the other statistics of state segmentation: https://github.com/Winter-LabHub/two_sample_Permutation_ttest

vferat commented 1 year ago

Hey @Winter-LabHub,

Thanks a lot for your feedback !

Using the min/max segment durations could be an interesting metric to validate the number of cluster centers. However, the data used for clustering are not necessarily ordered in time (this is even rarely the case because gfp peaks are often used, and/or data from different subjects are mixed). In addition, as small period of noise would completely mess-up those metrics. That's why I prefer not to implement it as a metric in the library.

However, we can easily obtain these metrics by calculating the segmentation of the fitted data with cluster.predict(inst) and then calculate microstates parameters with the option return_dist=True. As described in this tutorial you will have access to each microstate duration distribution, from which you can easily extract min and max durations.

Note that theses distributions are heavly impacted by the temporal smoothing parameters used in cluster.predict. (keeping default values will avoid smoothing).

A minimal example would look like this

ModK = ModKMeans(n_clusters=5, random_state=42)
ModK.fit(raw)
segmentation = ModK.predict(raw)
parameters = segmentation.compute_parameters(return_dist=True)

for cluster_name in cluster_names:
    min = np.min(parameters[f'{cluster_names}_dist_durs'])
    print(cluster_names, min)

What do you think of this workaround ?

Mattlab-hub commented 1 year ago

Hi @vferat,

I see what you are saying, but having a quick output of min, mean, std, and max might be beneficial for the process of figuring out what combination of smoothing parameters works best, onto of some of the other statistics your toolbox already provides. I am working on MEG data so I am having to play around a bit with the parameters and go back trying a different number of states and try different combinations of smoothing parameters again and again to see what gives me something that start looking like the literature. The code you provide works as well, I just figured future users might want a quick output so they can quickly try a bunch of different options with minimal code. The code I wrote was particularly for epoched data on the segmentation.labels. It can easily be flattened since it is [epochs, time_points]. I calculate the amount of time_points (indices) that are in a millisecond based on the sampling rate, and after I flatten the segmentation.labels then I can easily extract all those statistics. But it too is a chunk of code, so it would be more convenient for your code or mine to be integrated into the toolbox as a singular function for users going through this exploratory phase of state numbers and smoothing parameter settings.