magland / ml_ms4alg

MountainSort v4

Clusters can be split in phase 2 (after reassignment) #12

Open tjd2002 opened 5 years ago

tjd2002 commented 5 years ago

The reassignment in phase 1 uses evidence from a first round of clustering to attempt to assign events to the channel where, if they were clustered correctly, their 'true' cluster's centroid would have its peak. Inevitably, this reassignment is imperfect, and we think it can result in a narrow class of 'split clusters'.

Consider two partially overlapping clusters, K1 and K2, whose templates (average waveform, a.k.a. centroid) have peak amplitudes on different channels, C1 and C2, respectively. In phase 1, Isosplit will split these clusters along some hyperplane, and every event is then assigned to the peak channel of the cluster it was placed in (irrespective of the peak amplitude of the individual event). Inevitably, there will be some error (hopefully small) in the reassignment. For example, some events that are truly from cluster K2 will end up (either by assignment or reassignment) on channel C1. So far so good.
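To make the reassignment rule concrete, here is a minimal sketch of it. This is not the ml_ms4alg code: the function names, the `(n_events, n_channels, n_samples)` waveform layout, and the use of maximum absolute amplitude to define the peak are all assumptions made for illustration.

```python
import numpy as np

def template_peak_channels(waveforms, labels):
    """Map each cluster label to the channel where its template peaks.

    waveforms: array of shape (n_events, n_channels, n_samples) (assumed layout)
    labels:    array of shape (n_events,) with one cluster label per event
    """
    peak = {}
    for k in np.unique(labels):
        # Template = average waveform of the cluster, shape (n_channels, n_samples)
        template = waveforms[labels == k].mean(axis=0)
        # Peak channel = channel with the largest absolute amplitude (assumption)
        peak[int(k)] = int(np.argmax(np.abs(template).max(axis=1)))
    return peak

def reassign_events(waveforms, labels):
    """Return, for each event, the peak channel of the cluster it was placed in.

    The event's own peak amplitude is deliberately ignored, so a K2 event
    that was mislabeled as K1 follows K1 to channel C1.
    """
    peak = template_peak_channels(waveforms, labels)
    return np.array([peak[int(k)] for k in labels])
```

Because the channel is taken from the cluster template and not from the individual event, any labeling error from the first round of clustering is carried over into the channel assignment.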

If we were to repeat the clustering, on the same events, then we should end up making the same error, and the situation would be stable (i.e. the erroneous K2 events should get clustered in with K1 again).

However, the second round of Isosplit in phase 2 runs only on the events assigned (or reassigned) to the central channel. Since this is done with a new set of input events in a new PC space, it is plausible that some of the erroneously assigned events will get separated out into a new cluster. If K1 is contaminated by K2 events, we could end up with two very similar clusters, each containing some of the spikes for the true K2: the 'main' cluster on C2, and a cluster of (probably very few) 'orphan' spikes clustered on C1.

During curation, this would look like K2 had been split, with one cluster containing the great majority of the spikes. This is something our users report seeing under MS4.

This is currently just a hypothesis for the splitting. I plan to address it by adding a check at the end of phase 2 to see whether any of the resulting clusters have their template peak on a channel other than the central channel of the neighborhood. We could also better diagnose the operation of MS4 if we saved the 'home' neighborhood for each cluster when combining clusters after phase 2 (this is requested in a separate issue, #11).
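A rough sketch of what such a check might look like follows. The names `peak_channel` and `off_center_clusters` are hypothetical and are not part of ml_ms4alg. The sketch assumes waveforms are stored as `(n_events, n_channels, n_samples)`, with channels indexed within the neighborhood, and that the peak is defined by maximum absolute amplitude.

```python
import numpy as np

def peak_channel(template):
    """Index of the channel with the largest absolute amplitude (assumption)."""
    return int(np.argmax(np.max(np.abs(template), axis=1)))

def off_center_clusters(waveforms, labels, central_channel):
    """Flag phase-2 clusters whose template peaks away from the central channel.

    waveforms:       (n_events, n_channels, n_samples), channels local to
                     the neighborhood (assumed layout)
    labels:          (n_events,) cluster labels from the phase-2 clustering
    central_channel: local index of the neighborhood's central channel
    Returns a list of (label, peak_channel) pairs for the flagged clusters.
    """
    flagged = []
    for k in np.unique(labels):
        template = waveforms[labels == k].mean(axis=0)
        ch = peak_channel(template)
        if ch != central_channel:
            flagged.append((int(k), ch))
    return flagged
```

Under the hypothesis above, the 'orphan' cluster on C1 would show up in the returned list with its peak on C2. What to do with a flagged cluster (discard it, merge it, or move its events to the other neighborhood) is left open here.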

cc @hrjoo

tjd2002 commented 5 years ago

[NB this is somewhat of a placeholder while we work to test this hypothesis with real data and additional diagnostics. @magland feel free to assign to me for now]