magland / ml_ms4alg

MountainSort v4

Why do a PCA before `cluster`? #27

Closed cboulay closed 4 years ago

cboulay commented 4 years ago

In `compute_event_features_from_timeseries_model`, the principal components are calculated from a subset of clips (in time-amplitude space), then the component weightings are applied to all clips, resulting in features.

Then, in `branch_cluster`, these PC features are passed to `cluster`, which once again does PCA on a subset of the data, then applies the weights to all feature vectors, resulting in pc'd pc features. (PPC? superintendent components?)

I understand why the PCA happens inside `cluster`: each successively deeper branch operates only on the events sharing a label assigned in the previous branch, so we're dealing with a different subspace. But I don't understand why the outer PCA happens. Is there any benefit to doing a PCA on data in time-amplitude space, only to then do another PCA on these features? Shouldn't the PCA inside `cluster` effectively do everything the first outer PCA is intended to do?
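For readers following along, the two-stage flow described above can be sketched roughly as below. This is only an illustration under stated assumptions, not the ml_ms4alg code: the helper names `fit_pca` and `pca_features`, the synthetic clips, the random labels, the component counts, and the subsample sizes are all made up for the example.

```python
import numpy as np

def fit_pca(X, num_components):
    """Return (mean, components); components has shape (num_components, dim)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:num_components]

def pca_features(X, num_components, subsample_size, rng):
    """Fit PCA on a random subset of rows, then project every row."""
    n = X.shape[0]
    idx = rng.choice(n, size=min(subsample_size, n), replace=False)
    mean, comps = fit_pca(X[idx], num_components)
    return (X - mean) @ comps.T

rng = np.random.default_rng(0)

# Hypothetical stand-in for flattened clips: 5000 events, 50 samples each.
clips = rng.standard_normal((5000, 50))

# Outer PCA (time-amplitude space -> features); 20 components is an assumption.
outer = pca_features(clips, num_components=20, subsample_size=1000, rng=rng)

# Inner PCA, as it might run inside one branch: only events with one label.
labels = rng.integers(0, 3, size=clips.shape[0])  # made-up labels
branch = outer[labels == 0]
inner = pca_features(branch, num_components=10, subsample_size=500, rng=rng)

print(outer.shape, inner.shape)
```

The point of the sketch is that the inner stage sees only the already-reduced vectors for one branch, so its components are tailored to that subset, while the outer stage is fit once on a sample of all clips.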

magland commented 4 years ago

Ah. I think this relates to the other question you asked: https://github.com/magland/ml_ms4alg/issues/26

And now I think the doubling at that point was intentional (but perhaps not necessary). In the end, the clustering is done with `num_features` as intended. So that's a relief. BUT the question still remains: why do I do PC and then PC (PPC, as you say)? I think this is for convenience in programming. The `branch_cluster` function takes in vectors, whereas the other function operates on the timeseries model. So it's convenient to do it the way I have. But I think the `*2` is still unnecessary... and the equivalent result would be obtained without it. But as I said, in the end `num_features` is being used (not doubled).
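A quick numerical check of that equivalence, as a sketch only: it assumes both PCAs are fit on the same data (in the real code each stage is fit on its own subsample, so agreement there would presumably be approximate rather than exact). The helper `pca_project` and the synthetic data are hypothetical and are not taken from ml_ms4alg.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components (fit on all of X)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return (X - mean) @ Vt[:k].T

rng = np.random.default_rng(1)

# Synthetic data whose directions have clearly separated variances,
# so the principal components are well defined.
scales = 1.5 ** -np.arange(30)
X = rng.standard_normal((2000, 30)) * scales

k = 5  # stands in for num_features

direct = pca_project(X, k)                         # single PCA to k
two_stage = pca_project(pca_project(X, 2 * k), k)  # PCA to 2k, then PCA to k

# Principal components are only defined up to sign, so compare magnitudes.
max_diff = np.max(np.abs(np.abs(direct) - np.abs(two_stage)))
print(max_diff)
```

The reason this works is that the scores from the first PCA are already uncorrelated and ordered by variance, so a second PCA on them just keeps the first k coordinates (possibly with flipped signs).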

cboulay commented 4 years ago

> I think this is for convenience in programming.

OK, that's what I suspected. But I was worried that I was missing something important: that calculating PCA on one subsample, then calculating PCA on the result using a different subsample, had some theoretical basis for being more robust than just doing a single PCA.

I'm working on Utah Array data (400 µm inter-electrode spacing), so I don't need the neighbourhood adjustments. I am just trying to get a firm grasp on the branched isosplit5.

magland commented 4 years ago

Great, thanks for digging into the code and asking!