Closed cboulay closed 4 years ago
Ah. I think this relates to the other question you asked: https://github.com/magland/ml_ms4alg/issues/26
And now I think the doubling at that point was intentional (but perhaps not necessary). In the end, the clustering is done with num_features as intended. So that's a relief. BUT the question still remains: why do I do PCA and then PCA again (PPC, as you say)? I think this is for convenience in programming. The branch cluster function takes in vectors, whereas the other function operates on the timeseries model. So it's convenient to do it the way I have. But I think the *2 is still unnecessary... and the equivalent result would be obtained without it. But as I said, in the end num_features is being used (not doubled).
> I think this is for convenience in programming.
OK, that's what I suspected. But I was worried that I was missing something important: that first calculating PCA on one subsample, then calculating PCA on the result using a different subsample, had some theoretical basis for being more robust than just doing a single PCA.
I'm working on Utah Array data (400 um interelectrode spacing) so I don't need the neighbourhood adjustments. I am just trying to get a firm grasp on the branched isosplit5.
Great, thanks for digging into the code and asking!
In `compute_event_features_from_timeseries_model`, the principal components are calculated from a subset of clips (in time-amplitude space), then the component weightings are applied to all clips, resulting in `features`.

Then, in `branch_cluster`, these pc features are passed to `cluster`, which once again does PCA on a subset of data, then applies the weights to all feature-vectors, resulting in `pc`'d `pc` features. (PPC? superintendent components?)

I understand why the PCA happens inside `cluster` -- each successively deep branch is operating only on the events sharing a label assigned in the previous branch, so we're dealing with a different subspace. But I don't understand why the outer PCA happens. Is there any benefit to doing a PCA on data in time-amplitude space, only to then do another PCA on these features? Shouldn't the PCA inside `cluster` effectively do everything the first outer PCA is intended to do?
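
To make the question concrete, here is a minimal NumPy sketch of the two-stage scheme described above. It is not the ml_ms4alg code: `fit_pca`, `apply_pca`, `two_stage_features`, `single_stage_features`, the `factor` argument and the random "clips" are all hypothetical names and data chosen for illustration. It assumes plain SVD-based PCA, and it fits both stages on the same events in the first comparison, whereas the real code fits each stage on a subsample.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA on the rows of X; return (mean, components), components as rows."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project the rows of X onto previously fitted components."""
    return (X - mean) @ components.T

def two_stage_features(clips, num_features, outer_idx, inner_idx, factor=2):
    """Outer PCA to factor*num_features, then inner PCA down to num_features."""
    mean1, comp1 = fit_pca(clips[outer_idx], factor * num_features)
    outer = apply_pca(clips, mean1, comp1)      # weights applied to all clips
    mean2, comp2 = fit_pca(outer[inner_idx], num_features)
    return apply_pca(outer, mean2, comp2)       # weights applied to all vectors

def single_stage_features(clips, num_features, fit_idx):
    """One PCA straight from clip space to num_features."""
    mean, comp = fit_pca(clips[fit_idx], num_features)
    return apply_pca(clips, mean, comp)

rng = np.random.default_rng(0)
# Hypothetical data: 500 events x 40 samples, variance decaying across axes.
clips = rng.standard_normal((500, 40)) * np.linspace(5.0, 0.5, 40)
all_idx = np.arange(len(clips))
k = 5

single = single_stage_features(clips, k, all_idx)
doubled = two_stage_features(clips, k, all_idx, all_idx, factor=2)
undoubled = two_stage_features(clips, k, all_idx, all_idx, factor=1)

# Principal axes are only defined up to sign, so compare absolute values.
same_with_doubling = np.allclose(np.abs(single), np.abs(doubled))
same_without_doubling = np.allclose(np.abs(single), np.abs(undoubled))

# Fitting the inner PCA on a different subsample breaks the exact match.
subset = rng.choice(len(clips), size=100, replace=False)
resampled = two_stage_features(clips, k, all_idx, subset, factor=2)
differs_when_resampled = not np.allclose(np.abs(single), np.abs(resampled))
```

When both stages are fit on the same events, the outer scores are already uncorrelated and sorted by variance, so the inner PCA only picks out the leading `k` columns (up to sign), with or without the doubling. Any difference from a single PCA in this sketch therefore comes from fitting the inner stage on a different subsample, which is the situation inside a branch where only the events sharing a label are used.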