so-wise / weddell_gyre_clusters

Unsupervised classification of Weddell Gyre profiles
MIT License

Consult with Maike and Issa on variable preconditioning and dimension reduction (alternatives to PCA) #6

Closed DaniJonesOcean closed 3 years ago

DaniJonesOcean commented 3 years ago

Perhaps t-SNE would be sensible. It helps in cases where the structures aren't strictly Gaussian, as assumed by GMM.

DaniJonesOcean commented 3 years ago

Here's some t-SNE code from the carbon clusters paper:

    # Assumes: X is an (n_samples, n_features) array of preconditioned data, and
    # labels_nonan, norm, and cmap (cluster labels, colour settings) exist already
    import matplotlib.pyplot as plt
    from sklearn import manifold

    n_components = 2  # embed into 2D for plotting
    perplexity = 5
    tsne = manifold.TSNE(n_components=n_components, init='random',
                         random_state=0, perplexity=perplexity)
    Y = tsne.fit_transform(X)
    figN1 = plt.figure()
    axN1 = figN1.add_subplot(1, 1, 1)
    plt.scatter(Y[:, 0], Y[:, 1], c=labels_nonan + 1, norm=norm, cmap=cmap)

DaniJonesOcean commented 3 years ago

Hi @maikejulie! Could you add thoughts here about variable preprocessing and dimensionality reduction, please? It would be good to explore different approaches.

maikejulie commented 3 years ago

Hi! I nominally start with a simple scaling and only add complexity if it looks necessary. A useful 'check' is one of seaborn's jointplots: a hex or density plot or two doesn't go amiss for assessing potential density issues.

Seaborn pairplots are similarly great!

maikejulie commented 3 years ago

I should specify: I 'have a look' at the data both before and after applying scaling and also running it through something like t-SNE or UMAP.

DaniJonesOcean commented 3 years ago

Hi @maikejulie! That makes sense. I've done that sort of thing at various hackathons. I wonder why I've never thought to try it with unsupervised classification. 🤔

I guess by "density issues" you mean situations where the values are distributed in such a way as to potentially bias or otherwise complicate the rest of the analysis. Right? For example, mixed layer distributions have a very non-gaussian distribution and are sometimes log-transformed before being fed into an algorithm's training process.
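
As a toy illustration of the log-transform point (synthetic data, numpy only, not the actual mixed layer depths):

```python
# Mixed-layer-depth-like values are roughly lognormal, so log() pulls the
# long right tail in and leaves something much closer to Gaussian.
import numpy as np

rng = np.random.default_rng(42)
mld = rng.lognormal(mean=4.0, sigma=0.8, size=10_000)  # synthetic MLD in metres

def skewness(x):
    """Sample skewness: third standardised moment."""
    x = np.asarray(x)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

print(skewness(mld))          # strongly positive (long right tail)
print(skewness(np.log(mld)))  # close to zero after the transform
```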

Actions on @DaniJonesOcean:

* Make some pairplots/jointplots both before and after scaling

DaniJonesOcean commented 3 years ago

I haven't had any luck plotting pairplots of the non-transformed data with all dimensions. I think there are too many, and it always makes my kernel crash. I do have this pairplot of the three principal components though:

pairplot_pca

This is what is given to GMM. I guess it's a bit "spiky" in places, which is worth considering...
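
A minimal sketch of that pipeline (reduce to three principal components, then fit a Gaussian mixture), using synthetic stand-in data and an assumed cluster count of 5:

```python
# Sketch of the PCA -> GMM pipeline described above; X is a stand-in for
# the preconditioned profile data, and n_components=5 is just an example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # placeholder for the preconditioned profiles

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # this is what gets handed to GMM

gmm = GaussianMixture(n_components=5, random_state=0).fit(X_reduced)
labels = gmm.predict(X_reduced)
```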

DaniJonesOcean commented 3 years ago

Here's t-SNE with some initial cluster labels (still experimenting with the colourbars).

tSNE

maikejulie commented 3 years ago

It is a bit 'spiky' perhaps... Have you tried using some sort of kernel PCA?
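
For reference, kernel PCA in scikit-learn has the same interface as ordinary PCA; the RBF kernel and `gamma` value below are assumptions for illustration:

```python
# Hedged sketch of the kernel PCA suggestion: drop-in replacement for PCA,
# but the kernel can unfold non-linear structure. Kernel/gamma are examples.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # stand-in for the profile data

kpca = KernelPCA(n_components=3, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X)  # note: this builds a 500 x 500 kernel matrix
```

The n x n kernel matrix is also why memory grows quickly with sample size.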

maikejulie commented 3 years ago

> Hi @maikejulie! That makes sense. I've done that sort of thing at various hackathons. I wonder why I've never thought to try it with unsupervised classification. 🤔
>
> I guess by "density issues" you mean situations where the values are distributed in such a way as to potentially bias or otherwise complicate the rest of the analysis. Right? For example, mixed layer distributions have a very non-gaussian distribution and are sometimes log-transformed before being fed into an algorithm's training process.
>
> Actions on @DaniJonesOcean:
>
> * Make some pairplots/jointplots both before and after scaling

Yes, that is what I was referring to.

maikejulie commented 3 years ago

> Here's t-SNE with some initial cluster labels (still experimenting with the colourbars).
>
> tSNE

Looks cool! I default to categorical colour palettes from seaborn. They're not terribly useful if you have more than a few clusters, though!

DaniJonesOcean commented 3 years ago

> It is a bit 'spiky' perhaps... Have you tried using some sort of kernel PCA?

Thanks for the suggestion. :)

I've not been able to get this to work so far. Even with a relatively small training dataset, this method always seems to kill my IPython kernel. I'm guessing it uses a lot of memory; I think I'll have to wait until I can properly migrate onto HPC to use a large-memory method like this.
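
One possible low-memory workaround (not something tried in this thread) is to approximate the kernel map on a subsample with Nystroem and then run ordinary PCA in the approximate feature space, avoiding the full n x n kernel matrix. The kernel, `gamma`, and component counts below are illustrative assumptions:

```python
# Sketch: Nystroem kernel approximation + linear PCA as a memory-friendly
# stand-in for exact kernel PCA on a larger profile set.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))  # stand-in for a larger profile set

feature_map = Nystroem(kernel='rbf', gamma=0.1, n_components=100,
                       random_state=0)
X_features = feature_map.fit_transform(X)  # 5000 x 100, not 5000 x 5000
X_kpca_approx = PCA(n_components=3).fit_transform(X_features)
```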

In order to migrate onto our local HPC, I'll need to solve my weird container/environment issues. I always seem to be having those... I thought Docker was going to solve all my problems in that arena, haha.

DaniJonesOcean commented 3 years ago

Ooh, that t-SNE plot was from a very small sample (1000 profiles!). Here's what it looks like if you use a much larger fraction of the profiles and colour-code it.

tSNE_tmp

They're reasonably separate in this t-SNE space, I'd say. We probably won't do much better than this with our highly correlated ocean data. Would you agree?
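
The subsampling step implied above can be sketched as follows; the array sizes and perplexity are toy values, not the ones used for the plot:

```python
# Sketch: run t-SNE on a random subsample of profiles rather than the full
# set, which keeps runtime and memory manageable.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))  # stand-in for the full set of profiles

idx = rng.choice(len(X), size=500, replace=False)  # random subsample
Y = TSNE(n_components=2, init='random', random_state=0,
         perplexity=30).fit_transform(X[idx])
```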

DaniJonesOcean commented 3 years ago

Documenting some UMAP runtime errors here:

    Exception ignored in: <function Image.__del__ at 0x7f883e549040>
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.8/tkinter/__init__.py", line 4017, in __del__
        self.tk.call('image', 'delete', self.name)
    RuntimeError: main thread is not in main loop
    Tcl_AsyncDelete: async handler deleted by the wrong thread
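
This kind of Tcl/Tk error typically comes from matplotlib's interactive TkAgg backend being touched off the main thread. One common workaround (an assumption about the cause here, not a confirmed fix) is to force the non-interactive Agg backend and save figures to file:

```python
# Force the non-interactive Agg backend before pyplot is imported, so no
# Tk windows (and no Tk thread issues) are involved.
import matplotlib
matplotlib.use('Agg')            # must happen before importing pyplot
import matplotlib.pyplot as plt

fig = plt.figure()
fig.savefig('tsne_check.png')    # write to file instead of opening a window
```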

DaniJonesOcean commented 3 years ago

Kernel PCA didn't seem to get rid of the "spikes". The spikes could be an artefact of how the intervals are chosen in Seaborn. Perhaps wider intervals would be better.
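
A quick numpy-only check of the interval-width point (synthetic data, just to illustrate the effect): the same sample looks 'spiky' with many narrow bins and smooth with fewer, wider ones.

```python
# Histogram the same sample at two bin counts to see how interval choice
# alone changes apparent spikiness.
import numpy as np

rng = np.random.default_rng(0)
pc1 = rng.normal(size=2000)  # stand-in for one principal component

narrow, _ = np.histogram(pc1, bins=200)  # many narrow intervals -> spiky
wide, _ = np.histogram(pc1, bins=20)     # fewer, wider intervals -> smoother
```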

UMAP produces interesting results, but I'm running into lots of crashes and memory errors. I may not have time to properly use UMAP just now.

I'm closing this issue for now, as we've explored this reasonably thoroughly.