DaniJonesOcean closed this issue 3 years ago.
Here's some t-SNE code from the carbon clusters paper:
```python
# Assumes X (scaled profile data), labels_nonan, norm, and cmap are defined earlier
from sklearn import manifold
import matplotlib.pyplot as plt

perplexity = 5
n_components = 2  # embed into 2D for the scatter plot below
tsne = manifold.TSNE(n_components=n_components, init='random',
                     random_state=0, perplexity=perplexity)
Y = tsne.fit_transform(X)

figN1 = plt.figure()
axN1 = figN1.add_subplot(1, 1, 1)
axN1.scatter(Y[:, 0], Y[:, 1], c=labels_nonan + 1, norm=norm, cmap=cmap)
```
Hi @maikejulie! Could you add thoughts here about variable preprocessing and dimensionality reduction, please? It would be good to explore different approaches.
I should specify: I 'have a look' at the data both before and after applying scaling, and also after running it through something like t-SNE or UMAP.
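For concreteness, here's a minimal sketch of the scaling step, assuming scikit-learn's `StandardScaler`; the array `X` below is random stand-in data, not the actual ocean profiles from this thread:

```python
# Standardise each variable before dimensionality reduction, so that no
# single variable dominates the distance metric.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic variables with very different means and spreads
X = rng.normal(loc=[10.0, -2.0], scale=[5.0, 0.1], size=(500, 2))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has (approximately) zero mean and unit variance
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Comparing pairplots of `X` and `X_scaled` is then one quick way to 'have a look' before and after.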
Hi @maikejulie! That makes sense. I've done that sort of thing at various hackathons. I wonder why I've never thought to try it with unsupervised classification. 🤔
I guess by "density issues" you mean situations where the values are distributed in such a way as to potentially bias or otherwise complicate the rest of the analysis. Right? For example, mixed layer depths have a very non-Gaussian distribution and are sometimes log-transformed before being fed into an algorithm's training process.
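To illustrate the log-transform point, here's a hedged sketch on synthetic stand-in data (a lognormal sample playing the role of mixed layer depths; the actual distributions will differ):

```python
# Mixed-layer-depth-like values: strictly positive and heavy-tailed, so a log
# transform makes them much closer to Gaussian before training.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
mld = rng.lognormal(mean=4.0, sigma=0.8, size=10_000)  # synthetic, "metres"

log_mld = np.log(mld)  # now roughly Gaussian

# Skewness drops sharply after the transform
print(skew(mld), skew(log_mld))
```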
Actions on @DanJonesOcean:

* Make some pairplots/jointplots both before and after scaling
I haven't had any luck plotting pairplots of the non-transformed data with all dimensions. I think there are too many, and it always makes my kernel crash. I do have this pairplot of the three principal components though:
This is what is given to GMM. I guess it's a bit "spiky" in places, which is worth considering...
Here's t-SNE with some initial cluster labels (still experimenting with the colourbars).
It is a bit 'spiky' perhaps... Have you tried using some sort of kernel PCA?
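A minimal sketch of the kernel PCA idea, assuming scikit-learn's `KernelPCA` on synthetic stand-in data; note it builds an n_samples × n_samples kernel matrix, which is why memory blows up on large training sets:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))  # keep n_samples small: memory scales as n**2

# RBF kernel can capture non-linear structure that plain PCA misses;
# gamma here is an illustrative guess and would need tuning.
kpca = KernelPCA(n_components=3, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (500, 3)
```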
Yes, that is what I was referring to.
> Here's t-SNE with some initial cluster labels (still experimenting with the colourbars).
Looks cool! I default to categorical colourbars from seaborn. Not terribly useful if you have more than a few clusters, though!
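One hedged way to build such a categorical colourbar, using a seaborn qualitative palette wrapped in a `ListedColormap` (the labels here are synthetic; palette choice is just an example):

```python
import numpy as np
import seaborn as sns
from matplotlib.colors import ListedColormap, BoundaryNorm

n_clusters = 6
labels = np.random.default_rng(4).integers(0, n_clusters, size=200)

# One distinct colour per cluster, with bin edges centred on the labels
cmap = ListedColormap(sns.color_palette('colorblind', n_clusters))
norm = BoundaryNorm(np.arange(n_clusters + 1) - 0.5, n_clusters)

# plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap=cmap, norm=norm) would then
# colour each cluster discretely rather than on a continuous scale.
print(cmap.N)
```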
> It is a bit 'spiky' perhaps... Have you tried using some sort of kernel PCA?
Thanks for the suggestion. :)
I've not been able to get this to work so far. Even with a relatively small training dataset, this method seems to always kill my ipython kernel. I'm guessing that it uses a lot of memory; I think I'll have to wait until I can properly migrate onto HPC to use a large-memory method like this.
In order to migrate onto our local HPC, I'll need to solve my weird container/environment issues. I always seem to be having those...I thought that Docker was going to solve all my problems in that arena, haha.
Ooh, that t-SNE plot was from a very small sample (1000 profiles!). Here's what it looks like if you use a much larger fraction of the profiles and colour-code it.
The clusters are reasonably well separated in this t-SNE space, I'd say. You probably won't do much better than this with our highly correlated ocean data. Would you agree?
Documenting some UMAP runtime errors here:
```
Exception ignored in: <function Image.__del__ at 0x7f883e549040>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/tkinter/__init__.py", line 4017, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
```
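This Tcl/Tk error typically appears when matplotlib's interactive TkAgg backend is driven from outside the main thread (common in notebook setups). A hedged workaround, assuming the figures only need to be saved to file rather than shown interactively, is to force the non-interactive Agg backend:

```python
import matplotlib
matplotlib.use('Agg')  # must run before `import matplotlib.pyplot`
import matplotlib.pyplot as plt

# A trivial figure to confirm plotting works without touching Tk at all
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
fig.savefig('check.png')  # writes to file; no Tk window is created
plt.close(fig)

print(matplotlib.get_backend())
```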
Kernel PCA didn't seem to get rid of the "spikes". The spikes could be an artefact of how the intervals are chosen in Seaborn. Perhaps wider intervals would be better.
UMAP produces interesting results, but I'm running into lots of crashes and memory errors. I may not have time to properly use UMAP just now.
I'm closing this issue for now, as we've explored this about as far as we need to.
Perhaps t-SNE would be sensible. It helps in cases where the structures aren't strictly Gaussian, as assumed by GMM.
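To make the GMM connection concrete, a minimal sketch on synthetic data (not the thread's actual profiles): fit a `GaussianMixture`, then use an embedding like t-SNE as a visual sanity check on whether the fitted Gaussian clusters reflect real structure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
# Two well-separated synthetic clusters in 2D
X = np.vstack([rng.normal(-3, 1, size=(200, 2)),
               rng.normal(3, 1, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
print(np.unique(labels))  # both mixture components are used
```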