maxentile / advanced-ml-project

autoencoding for fun and profit
MIT License

SPADE updates #17

Open maxentile opened 9 years ago

maxentile commented 9 years ago

So a couple of things that came up in our meetings were:

I just found a paper with time-course mass cytometry data, a clearer analysis task, and a bit of discussion / an attempt to correct some problems in SPADE:

Their analysis of SPADE / suggested alternative: "The MST used by SPADE is susceptible to overfitting the data and is not robust to local variation... to robustly identify high-dimensional relationships between cell types, the MST was replaced with a more highly connected graph structure... This new graph structure is then employed to produce a force-directed layout of a weighted graph containing multidimensional agglomeratively clustered points (FLOW-MAP) plot. The FLOW-MAP layout of cell clusters is more reproducible than a MST-derived layout, because the underlying graph structure is highly connected and therefore less susceptible to local edge and cluster variability"

Cartoon of FLOW-MAP vs. SPADE image

Example output of FLOW-MAP: image

Initial comments:

Some questions:

maxentile commented 9 years ago

So, I just did something simple that works pretty well and gives results that should probably be biologically interpretable. It might also be a good baseline for better approaches we come up with:

Projection onto PLS components yields a qualitatively very different picture of the data than the stringy connected clusters in "FLOW-MAP": image
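For reference, a minimal sketch of what a PLS projection computes. This is not the code behind the plot above: the thread does not say which response variable was used, so the sketch assumes the sampling day, and synthetic data stands in for the cytometry measurements. Only the first component of single-response PLS is computed, and `first_pls_component` is a hypothetical helper name.

```python
import numpy as np

def first_pls_component(X, y):
    """Return (weights, scores) for the first PLS component of X against y."""
    Xc = X - X.mean(axis=0)          # center each marker
    yc = y - y.mean()                # center the response
    w = Xc.T @ yc                    # direction of maximal covariance with y
    w /= np.linalg.norm(w)           # unit-length weight vector
    return w, Xc @ w                 # scores: projection of each cell onto w

rng = np.random.default_rng(0)
n_cells, n_markers = 500, 10
# Assumed response: the day each cell was sampled (synthetic here).
day = rng.integers(0, 5, size=n_cells).astype(float)
X = rng.normal(size=(n_cells, n_markers))
X[:, 0] += 2.0 * day                 # only marker 0 changes over time

w, scores = first_pls_component(X, day)
```

In this toy setup the weight vector concentrates on marker 0, the only marker that covaries with the assumed response.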

We can also see the first component more clearly by plotting density rather than individual cells: image

Next up:

maxentile commented 9 years ago

As an alternative to plopping all timepoints onto the same axes, I also created animated GIFs where each frame is a timepoint: http://imgur.com/a/b4qjl (scatterplot colored by local point-density)
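One cheap way to get a per-point density value for coloring, sketched under the assumption that a 2D histogram is an acceptable stand-in for a smoother density estimate (the GIFs may well have used something else). `binned_density` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np

def binned_density(xy, bins=30):
    """Approximate local point density by the count in each point's 2D histogram bin."""
    counts, xedges, yedges = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins)
    # Map each point to its bin index; clip so points on the max edge stay in range.
    ix = np.clip(np.searchsorted(xedges, xy[:, 0], side="right") - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(yedges, xy[:, 1], side="right") - 1, 0, bins - 1)
    return counts[ix, iy]

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(400, 2))      # tight cluster
sparse = rng.uniform(-3.0, 3.0, size=(100, 2))   # diffuse background
xy = np.vstack([dense, sparse])
density = binned_density(xy)
```

The resulting `density` array could be passed as the `c` argument of `matplotlib.pyplot.scatter`, one call per frame.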

maxentile commented 9 years ago

Oh, just noticed another issue: the number of cells observed each day varies widely (~5-10x) within each reprogramming trajectory: image

We'll probably need to compensate for this somehow.
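One possible compensation, sketched here as an assumption rather than a decision: subsample every day down to the size of the smallest day. Reweighting cells by inverse day size would be another option. `subsample_equal` is a hypothetical helper and the day sizes are made up.

```python
import numpy as np

def subsample_equal(day, rng):
    """Return sorted indices that keep the same number of cells from every day."""
    days, counts = np.unique(day, return_counts=True)
    n_keep = counts.min()            # size of the smallest day
    keep = [rng.choice(np.flatnonzero(day == d), size=n_keep, replace=False)
            for d in days]
    return np.sort(np.concatenate(keep))

rng = np.random.default_rng(0)
# Made-up day labels with a 10x spread in cell counts.
day = np.repeat([0, 1, 2], [1000, 200, 100])
idx = subsample_equal(day, rng)
balanced_counts = np.bincount(day[idx])
```

The cost is discarding most cells from the well-sampled days, so this is probably best treated as a sanity check against the unbalanced results.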

maxentile commented 9 years ago

Here's another way to plot how the distribution of cells along some progression axis changes over time: encode probability density as color intensity, with time on the x axis and progression on the y axis (each timestep is a column): image
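A minimal sketch of how such a matrix could be built, assuming a simple per-day histogram as the density estimate and a synthetic progression score. `density_by_day` is a hypothetical helper.

```python
import numpy as np

def density_by_day(progression, day, bins=20):
    """Return (days, edges, D); column j of D is the normalized histogram of day j."""
    days = np.unique(day)
    edges = np.linspace(progression.min(), progression.max(), bins + 1)
    cols = []
    for d in days:
        counts, _ = np.histogram(progression[day == d], bins=edges)
        cols.append(counts / counts.sum())   # each day sums to 1
    return days, edges, np.column_stack(cols)

rng = np.random.default_rng(0)
# Made-up days with unequal cell counts; progression drifts upward over time.
day = np.repeat([0, 1, 2, 3, 4], [500, 100, 300, 50, 200])
progression = 2.0 * day + rng.normal(0.0, 0.3, size=day.size)
days, edges, D = density_by_day(progression, day)
```

Because each column is normalized separately, days with few cells are not visually swamped by days with many. Something like `plt.imshow(D, origin="lower", aspect="auto")` would then render the matrix with time on the x axis.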

maxentile commented 9 years ago

Another way: represent each timestep by its vector of probability density, and then compute a distance matrix between all timesteps: image
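A small sketch of the distance-matrix computation. The thread does not say which distance was used, so Euclidean distance is an assumption here; the density vectors are hypothetical values, and `pairwise_distances` is a hypothetical helper.

```python
import numpy as np

def pairwise_distances(D):
    """Euclidean distance between every pair of columns of D."""
    V = D.T                                   # one row per timestep
    diff = V[:, None, :] - V[None, :, :]      # shape (T, T, bins)
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical density vectors: three timesteps (columns) over four progression bins.
D = np.array([[0.7, 0.6, 0.0],
              [0.3, 0.3, 0.1],
              [0.0, 0.1, 0.3],
              [0.0, 0.0, 0.6]])
M = pairwise_distances(D)
```

Here the first two timesteps have similar distributions, so `M[0, 1]` is much smaller than `M[0, 2]`. A distance that respects the ordering of bins along the progression axis might be a better fit than Euclidean.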

maxentile commented 9 years ago

We can also use distributions along a discrete progression function* instead of continuous distributions over a linear progression function: image

(*here defined by running k-means on the full dataset, then counting, for each day, how many cells belong to each cluster)
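A self-contained sketch of that procedure on synthetic data. Plain Lloyd's iterations in NumPy stand in for whatever k-means implementation was actually used; `kmeans_labels` and `cluster_counts_by_day` are hypothetical helper names.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=50):
    """Plain Lloyd's iterations; returns a cluster label for each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):                  # leave an empty cluster's center alone
                centers[j] = members.mean(axis=0)
    return labels

def cluster_counts_by_day(labels, day, k):
    """Row i holds how many cells from the i-th day fall in each of the k clusters."""
    days = np.unique(day)
    return np.stack([np.bincount(labels[day == d], minlength=k) for d in days])

rng = np.random.default_rng(0)
day = np.repeat([0, 1, 2], [300, 100, 200])            # made-up cells per day
X = rng.normal(size=(day.size, 2)) + 5.0 * day[:, None]
labels = kmeans_labels(X, 3, rng)
counts = cluster_counts_by_day(labels, day, 3)
```

Dividing each row of `counts` by its sum turns it into a discrete distribution per day, which also sidesteps the unequal number of cells per day.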

maxentile commented 9 years ago

So here's the thing I said I'd make where we use 2D contours instead of a scatterplot anim