kimtonyhyun / analysis


Evaluation of tensor factorization #227

Closed kimtonyhyun closed 5 years ago

kimtonyhyun commented 7 years ago

@ahwillia Here is my plan for evaluating the tensor factorization code on prefrontal calcium imaging data:

  1. Compare "naive" vs. "align" exports (per #226). In particular, I would like to do a detailed accounting of the quality of the tensor fit to the raw data, and whether the different timewarp methods result in differing degrees of fit quality.
  2. Compare unconstrained vs. non-negative factorization. Intuitively, I find non-negative factorization more appealing (as long as the raw data is properly formatted, i.e. #221), since non-negative factors cannot numerically compensate for one another (and hence are perhaps more directly interpretable).
  3. Investigate factors at the single neuron level, serving as a sanity check / validation, and perhaps more? I will see if I can find single cell support for the notion that there is integration of error in the mPFC.

I'll post thoughts and conclusions in this issue as I go through the above.

For the time being, I will apply factorization to single days. I think we need to hold off on multi-day analyses until we figure out a reasonable solution for cross-day trace "normalization", which warrants its own issue thread (#228).

ahwillia commented 7 years ago

Awesome. On my way to the airport now - I'll check back in a day or so! Thanks again.

kimtonyhyun commented 7 years ago

Can you point me to the definition of Rsq in CP factorization? Is it an element-wise comparison of X and Xest?

ahwillia commented 7 years ago

It is defined here - https://github.com/schnitzer-lab/analysis/blob/master/tensor/fit_cpd.m#L44

Note that I really shouldn't be calling this R-squared, since the data aren't necessarily mean-centered and I'm taking the norm rather than the sum-of-squares.

ahwillia commented 7 years ago

But yes your intuition is correct - the residuals are computed elementwise, and then the norm of all residuals relative to the norm of the raw data gives a sense of goodness-of-fit.
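To make the distinction concrete, here is a minimal sketch of the two quantities as I understand the description above (X and Xest are the raw and reconstructed data arrays; names are just for illustration):

```matlab
resid = X(:) - Xest(:);                               % elementwise residuals

% Norm-based score as described above (no mean-centering):
fit_score = 1 - norm(resid) / norm(X(:));

% A conventional R-squared for comparison (mean-centered, sum of squares):
rsq = 1 - sum(resid.^2) / sum((X(:) - mean(X(:))).^2);
```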

kimtonyhyun commented 7 years ago

A somewhat random question as I organize my thoughts on evaluating reduced dimensionality fittings. Is the CP decomposition "incremental" with respect to increasing latent dimension in the way that SVD / PCA is for 2D matrices?

Namely, in standard matrix algebra, any matrix A can be written as a sum of rank-1 "factors" via the SVD, where the sigma_i are decreasing in magnitude:

A = sum_i sigma_i * u_i * v_i^T

Let A_k denote the truncation of the above sum at the k-th term. This is a rank-k approximation of A, analogous to what we're doing in CP tensor decomposition.

Obviously, the rank-(k+1) approximation of A can be written as

A_{k+1} = A_k + sigma_{k+1} * u_{k+1} * v_{k+1}^T

In other words, the 1,...,k-th factors in the rank-(k+1) approximation are exactly identical to those in the rank-k approximation, and only the (k+1)-th factor needs to be computed.
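(For concreteness, a quick MATLAB check of this matrix-case nesting; variable names are arbitrary:)

```matlab
A = randn(50, 40);
[U, S, V] = svd(A, 'econ');

k = 5;
A_k  = U(:,1:k)   * S(1:k,1:k)     * V(:,1:k)';      % rank-k truncation
A_k1 = U(:,1:k+1) * S(1:k+1,1:k+1) * V(:,1:k+1)';    % rank-(k+1) truncation

% The first k factors are shared; only the (k+1)-th term is new:
extra = S(k+1,k+1) * U(:,k+1) * V(:,k+1)';
norm(A_k1 - (A_k + extra), 'fro')                     % ~0 up to numerical noise
```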

So, the question: If I compute a rank-(k+1) CP decomposition of a 3D tensor by your algorithm, are the first k factors expected to be the same (up to numerical noise) as in the rank-k CP decomposition?

ahwillia commented 7 years ago

Nope! This property of the truncated SVD (the Eckart-Young theorem) only holds for CP decomposition on matrices!

http://www.sandia.gov/~tgkolda/pubs/pubfiles/SIAM-39446.pdf

kimtonyhyun commented 7 years ago

Interesting... this means that if we were to find an interesting factor at some decomposition rank r, that factor is not guaranteed to persist for higher rank decompositions. Something to think about...

kimtonyhyun commented 7 years ago

I'm focusing on non-negative factorizations. When I run:

>> [cpd_list, rsq] = fit_cpd(X, 'min_rank', 1, 'max_rank', 15, 'num_starts', 10, 'nonneg', true);
>> visualize_rank(cpd_list);

I typically find that a few models have "unexplained variance" greater than 1: c11m1d15_export-align_cpd-vis

I observe these cases only with nonneg=true. Is the fit just failing on a few instances?

ahwillia commented 7 years ago

Hmm weird. What do the factors look like?

kimtonyhyun commented 7 years ago

I checked one of the models (rank=15, rsq<0). It appears that the factors are identically zero, and expanding to Xest produces a tensor where every entry is zero. Looks like a bug?

ahwillia commented 7 years ago

I wouldn't call it a bug - that sounds like it could be a pathology of the optimization procedure we're using... Is the data tensor fully non-negative?

Come to think of it, the model parameters should be initialized as non-negative, and I might have forgotten this :thinking:. What branch are you on? I can quickly push a tentative patch.

Edit: Never mind, it looks like it is initialized reasonably.

kimtonyhyun commented 7 years ago

Is the data tensor fully non-negative?

The baseline is centered around zero, but the data tensor is not fully positive (see #221 for an example calcium trace histogram after baseline correction).


A quick question. For a fixed neuron, I am playing with a visualization that displays (among other things):

As in: neuronfit_cell69

To measure quality of fit, I was computing squared residuals (e.g. per trial). However, this measure has the consequence that the best-fit trials are often flat lines, since they have very low residuals: neuronfit_cell10

Can you suggest measures of fit that don't tend to favor flat lines (i.e. ones that take into account how non-trivial the modeled activity is)?

ahwillia commented 7 years ago

The baseline is centered around zero, but the data tensor is not fully positive

I bet you this is causing the optimization algorithm to mistakenly converge to zero, which is a saddle point of the objective function.
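If that's the issue, one quick thing to check (just a sketch, not necessarily the right preprocessing) is how much of the tensor is negative, and whether clipping the sub-baseline values at zero changes the behavior:

```matlab
% How negative is the data tensor?
frac_neg = nnz(X < 0) / numel(X);
min_val  = min(X(:));

% Crude workaround before a non-negative fit: clip sub-baseline values.
% (Whether this is appropriate for the traces is a separate question.)
X_nn = max(X, 0);
```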

Can you suggest measures of fit that don't tend to favor flat lines (i.e. ones that take into account how non-trivial the modeled activity is)?

Yep! You can normalize by the amount of variation in the raw data, giving you R-squared or similar measures. The gist is:

x = ...;      % vector of raw data for a neuron
x_est = ...;  % vector of the CP decomposition estimate
rsq = 1 - sum((x - x_est).^2) / sum((x - mean(x)).^2);

You could also put a hard threshold on var(x) - if that is below a certain threshold, exclude it from plotting.
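Extending that gist to every neuron in the tensor might look like the sketch below (this assumes X and Xest are neurons x time x trials, which may not match the actual export ordering):

```matlab
n_neurons = size(X, 1);
neuron_rsq = zeros(n_neurons, 1);
for n = 1:n_neurons
    x     = squeeze(X(n,:,:));    x     = x(:);   % all trials concatenated
    x_est = squeeze(Xest(n,:,:)); x_est = x_est(:);
    neuron_rsq(n) = 1 - sum((x - x_est).^2) / sum((x - mean(x)).^2);
end
histogram(neuron_rsq, 20);   % distribution of per-neuron fit quality
```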

kimtonyhyun commented 7 years ago

@ahwillia I will describe the actual code for inspecting the CPD fit (and more generally, reduced dimension fittings) at the individual neuron level in a separate PR. Here, I would like to give a high-level description of my approach.

As I was working through the tensor documentation, I became interested in R squared (or, fraction of variance explained) as a measure for evaluating the quality of a fit to a high-dimensional dataset like our prefrontal Ca2+ traces. For instance, the CPD fit with rank=15 seemed to leave ~0.6 "unexplained variance". My initial thoughts were:

From these questions, I realized that what I needed was an intuitive feel for the R squared measure in the context of fitting our calcium traces. I thus decided to inspect the raw and CPD fitted traces at the single neuron level. For instance, I wanted to know whether:

The last question is important, for example, in the context of verifying the "integrating error" CPD factor. We want to be sure that the fitting does not "make up" features in the dataset (for whatever reason) that are not present in the raw data!

I started with a single session dataset (switching day, c11m1d15), given the unresolved issues of #228 for multi-session fitting. I removed the baseline of the traces (per #221), and I used timewarp_method='align' for the export (per #226). As for the CPD parameters, I used a non-negative fitting with rank 15. The visualize_neuron_ktensor output looks as follows (trials colored by "correct"): c11m1d15_visktensor_rank15_nonneg

For each of the 487 neurons, I computed the R squared value between the raw and fitted traces (concatenated over all trials). I found the following distribution of R squared values: c11m1d15_rsqs-hist_rank15_nonneg Note that the R squared values range from 0.0318 (worst) to 0.9030 (best), and fill the range in between. Thus, at the single neuron level, the CPD fit does quite well for some cells but not for others. I now wanted to get an intuitive feel for the quality of the fit at different values of R squared.

I created a visualization tool that allowed me to compare the raw and fitted trace (to be described in a separate PR). Here is a screenshot of one neuron with R squared value of 0.9030 (the highest score in this CPD fit): rsq0_9030

Legend:

(An aside: I've observed a few cases where the intra-trial R squared value is negative. This can happen when the fit strongly deviates from the raw data for that trial.)
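(Roughly, the intra-trial value is the same formula restricted to a single trial, so it goes negative whenever the fit does worse than a flat line at that trial's mean. A sketch, with hypothetical indexing:)

```matlab
x     = squeeze(X(n, :, t));     % raw trace, neuron n, trial t
x_est = squeeze(Xest(n, :, t));  % fitted trace for the same trial
rsq_trial = 1 - sum((x - x_est).^2) / sum((x - mean(x)).^2);
% rsq_trial < 0 when the residuals exceed the trial's own variance.
```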

Here is an example plot of a neuron with an R squared value of 0.1030. You can see that the fit is essentially a flat line, and does not capture the calcium events in the real data: rsq0_1030

I investigated the fits for varying values of R squared. At this link, you can browse through examples of the comparison at R squared = 0.1, 0.2, ..., 0.9.

My tentative observations are:

Going forward, my idea is to evaluate the impact of various parameters (e.g. timewarp_method='align_to' vs. 'naive'; increasing rank; non-neg vs. unconstrained CPD) on the per-neuron R squared values and their distribution. For example: eliminating the non-negativity constraint increases the per-neuron R squared values by so-and-so...

Any thoughts or comments so far?

ahwillia commented 7 years ago

Excellent - these figures look great. A couple quick comments:

P.S. I like the normalization approach I mentioned two weeks ago in - https://github.com/schnitzer-lab/analysis/issues/228 - is that what you used?

kimtonyhyun commented 7 years ago

P.S. I like the normalization approach I mentioned two weeks ago in - #228 - is that what you used?

Nope. I had actually worked on this prior to your comments in #228. I'll get back to the normalization method that you suggested, but here I used only baseline removal of traces per #221.

Do we think these flat-lining neurons are a bit strange? Are they experimental artifacts (e.g. the cell goes out of view or dies?)

I think you and I are using the term "flat-lining" but meaning different things. By "flat-lining", I was referring to the case where the CPD fitted trace was predicting a flat trace, rather than the raw data having a flat line.

Here is a representative case of my kind of flat-lining, where the raw data shows sparse activity throughout the session but the CPD-predicted trace is essentially a "flat line". Note that I am using the residual measure for this plot (see discussion of multiple scores in #231), since the residual highlights the trials where there is a large absolute deviation between the raw and fitted traces: c11m1d15_evalfit_cell11_residual

A key signature of this (my) type of flat-lining is that the raw vs. fit value scatterplot (top right subpanel) is essentially a vertical line. In the above example, even when the raw trace ranges in value from 0 to ~16, the predicted trace is essentially 0.

Some more examples where the CPD prediction is a flat trace. Trials sorted by the residual measure: c11m1d15_evalfit_cell30_residual c11m1d15_evalfit_cell42_residual

On the other hand, in your reply, I think you are referring to neurons whose real raw data shows a flat line. These are pretty rare to come by, since such minimal-activity neurons are not likely to pass the manual classification phase of our preprocessing.

With regard to Cell 28 (reproduced below), I think it's interesting that the amplitude (of the raw trace) tends to decrease as the session progresses, but I believe that this feature is actually present in the DFF movie, and not a processing artifact. I have seen this type of characteristic in a few neurons, where there is heightened activity at the beginning of each session (which can be observed on multiple days). c11m1d15_evalfit_cell28_rsq

For the example of "flat-lining" above, shouldn't the first 6 or so trials show up in the "worst" set of 10 trials?

Indeed, if I switch to the residual measure of fit (rather than R squared), the first few trials show up in the worst set of 10: c11m1d15_evalfit_cell28_residual

Because the R squared takes into account the variance of the data to be fitted, I think that might somehow be discounting how bad the fits are for Trials 1–6. The residual score is more naive, and directly highlights those same trials as expected from our intuition.

Given this, it might be nice to make a scatterplot of variance of the neuron's firing versus the r-squared of its fit. I would be most interested in finding neurons that are both highly active (i.e. not flat-lined) and poorly fit

I agree. This was my motivation for asking (in the past) about a measure of fit that takes into account the "non-trivialness" of the data to be fitted.
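A sketch of that scatterplot, reusing the per-neuron R squared values and assuming the same neurons x time x trials layout as before (purely illustrative):

```matlab
neuron_var = zeros(n_neurons, 1);
for n = 1:n_neurons
    x = X(n, :, :);
    neuron_var(n) = var(x(:));   % activity level of the neuron
end
scatter(neuron_var, neuron_rsq);
xlabel('Variance of raw trace');
ylabel('Single-cell R^2');
% Points with high variance but low R^2 are the active-but-poorly-fit cells.
```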

Note that CP decomposition isn't directly optimizing R-squared, it is trying to minimize the total squared error.

I think this is perfectly fine. Different dimensionality reduction methods will have their own objective metrics. I view my current effort as just evaluating the final quality of their fits at the single neuron level (where I can apply some intuition).

ahwillia commented 7 years ago

I see, thanks for the clarification. I guess I view cells 30 and 42 as potentially more interesting than cell 28, which has a pretty simple story of decreasing amplitude as the session continues. In fact, I've seen factors in the CP decomposition capture these sorts of dynamics (see the factor below... the green colors indicate trials on different days):

image

Potential next steps / questions:

kimtonyhyun commented 7 years ago

Do we think these dynamics (in cell 28) and in the factor I posted above are real? Or due to some experimental artifact?

I just re-examined Cell 28 in the DFF movie. The trace definitely reflects the activity in the movie, so the dynamics aren't some numerical artifact introduced by our cell extraction algorithm, at least.

I also examined this cell on the other sessions. It has the same phenotype (active at beginning of sessions, then diminishing amplitude) on nearly every day: c11m1d15_cell99 (Ignore that the cell has index "99" in the above screenshot. Before I ran export on c11m1d15 for the CPD fit, I got rid of all non-cells, which turned cell index 99 into 28.)

Furthermore, I've seen this pattern (diminishing amplitude over the course of the session) on several other cells. Out of 252 cells that are present on every session from c11m1d12–c11m1d18, I found at least 12 cells that show the diminishing-amplitude phenotype. You can see those rasters here: https://stanford.box.com/s/xxaqzrz0039oemqtqn17h5gnln9un79n

This evidence suggests to me -- rather strongly -- that the diminishing-amplitude phenotype is "real", i.e. reflects the actual activity of our prefrontal neurons.

However, to consider this phenotype seriously for any conclusion, I think we need to formally eliminate at least two possible artifactual sources for this observation:

  1. Bleaching of cells by illumination: It could be the case that we're destroying the GCaMP in these cells via illumination. If so, however, these cells are bleaching very quickly (i.e. in <10 trials, where a trial can be ~12 seconds long). We have recordings of the same mouse under continuous illumination for 20 minutes, acquired after the strategy shifting sessions were complete. We could quantify the rate of bleaching in the control session and compare (a rough sketch of this follows the list).
  2. Mechanical shifts in the scope bringing cells in and out of focus: Seems unlikely to me. Nevertheless, we could consider cells in the spatial vicinity of these diminishing-amplitude cells, and see if any such neighbors are constant amplitude throughout the session.
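As a rough sketch of quantifying bleaching in the control session (F_control and fps are hypothetical names for the control-session fluorescence trace and frame rate; the actual storage format will differ):

```matlab
% Estimate a bleaching time constant from the continuous-illumination
% control session by a log-linear fit to the (smoothed) raw fluorescence.
% Use the raw (not baseline-subtracted) trace for this.
t = (0:numel(F_control)-1) / fps;               % time in seconds
F_smooth = movmean(F_control, round(10*fps));   % smooth out Ca2+ transients
p = polyfit(t, log(max(F_smooth, eps)), 1);     % slope = -1/tau
tau_bleach = -1 / p(1);                         % decay time constant (s)
```

The same fit applied to the first ~10 trials of a task session would give a comparable number for the diminishing-amplitude cells.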

It would be interesting to follow up on cell 30 and 42 and see if we can correlate their activity to something in the task.

I looked at these cells with my usual browse_rasters: c11m1d15_raster_cell30 c11m1d15_raster_cell42

At this naive level, it doesn't seem that Cells 30 and 42 correlate to the task at hand.

Related to above - it would be interesting to use CP tensor decomposition to identify outliers in the dataset.

From a technical standpoint, we could think about tensor decomposition models that are robust to outliers. However, it doesn't seem like the CP model is being negatively impacted by these sparse events.

It is also possible that there are prefrontal neurons that simply don't reflect the strategy shifting task. The activity of such unrelated neurons wouldn't be expected to follow any structure imposed by the trial-periodicity of the task, and is hence "missed" by the CPD fit, which operates on that periodic structure via the organization of X.

In other words, perhaps we shouldn't be asking dimensionality reduction fits to explain 100% of the variance. If the missed variance is coming from the "random" activity of task-unrelated cells, that might be okay. On the other hand, I would be more concerned if there are task coding cells whose activity is completely missed by the CPD fit. (It's probably worthwhile to check if any of the cells with low R squared values have clear task coding.)

More strategically, by characterizing the sparse outliers, we can probably make stronger statements about the predictive power of the model. At the moment, we are flatlining at 50% or more unexplained variance. But if much of that remaining 50% is these sparse outliers, we can say our model is still actually providing a good characterization of the dataset, and that we have a handle for understanding the remaining 50% of the variance (see points 2 & 3). I could see this helping a skeptical reviewer understand the technique.

I think you're saying the same thing here.

kimtonyhyun commented 7 years ago

What do you think about my (admittedly arbitrary) grouping of single-cell fits into "good" (Rsq > 0.7), "okay" (0.4 ≤ Rsq ≤ 0.7), and "bad" (Rsq < 0.4)?

(You can use eval_fit in #231 to inspect the fits at the single neuron level.)

Based on the above standard, the CPD fit (rank=15, non-negative) for c11m1d15 yields: 75 (15%) good fits, 184 (38%) okay fits, and 228 (47%) bad fits at the single neuron level.
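The counts above amount to a simple binning of the per-neuron R squared values, along these lines (a sketch; neuron_rsq as in the earlier sketch):

```matlab
n_good = nnz(neuron_rsq > 0.7);
n_okay = nnz(neuron_rsq >= 0.4 & neuron_rsq <= 0.7);
n_bad  = nnz(neuron_rsq < 0.4);
fprintf('good: %d (%.0f%%), okay: %d (%.0f%%), bad: %d (%.0f%%)\n', ...
    n_good, 100*n_good/numel(neuron_rsq), ...
    n_okay, 100*n_okay/numel(neuron_rsq), ...
    n_bad,  100*n_bad/numel(neuron_rsq));
```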

Next I will check if any of those "bad" fits (nearly half the neurons!) have clear task coding on c11m1d15...

ahwillia commented 7 years ago

I'm brainstorming visualizations for single-cell reconstructions. The image below was produced by python code, but we could make something similar in matlab.

I think something along these lines would help us visualize how mixed selectivity cells are modeled by multiple demixed factors. Note: This is data from Krishna's lab

image

Update: Here is another visualization along the same lines, which looks at trial-averaged activity within conditions. For example, the blue and orange could be east vs. west starts for a neuron. Note: This is data from a synthetic model network.

image

kimtonyhyun commented 7 years ago

Using the non-negative CPD factorization, I sifted for the cells that had a single-cell Rsq value >0.7 (i.e. what I previously called "good" fits). The threshold of Rsq>0.7 yielded 75 out of 487 cells (15%).

Looking at the raster plots of these "good" cells, I found that they have a very well stereotyped trial-to-trial response. For example: c11m1d15_nncpd-good_p1

You can see the rasters of all "good"-fitting cells here: https://stanford.box.com/s/aggeaq0423xnjk0juf8me6foi7h5mwye

Among the "good" cells is one with "mixed" representation. Namely, Cell 287 appears to encode for East-start as well as Error: c11m1d15_cell287

Here is the non-negative CPD fit to Cell 287: c11m1d15_cell287_nncpd-fit

We can see if this cell has the expected factor loadings (i.e. to start-location and error factors).

Generally, it seems to me that the CPD fittings do a better job (at the single neuron level) for cells that have well-stereotyped firing trial-to-trial. The fits seem to have a hard time capturing cells that fire sparsely and over a narrow time window. All of this is expected, of course, but just re-stating the observations...

kimtonyhyun commented 7 years ago

Similarly, here are the "bad" fits from the rank=15, non-negative CPD factorization, where "bad" was defined to be a single cell Rsq < 0.4: https://stanford.box.com/s/kdp4fmhpw2syb43fwmra6bt4qrxvhked

There are definitely task-variable encoding cells in this "bad"-fitting group. For example, Cell 14 has an apparent preference for East-starts (though not at 100% fidelity): c11m1d15_cell14

The single-cell Rsq value for this cell is 0.3610. The following eval_fit plot (sorted on "Signed diff") shows that the fit is poor for multiple reasons:

kimtonyhyun commented 7 years ago

I am often finding that the (non-negative) CPD fit has limited ability to capture the transient exponential decays in the raw trace data.

Here is Cell 27, which has activity near the beginning of trials, though not particularly consistently: c11m1d15_cell27

The single-cell Rsq value is 0.3142, and the eval_fit plot is as follows: c11m1d15_cell27_fit

The temporally-localized Ca2+ transients (which are missed by the CPD fit) are real -- I re-verified the trace just now against the DFF movie.

Another example, Cell 29 has preferential coding for North-end: c11m1d15_cell29

The single-cell Rsq value is 0.3225: c11m1d15_cell29_fit

If the CPD fitting generally is unable to fit Ca2+ exponentials well, do you think we ought to consider some sort of processing of the trace data prior to fitting (i.e. don't try to fit the fluorescence values directly)?
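(To make the question concrete, one illustrative option, not a proposal, would be to transform each trace into a cruder "event" signal before building the tensor, so that the slow Ca2+ decay doesn't dominate the fit. A sketch, where trace is one cell's trace as a row vector:)

```matlab
thresh = 3 * std(trace);            % crude activity threshold
d = [0, diff(trace)];               % positive derivative ~ transient onsets
events = double(d > 0 & trace > thresh);
% 'events' marks rising, supra-threshold samples instead of raw fluorescence.
```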

ahwillia commented 7 years ago

I might add a figure similar to this for the methods manuscript. The different colormaps are neurons - each is an n_trials x n_time raster image (white = inactive).

Thoughts/comments on the appearance? Is it clear what this shows? My thought is that this is a very quick way for a reader to validate that we can reconstruct a good chunk of the data. But maybe you can also pick out a few sparse outliers by eye?

image

kimtonyhyun commented 7 years ago

That's a nice visualization. Some questions:

  1. Did you make the plot in Matlab? (In that case, we would get it on our repo...)
  2. Are those 100s of trials squeezed into a short y-extent?
  3. How many neurons are shown here? Could we squeeze the x-extent (time axis), and display all neurons in a tensor (perhaps multiple columns in figure)?

My thinking was that, if we can display all neurons and their reconstructions via side-by-side rasters this way, then it could be a nice way to inspect how much of the data we've reconstructed. We could sort the neurons by some sort of CPD fit score (e.g. Rsq) too.
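A sketch of what that sorting and side-by-side display could look like (X, Xest, and neuron_rsq as in the earlier sketches):

```matlab
[~, order] = sort(neuron_rsq, 'descend');
for i = 1:5                                       % top 5 cells as an example
    n = order(i);
    subplot(5, 2, 2*i-1);
    imagesc(squeeze(X(n,:,:))');                  % trials x time, raw
    title(sprintf('Cell %d raw', n));
    subplot(5, 2, 2*i);
    imagesc(squeeze(Xest(n,:,:))');               % trials x time, reconstruction
    title(sprintf('Cell %d fit', n));
end
colormap(flipud(gray));                           % white = inactive
```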

For the manuscript, showing just a handful of neurons as you've done here is probably sufficient to get the point across that CPD can fit many individual neurons in the dataset.

ahwillia commented 7 years ago

Did you make the plot in Matlab? (In that case, we would get it on our repo...)

No, python :frowning_face:

Are those 100s of trials squeezed into a short y-extent?

Yes, 600 trials to be exact. I think it is worth playing with the image dimensions and selecting certain trial types (e.g. correct vs incorrect). This is just all the trials.

How many neurons are shown here? Could we squeeze the x-extent (time axis), and display all neurons in a tensor (perhaps multiple columns in figure)?

This is 40. I'll play with it and see how many I can squeeze in with multiple columns.

We could sort the neurons by some sort of CPD fit score (e.g. Rsq) too.

Yep was planning to do this. Great minds and all that.

For the manuscript, showing just a handful of neurons as you've done here is probably sufficient to get the point across that CPD can fit many individual neurons in the dataset.

👍