tinevez / msdanalyzer

A MATLAB class for Mean Square Displacement analysis.
36 stars 16 forks source link

FitMSD #1

Open nikos-95 opened 7 years ago

nikos-95 commented 7 years ago

Hello,

first of all great program! When fitting trajectories of shorter length (both simulated and actual measurements), however, I noticed a lot of negative diffusion coefficients and r² values (which both afaik shouldn't be possible) and dug into the code of fitMSD. If I interpret it correctly, the program determines how many time points correspond to 25% of the longest trajectory in the pool in terms of length, and than applies that to all other curves for their linear fit evaluation. This seems counter intuitive, as it includes very uncertain points from the shorter curves. The output even says "fitting the first 25% of EACH curve", but then the program doesn't take into account all the NaN points in the MSD data. This also leads to the following problem: when I have 100 tracks of length 8 each, the program will not fit any because they are all shortened to 1 point prior to fitting. If I add just one track of e.g. length 16 , then suddenly all tracks are properly fitted from point 2 to 16*0.25=4. See the issue? Edit: And yes, I realize that 25% of points of 8 is not 3, but this is beside the point here

Next, there is also another problem in regard to averaging the fits for a global diffusion coefficient. When I include ALL tracks, the estimate is about right (with a simulation of short tracks). However, when I include only the ones with r²>0.6 or so, a lot of the lower diffusion coefficients are discarded due to bad fit, causing an inflated estimate of total D. Maybe this will be solved with the problem above, when the fit can properly choose points and a correct r² is calculated for selection.

I think I came up with a coded solution for the point selection of each curve (not too hard) and can add it per request if you see the problem in the same way. However, I have no idea yet on how to deal with the negative D's or averaging in regards to low r2 so that the global mean is accurate, and what math says about the negative r2.

Thank you for reading.

tinevez commented 7 years ago

Hey @nikos-95

I will investigate. However I want to mention that it is strongly not recommended to perform MSD analysis with tracks as short as 8 points. You probably need 50 to 100. The reference by Michalet in the tutorial gives some hints about how many time points you need in your tracks.

tinevez commented 7 years ago

Also could you share an excerpt of the data that generates negative R2 values?

nikos-95 commented 7 years ago

Hi,

about the track length, I'l keep it in mind and watch for the accuracy with simulations. I don't have many alternatives though and will see if the MSDs are fitted well enough and average out. Thanks anyways I will read the paper. About the r2 values, here is an example where I generated 200 tracks with brownian motion (D=0.15) and of length 5-30 (weighted towards shorter but ignore that) by the method given in the tutorial. They look like this: image Then I performed a default fit (25%) via ma.fitMSD and you can see some negative values for r2 in the output. tinevez-msdanalyzer-example.zip I found out that matlabs gof is being read out for the degree-of-freedom adjusted r2 in this case, which can be negative

when the model contains terms that do not help to predict the response

(http://web.maths.unsw.edu.au/~adelle/Garvan/Assays/GoodnessOfFit.html)

I do not really understand though why this would happen as we have a simple linear fit and I will probably resort to reading out the normal r2. I hope that both r2 values will give results that are ordered in the same way so that only the threshold needs a slight adjustment.

I also made another interesting observation in this example and hope that it is reproducible: When I filter the resulting diffusion coefficients by fit goodness, for example r2fit >0.6, I get a mean diffusion coefficient of 0.18 +- 0.138 (N=156 of 200). Then, I changed the code of fitMSD so that t_limit is calculated from 25% of length of y AFTER NaNs have been trashed. This basically means that each MSD curve is fitted from point 2 to point (length/4). This results in the fact that less curves have more than 2 points and are fitted, and also they are fitted in a better way. The mean diffusion coefficient demonstrates it: For r2fit > 0.6, the mean dc is 0.149 +- 0.0723 from only N=83. So we have less points but a better fit, and we are in fact really close to the simulated diffusion coefficient of 0.15.

Really interesting to me and I appreciate your time!

nikos-95 commented 7 years ago

After some testing I want to add the observation that adjrsquare and rsquare values are not ordered in the same way exactly. For interpretation I sadly have to pass. I also found out that my "new implementation" with default setting considers tracks of at least 10 points for calculation of D, and r2fit will return actual values (not NaN) for tracks of length 14 or more, so that shorter ones will be discarded when filtering for fit goodness.

tinevez commented 7 years ago

Hi @nikos-95 I understand that the issue with negative R2 values comes from the fact that we report the adjusted R2 value, which can be negative if the data is really off. Again, let me reiterate that MSD analysis is not robust against small tracks. And a linear fit with 2 points is not a reasonable things to consider. It might get you close to the simulated coefficient, but this might be by chance,

I suggest in the issue to fix the t_limit issue with NaNs using your data.

nikos-95 commented 7 years ago

Ok, so the negative r2 was just a misunderstanding from my side, no worries there. I now understand that the adjusted r2 diminishes for smaller numbers of fit points, which makes sense. Specifically for two fit points it even results in NaN, as such a low count is not reliable.

I suggest in the issue to fix the t_limit issue with NaNs using your data.

Sorry, I don't quite get it. Do you mean that I should alter the code, or my data?

Anyways, here is the rearranged code that I am now using (excerpt of fitMSD):

`for i_spot = 1 : n_spots

fprintf('\b\b\b\b\b\b\b\b\b%4d/%4d', i_spot, n_spots);

msd_spot = obj.msd{i_spot};

t = msd_spot(:,1);
y = msd_spot(:,2);
w = msd_spot(:,4);

% Thrash bad data
nonnan = ~isnan(y);
t = t(nonnan);
y = y(nonnan);
w = w(nonnan);

% Clip data, never take the first one dt = 0
if clip_factor < 1
    t_limit = 2 : round(numel(t) * clip_factor);
else
    t_limit = 2 : min(1+round(clip_factor), numel(t));
end

x = t(t_limit);
y = y(t_limit);
w = w(t_limit);

if numel(y) < 2
    continue
end

[fo, gof] = fit(x, y, ft, 'Weights', w);

a(i_spot) = fo.p1;
b(i_spot) = fo.p2;
r2fit(i_spot) = gof.adjrsquare;

end`

In my experience it should work safely and correctly, feel free to adopt this change into the project.