mickcrosse / mTRF-Toolbox

A MATLAB package for modelling multivariate stimulus-response data
https://cnspworkshop.net
BSD 3-Clause "New" or "Revised" License

Bug in cell_to_time_samples.m? #3

Closed solleo closed 4 years ago

solleo commented 4 years ago

Hi Mick,

I was trying to do leave-one-trial-out cross-validation using the new mTRFcrossval.m, so I changed https://github.com/mickcrosse/mTRF-Toolbox/blob/8b517986d7a9bcc987a3d06b3abe47b380800914/mTRFcrossval.m#L128 to

rndtmsmp = 1:tottmidx;

Then I got all zeros for the training data in the first CV fold (testing = 1st trial, training = 2nd to last trials), which was very surprising.

So I dug in further and found that https://github.com/mickcrosse/mTRF-Toolbox/blob/8b517986d7a9bcc987a3d06b3abe47b380800914/cell_to_time_samples.m#L12 actually returns the number of time points of the first trial (cell) only, not of all the trials (cells). Because of this, no time points outside the first trial could be extracted, so all the training-data matrices remained zeros as initialized. If I understand the function correctly, I believe line 12 should be:

allidx = cellfun(@(x) size(x,1), xcells);

in order to get the number of time points in every cell.
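
For concreteness, here is a toy illustration of the difference (the data and the first-cell-only line are hypothetical, just to show the behaviour, not the actual contents of cell_to_time_samples.m):

% Hypothetical example: two trials with different numbers of time points.
xcells = {randn(100,16); randn(120,16)};      % column-shaped cell array of trials

% Counting rows of the first cell only covers trial 1, which matches the
% behaviour described above.
nFirstOnly = size(xcells{1},1);               % 100

% Counting rows of every cell gives the per-trial lengths needed to index
% time points across all trials.
allidx   = cellfun(@(x) size(x,1), xcells);   % [100; 120]
tottmidx = sum(allidx);                       % 220 time points in total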

natezuk commented 4 years ago

Thank you for notifying us about this issue, and sorry about the delayed reply. This relates to one of the edits I made to the mTRF Toolbox. I have made the change you suggested to cell_to_time_samples, and I have added a check at the beginning of the function that transposes xcells in case the input is a row-shaped cell array.
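
A minimal sketch of what such a check can look like (the exact line in the commit may differ):

% Orientation check, assuming xcells is the input cell array of trials:
% if the trials arrive as a 1-by-N row cell array, flip it to N-by-1 so that
% later cellfun and vertical-concatenation steps behave the same either way.
if isrow(xcells)
    xcells = xcells(:);
end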

However, changing line 128 to rndtmsmp = 1:tottmidx; will not, on its own, give you leave-one-trial-out. The data is split evenly into 10 folds on the following line, so unless the data contains exactly 10 trials with the same number of time samples, some of the folds will contain data from more than one trial. If you also replace line 130 with foldidx = cumsum([0; cellfun(@(n) size(n,1),x)]); (and if the input stim and resp have been transformed into column-shaped cell arrays), that will do leave-one-trial-out cross-validation.
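
Roughly, the idea is the following (a sketch only; rndtmsmp, tottmidx and x are assumed from the surrounding mTRFcrossval code, and the loop body is omitted):

% Leave-one-trial-out fold boundaries, assuming x is a column cell array with
% one trial per cell and tottmidx is the total number of time samples.
rndtmsmp = 1:tottmidx;                               % keep samples in trial order
foldidx  = cumsum([0; cellfun(@(n) size(n,1), x)]);  % cumulative trial boundaries
nfolds   = numel(x);                                 % one fold per trial

for k = 1:nfolds
    testidx  = rndtmsmp(foldidx(k)+1:foldidx(k+1));  % all samples of trial k
    trainidx = setdiff(rndtmsmp, testidx);           % samples of every other trial
    % ...train on trainidx, evaluate on testidx...
end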

Mick and I are working on a separate function that does trial-by-trial testing; we should have that up soon. But I recommend 10-fold cross-validation rather than leave-one-trial-out when determining the optimal lambda, because 10-fold cross-validation produces more consistent and interpretable tuning curves.

solleo commented 4 years ago

Thanks for the reply, Nate. 1) The inputs were row cell arrays (1 x #trials) because that was the convention of the previous code, and it is still how the documentation describes the inputs: https://github.com/mickcrosse/mTRF-Toolbox/blob/8b517986d7a9bcc987a3d06b3abe47b380800914/mTRFcrossval.m#L17

2) Yes, you're right. I forgot to mention that I also changed nfolds to equal the number of runs entered.

3-1) Thank you for the recommendation. But do you know of any reference (theoretical background and/or empirical validation) for 10-fold CV being more consistent than LOOCV? I can only guess it might be related to the number of trials you have. And what about split-half, the other extreme?

3-2) Also, wouldn't randomly selecting time points, as is done now in https://github.com/mickcrosse/mTRF-Toolbox/blob/8b517986d7a9bcc987a3d06b3abe47b380800914/mTRFcrossval.m#L128, disrupt the temporal dependency in the data (and lead to a wrong estimate of the lambda that controls it in the prediction)? This seems to be motivated by dividing the data equally into N folds, but I think one can still work at the trial level and randomly leave out the remainder (e.g., for 23 trials, throw out 3 trials at random, then use 2 trials for each fold), as in the sketch below.
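
A rough sketch of what I mean (hypothetical code, not part of the toolbox):

% Trial-level folds with the remainder dropped at random (hypothetical sketch).
ntrials       = 23;                              % e.g. 23 trials
nfolds        = 10;
trialsPerFold = floor(ntrials/nfolds);           % 2 trials per fold
nkeep         = trialsPerFold*nfolds;            % 20 trials kept, 3 dropped

keep  = randperm(ntrials, nkeep);                % randomly discard the remainder
folds = reshape(keep, trialsPerFold, nfolds);    % column k = the trials in fold k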

natezuk commented 4 years ago

1) OK, thanks for reminding me about that part of the documentation; that's my mistake. I have mostly been working with vertical cell arrays. In its current form it should now work with stim and resp oriented either as vertical or horizontal cell arrays. Let me know if you continue to run into issues with this.

3-1) The issue relates more to the variability of the data from trial to trial. Below I have attached a figure from MATLAB which compares envelope-based forward modeling with 10-fold CV to leave-one-trial-out CV with 10 trials. I used Subject 2 from Broderick et al.'s (2019) Natural Speech dataset on Dryad, and the code I ran to generate it is demo_10CV_vs_trial_forward.m in the cv_help branch. Subject 2's data is somewhat noisy and the trial-by-trial tuning curves are all over the place. With less noisy data I think this is less of an issue, but in general 10-fold CV is more reliable.
Also, I am in favor of a split-half approach for testing an optimized model, but I think there should be more than two splits of the data for cross-validation when tuning lambda, in order to get a clearer picture of the variance in the error for each lambda.

[Figure: Subject2_CV_loo_forward — tuning curves for 10-fold CV vs. leave-one-trial-out CV]

3-2) Admittedly, I'm not sure I entirely understand your point here. The sampling occurs after the design matrix is created, so sampling rows of the design matrix randomly will still maintain the temporal dependence in the input data relevant for modeling the mTRF and predicting the output. The estimate for lambda should be more reliable too (at least for the training data provided), because the X'X and X'y on each fold will more consistently represent the average across the training set.
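
To illustrate the point with a hand-rolled lag matrix (this is not the toolbox's own design-matrix code, just a toy example): each row of the lagged design matrix already holds the full window of stimulus samples for one output time point, so sampling rows at random does not break that window.

% Toy lagged design matrix (stand-in for the real one built by the toolbox).
stim = randn(1000,1);
lags = 0:9;                                  % 10 sample lags
T    = numel(stim);
X    = zeros(T, numel(lags));
for i = 1:numel(lags)
    X(1+lags(i):T, i) = stim(1:T-lags(i));   % column i = stimulus delayed by lags(i)
end
y = randn(T,1);                              % toy response

% A random fold of time points: each sampled row still carries its own lag
% window, and the per-fold X'X and X'y approximate those of the full set.
rows = randperm(T, 100);
XtX  = X(rows,:)' * X(rows,:);
Xty  = X(rows,:)' * y(rows);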

solleo commented 4 years ago

I see it now. Thank you so much!