utterances-bot commented 2 years ago

Power Analysis by Data Simulation in R - Part IV | Julian Quandt

This final part extends the previous part by generalizing data simulation linear mixed-effects models.

https://julianquandt.com/post/power-analysis-by-data-simulation-in-r-part-iv/

piolho-eefe commented 2 years ago

Hi Julian, I have the situation where I want to know the necessary sample size for a "complex" model to fit the data better than a simpler one. In a previous paper, I used the BIC to compare the models. Can I, using the resulting parameters of the previous paper (including the residuals), know how many persons I need to have a probability of 90% of BIC favoring the more complex model? The model (in matlab) is the one below

%% Simulating power for the logistic curve

%Estimated paremeters of the previous paper a = 3.375; b = 11.42; p0 = 12.84; p1 = 0.758; %Standard deviation of the residuals e = 2.6849;

%just creating the equation to be fitted fo = fitoptions('Method', 'NonlinearLeastSquares',... 'Lower', [0, 0, 10, 0], 'Upper', [16, 16, 16, Inf], 'StartPoint', [2, 10, 12, 1]); logist = fittype('c1 + ((c2 - c1) / (1 + exp(-(x - p0) / p1)))', 'independent', 'x', 'options', fo);

%Broadly creating the sample/BIC percentage curve clear BIC %Creating a counter for saving each n count = 1; for n = 10:100:510 %Running 1000 iterations for it = 1:1000 %Just checking where the code is [n, it] %Creating the random independent variable fms = random('Discrete Uniform', 11, n, 1) + 5; %Creating the model using the parameters and some noise from the residuals tms = a + ((b + a)./(1 + exp(-(fms - p0)/p1))) + random('Normal', 0, e, n, 1); %Fitting the logistic curve [f, g] = fit(fms, tms, logist, fo); %Fitting the linear curve [flin, glin] = fit(fms, tms, 'poly1'); %Calculating the BIC (based on Shumway & Stoffer, 2011) sigma(1)=glin.sse/87; BIClin(it, count)=log(sigma(1))+(2log(87))/87; sigma(2)=g.sse/87; BIClog(it, count)=log(sigma(2))+(4*log(87))/87; end count = count + 1; end

%More detailed curve (within 50 and 150) clear BIC count = 1; for n = 50:110 for it = 1:1000 [n, it] fms = random('Discrete Uniform', 11, n, 1) + 5; tms = a + ((b + a)./(1 + exp(-(fms - p0)/p1))) + random('Normal', 0, e, n, 1); [f,g]=fit(fms, tms, logist, fo); [flin,glin]=fit(fms, tms, 'poly1'); sigma(1)=glin.sse/87; BIClin(it, count)=log(sigma(1))+(2log(87))/87; sigma(2)=g.sse/87; BIClog(it, count)=log(sigma(2))+(4*log(87))/87; end count = count + 1; end

julianquandt commented 2 years ago

Hi piolho-eefe. Sorry for my delayed reply. I think the notifications about posts here might end up in my spam - I'll have to change that.

In principle what you suggest should be possible. You can simulate the simple and the complex model and on each iteration compare them with BIC and see if the complex model is favored over the simple one, just as you could look at the p-value on every iteration. If you calculate the percentage of favoring the complex model over the simple one, you get the power for the BIC copmarison at a given sample size.

Kalil-Manara-from-UFRGS-Brazil commented 1 year ago

Should we add the genre random slope for songs as well? Try to think about this for a second. What would it mean to add the random slope for genre across songs here? It would mean that we would expect a certain song to differ in its propensity to be liked as a pop song compared to being liked as a rock song. Clearly, this does not make sense as a song is either rock or pop but cannot “switch” genre from rock to pop or vice versa. Thus, genre is a within-participant factor here but it is between-songs meaning that we cannot add a random slope here as it does not make sense.

I wonder if a random slope for genre across songs wouldn't really mean that we expect the impact of genre to change between songs. That would be a pretty plausible hypothesis, since, for example, there are songs that are less compliant to genre classification. Say for example, we have "Somebody Told Me" (from The Killers) at the Rock list. That would be arguably less prone to a genre effect than ACDC, right?

I'm new to Mixed Effects, so I may be making some confusion here, anyway.

Thank you, that's a very nice tutorial!

julianquandt commented 1 year ago

Hi Kalil,

Thanks for the compliments and for the question. I agree with you that music genres are often not well defined as the song you mention nicely examplifies. A statistical model is, after all, always a model that tries to simplify reality to enable us to understand it and that demands some sacrifices in how accurately it describes reality. We make such a sacrifice by demanding of the songs that they can be clearly classified into either Pop or Rock songs. I see that I have not chosen the clearest example (maybe genres like heavy metal and techno would be a better example though even there it might sometimes be unclear), but maybe that also makes it a educational example as these are uncertainties we need to deal with when building mixed models.

Thus, theoretically you are corrent in that it might not always be clear for songs to which genre they belong. However, in the model itself we just classify them into clear-cut cases, even if that might not only be true, as each song is (in this example) either taken from a pop list or rock list, and that "where is it taken from?" defines the genre in this example, not what it might be perceived as. We could do that theoretically, but we would need another design in which we can actually expect a song to be classified as rock and pop, for instance if we ask people to classify the songs themselves instead of inferring their genre from spotify playlists.

I hope this helps!

wk1879 commented 10 months ago

Hey Julian,

Thank you for this wonderful tutorial! It's really helpful for the beginers who want to jump on the "simulation" train.

I tried to follow each row of your words and was stuck at somewhere:

1) When defining the VCOV matrix for the random effect of songs, "sd_u0sd_w1corr_w01" was mistakenly written as "sd_u0sd_w1corr_u01" (e.g., line 985 in Rmd file).

2) It'd better to relevel the "genre" and "instrument" variables in order to successfully generate the DV by using your "add_contrast_vars" function. Taken "genre" as an example (levels = c("pop", "rock", "classic")), in default the function sets the first level ("pop") as the reference level, while the formular for generating DV uses the last levels ("classic") as the reference.

I hope this comment is also helpful for others.

Cheer! Kai

julianquandt / julianquandt.website

post/power-analysis-by-data-simulation-in-r-part-iv/ #3

Power Analysis by Data Simulation in R - Part IV | Julian Quandt