kiante-fernandez opened 8 months ago
I think there are different ways to approach these types of models. Some time ago, Dominique made this example. Do you think there is an advantage to your approach?
FYI, you can simplify your code with `choices[i], rts[i] = rand(dist)`.
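For instance, a minimal sketch of the simplified loop (assuming `rand(dist)` on an `SSM2D` yields a choice and reaction-time pair, as it is used elsewhere in this thread; the parameter values are arbitrary):

```julia
using SequentialSamplingModels

# Draw single trials in a loop, destructuring each draw into choice and rt.
dist = LBA(; ν = [1.0, 2.0], A = 0.8, k = 0.2, τ = 0.3)
n_trials = 10
choices = fill(0, n_trials)
rts = fill(0.0, n_trials)
for i in 1:n_trials
    choices[i], rts[i] = rand(dist)
end
```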
That said, perhaps it would be a useful addition to have a data-simulating function `data_simulate(LBA; ...)` to generate more complex data. This would make the package a one-stop solution for validation & testing. Such a function should obviously be limited to a few options, but it could at least cover things like:

- `n_conditions`: this argument would generate groups of data (e.g., the effect of condition B vs. A as in the example) and would be associated with a vector of coefficients that represents the "effect". For instance, `data_simulate(LBA; v=[1, 2], n_conditions=2, condition_effect=[[-1, 0.5]])` would generate a dataframe with RT, choice, and a Condition column with 0 and 1; the drift parameters of condition 0 would be `1, 2` and those of condition 1 would be `0, 2.5` (base + the effect).
- `n_groups`: this argument would generate groups of data (e.g., multiple participants) and would be associated with a parameter that drives the standard deviation between groups for the parameters, e.g., `data_simulate(LBA; v=[1, 2], n_groups=20, groups_deviation=[[0.1, 0.3]])`.

I think with these two options one could already cover a lot of scenarios and allow for easier construction of even more complex designs.
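For concreteness, a rough sketch of how the `n_conditions` behavior could work, assuming the effect is added to the drift rates as described. `data_simulate` and all of its keywords are hypothetical here, not part of the package, and the remaining LBA parameters are fixed for brevity:

```julia
using SequentialSamplingModels

# Hypothetical sketch of the n_conditions proposal: condition 0 uses the base drift
# rates v; condition c uses v .+ condition_effect[c]. All names are illustrative only.
function data_simulate(model; v, n_conditions, condition_effect, n_trials = 100)
    condition = Int[]
    choice = Int[]
    rt = Float64[]
    for c in 0:(n_conditions - 1)
        ν = c == 0 ? v : v .+ condition_effect[c]
        dist = model(; ν, A = 0.8, k = 0.2, τ = 0.3)   # other parameters fixed for brevity
        for _ in 1:n_trials
            ch, t = rand(dist)
            push!(condition, c)
            push!(choice, ch)
            push!(rt, t)
        end
    end
    return (; condition, choice, rt)
end

data = data_simulate(LBA; v = [1.0, 2.0], n_conditions = 2, condition_effect = [[-1.0, 0.5]])
```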
I understand the desire for such a function, but one problem with your proposal is that it is too restrictive for a package. From what I can tell, the function has the following limitations: (1) observations per condition are fixed to the same number, (2) the effect only applies to the drift rate, (3) the distributions over the parameters are hard coded, (4) the user is forced to use DataFrames, (5) it's not clear to me how it is composable (i.e. forms the basis for more complex scenarios). However, making a more general function would be much more challenging. I've considered numerous options in the past, but none have emerged as a satisfactory solution. My recommendation for the time being would be to define a function in the documentation example. It is quite useful in that narrow use case.
General code for hierarchical/multilevel data generation:

```julia
# Sample one NamedTuple of parameters for a given condition index.
function sample_parms(group_dists, parm_names, cond_idx)
    parm_vals = map(d -> rand(d[cond_idx]), values(group_dists))
    return NamedTuple{parm_names}(parm_vals)
end

# Sample parameters for all subjects within a single condition.
function sample_condition_parms(group_dists, parm_names, n_subjects, cond_idx)
    return [sample_parms(group_dists, parm_names, cond_idx) for s ∈ 1:n_subjects]
end

# Sample parameters for all subjects in all conditions; returns an array with the
# same shape as the condition design, each cell holding per-subject NamedTuples.
function sample_subject_parms(
    model::Type{<:SSM2D},
    group_dists,
    n_subjects,
)
    parm_names = keys(group_dists)
    return [sample_condition_parms(group_dists, parm_names, n_subjects, c) for c ∈ CartesianIndices(group_dists[1])]
end
```
```julia
# Simulate data for all subjects and conditions from pre-sampled subject parameters.
# `subject_parms` is the array returned by `sample_subject_parms`. Note: the original
# draft referenced `group_dists` inside this method even though only `subject_parms`
# is in scope; the condition shape is derived from `subject_parms` instead.
function simulate_multilevel(
    model::Type{<:SSM2D},
    subject_parms,
    n_subjects,
    n_trials;
    factor_names = Tuple(map(s -> Symbol("factor_$s"), 1:ndims(subject_parms))),
    subj_start_index = 1
)
    n_total = sum(n_subjects .* n_trials)
    choice = fill(0, n_total)
    condition = fill(0, n_total, ndims(subject_parms))
    subject_id = fill(0, n_total)
    condition_indices = CartesianIndices(subject_parms)
    rt = fill(0.0, n_total)
    cond_idx = 1
    cnt = 1
    subj_end_index = n_subjects + subj_start_index - 1
    for c ∈ condition_indices
        for s ∈ subj_start_index:subj_end_index
            parms = subject_parms[cond_idx][s]
            dist = model(; parms...)
            for t ∈ 1:n_trials[cond_idx]
                choice[cnt], rt[cnt] = rand(dist)
                condition[cnt, :] .= c.I
                subject_id[cnt] = s
                cnt += 1
            end
        end
        cond_idx += 1
    end
    # reverse the condition columns so factor_1 maps to the last design dimension
    n_factors = length(factor_names) + 1
    factors = (factor_names[i] => condition[:, n_factors - i] for i in 1:length(factor_names))
    return (; factors..., subject_id, choice, rt)
end

# Convenience method: sample subject parameters from the group distributions, then simulate.
function simulate_multilevel(
    model::Type{<:SSM2D},
    group_dists::NamedTuple,
    n_subjects,
    n_trials;
    factor_names = Tuple(map(s -> Symbol("factor_$s"), 1:ndims(group_dists[1]))),
    subj_start_index = 1
)
    subject_parms = sample_subject_parms(model, group_dists, n_subjects)
    return simulate_multilevel(model, subject_parms, n_subjects, n_trials; factor_names, subj_start_index)
end
```
Example usage (note: `n_trials` is set to 100 per condition so the totals match the 20000-row output shown below):

```julia
using SequentialSamplingModels, Distributions, DataFrames

n_subjects = 50
n_trials = fill(100, 4)   # 100 trials per subject in each of the 4 conditions
n_cond = (2, 2)
group_dists = (
    ν = [
        MvNormal([1, 1], [0.2 0; 0 0.2]) MvNormal([3, 1], [0.2 0; 0 0.2])
        MvNormal([2, 1], [0.2 0; 0 0.2]) MvNormal([4, 1], [0.2 0; 0 0.2])
    ],
    A = fill(Normal(0.8, 0.2), n_cond),
    k = fill(Normal(0.2, 0.1), n_cond),
    τ = fill(truncated(Normal(0.2, 0.1), 0, Inf), n_cond)
)
data = simulate_multilevel(LBA, group_dists, n_subjects, n_trials)
# convert to a dataframe if you would like
df = DataFrame(data)
```
Example output:

```
20000×5 DataFrame
   Row │ factor_1  factor_2  subject_id  choice  rt
       │ Int64     Int64     Int64       Int64   Float64
───────┼──────────────────────────────────────────────────
     1 │        1         1           1       2  0.353357
     2 │        1         1           1       1  0.323123
     3 │        1         1           1       1  0.303511
     4 │        1         1           1       1  0.512396
     ⋮ │    ⋮         ⋮          ⋮        ⋮        ⋮
 19998 │        2         2          50       1  0.321626
 19999 │        2         2          50       1  0.528021
 20000 │        2         2          50       1  0.376962
                                   19993 rows omitted
```
```julia
combine(groupby(df, [:factor_1, :factor_2]), :choice => x -> mean(x .== 1))
```

Accuracies are correctly ordered:

```
4×3 DataFrame
 Row │ factor_1  factor_2  choice_function
     │ Int64     Int64     Float64
─────┼─────────────────────────────────────
   1 │        1         1          0.52608
   2 │        1         2          0.75488
   3 │        2         1          0.85032
   4 │        2         2          0.91176
```
TODO:
1. Consider separating generation of parameters from data generation, or also output parameters as separate NamedTuple.
2. Generalize indexing of conditions for factorial designs.
@DominiqueMakowski, I think the code above checks most of the boxes. The example is a within-subjects 2X2 factorial design which affects one of the drift rates. The setup works for any arbitrary within-subjects design affecting any parameter. If you want to add a between-subjects condition, you can call `simulate_multilevel` again and pass the appropriate inputs along with the keyword `subj_start_index` set to the next subject index. The output can be transformed to a DataFrame and the between-subjects conditions can be concatenated. You can optionally define names for the factors with the keyword `factor_names`. If you want the sampled parameters, you can generate those outside of `simulate_multilevel` with `sample_subject_parms` and pass the subject parameters to `simulate_multilevel`.
The main unknown is how the data works with Turing. Iterating over a flat data structure is convenient, but it is not always the most efficient. One option is to restructure the data in cases where speed is critical.
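As a small illustration of such restructuring (plain Julia, no Turing required; the helper name is hypothetical), the flat output can be regrouped by subject so a model loops over subjects instead of rows:

```julia
# Hypothetical helper: regroup the flat NamedTuple output by subject_id so that
# downstream code can iterate per subject rather than per trial.
function group_by_subject(data)
    ids = unique(data.subject_id)
    return map(ids) do id
        idx = findall(==(id), data.subject_id)
        (; subject_id = id, choice = data.choice[idx], rt = data.rt[idx])
    end
end

# Tiny flat dataset in the same shape as simulate_multilevel's output.
data = (; subject_id = [1, 1, 2, 2], choice = [1, 2, 1, 1], rt = [0.4, 0.5, 0.3, 0.6])
grouped = group_by_subject(data)
```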
I need to make some tweaks to the design. There are a few snags to work through.
As always, you took a rough thought and turned it into something well thought out and well designed, *hats off*!
I appreciate the vote of confidence. I fixed one issue and realized there is another one. Basically, I need to figure out how to handle conditions for vector parameters. For example, if `nu = [2, 1]`, how do I select only one of the elements to vary?
I'm not sure I understand your question. In my mind, if `nu = [2, 1]`, you could set `nu_group = [1, 0]` to get only the first accumulator to vary. But in your example above, if I understand correctly, both groups must be specified in absolute terms, i.e., not as a relative difference as in a regression-like approach?
The code above works well for between-subjects designs, but adapting it for within-subjects designs is more challenging.
The example above is a 2X2 between-subjects design where the distribution of the first drift rate is affected by the experimental manipulation. If I give you the parameters for the first individual in each group, we have:
```
4-element Vector{@NamedTuple{ν::Vector{Float64}, A::Float64, k::Float64, τ::Float64}}:
 (ν = [0.8199582755225441, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
 (ν = [1.1185011250253114, 1.4099358614685493], A = 1.248192968252697, k = 0.3924284083579383, τ = 0.07018350969172629)
 (ν = [2.0154737885699694, 0.24569280592047238], A = 0.8143113936695744, k = 0.29121205673366396, τ = 0.08830768540311051)
 (ν = [3.5445628997504466, 1.2792388125567997], A = 1.1797017598956194, k = 0.07595608876702825, τ = 0.10324520656785674)
```
These are four independent individuals from the four different groups. The first drift rate is sampled from different distributions, but the other parameters come from the same distribution. In a within-subjects design, the first individual would look like this:
```
 (ν = [0.8199582755225441, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
 (ν = [1.1185011250253114, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
 (ν = [2.0154737885699694, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
 (ν = [3.5445628997504466, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
```
In other words, some parameters need to be fixed, and others need to vary. The challenge is doing that both for scalars such as `A` and for vectors such as `ν`. I think it will be easy to support the between-subjects-only design, but adding within-subjects designs is more difficult, especially when dealing with mixed parameter types.
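One possible way to approach the scalar-vs-vector issue, sketched in plain Julia: sample a base NamedTuple once per subject, then overwrite only the varying entries per condition. The helper and its conventions below are hypothetical, not part of the package:

```julia
# Hypothetical sketch: hold most parameters fixed within a subject, vary selected ones
# across within-subject conditions. `varying` maps a parameter name either to a vector
# of per-condition scalar values, or to an (index => values) Pair selecting which
# element of a vector-valued parameter varies.
function within_subject_parms(base::NamedTuple, varying::NamedTuple, n_conditions)
    return map(1:n_conditions) do c
        parms = base
        for (name, spec) in pairs(varying)
            if spec isa Pair                       # vector parameter, e.g. ν = 1 => [...]
                idx, vals = spec
                v = copy(parms[name])
                v[idx] = vals[c]
                parms = merge(parms, NamedTuple{(name,)}((v,)))
            else                                   # scalar parameter, e.g. A = [...]
                parms = merge(parms, NamedTuple{(name,)}((spec[c],)))
            end
        end
        return parms
    end
end

# The first subject from the within-subjects pattern above: only ν[1] varies per condition.
base = (ν = [0.82, 1.23], A = 0.71, k = 0.38, τ = 0.30)
conds = within_subject_parms(base, (ν = (1 => [0.82, 1.12, 2.02, 3.54]),), 4)
```

The same `varying` NamedTuple could carry several entries at once, mixing scalar and vector parameters.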
I have been thinking about more general use cases, and one common thing I like to do is simulate data at the trial level or for multiple subjects. As an enhancement, I suggest we showcase these examples, perhaps under a new tab in the documentation titled "Simulating Data." Here is a quick working example of the trial-level idea.
I can write this out along with a hierarchical example.
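A minimal sketch of what such a trial-level simulation could look like, assuming the drift rates vary trial by trial with a covariate; the function name, the covariate `x`, and the regression coefficients are all illustrative, not package API:

```julia
using SequentialSamplingModels

# Hypothetical trial-level simulation: the drift rates change on every trial as a
# linear function of a trial-level covariate x (all names here are illustrative).
function simulate_trial_level(model; x, β0 = [1.0, 2.0], β1 = [0.5, 0.0], A = 0.8, k = 0.2, τ = 0.3)
    n = length(x)
    choice = fill(0, n)
    rt = fill(0.0, n)
    for t in 1:n
        ν = β0 .+ β1 .* x[t]       # trial-specific drift rates
        dist = model(; ν, A, k, τ)
        choice[t], rt[t] = rand(dist)
    end
    return (; x, choice, rt)
end

data = simulate_trial_level(LBA; x = randn(200))
```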