Simulating trial-level and hierarchical data

kiante-fernandez commented 8 months ago

I have been thinking about more general use cases, and one common thing I like to do is simulate data at the trial level or for multiple subjects. As an enhancement, I suggest we showcase these examples, perhaps under a new tab in the documentation titled "Simulating Data." Here is a quick working example of the trial level idea.

using SequentialSamplingModels
using Plots
using Random
using Distributions 
Random.seed!(2024)

#simulating trial level effects via regressions models

# Set up trial by trial parameters
intercepts = [0.3, 0.4, 0.5, 0.6]
x = rand(Uniform(1, 10), 1000) #generate sets of items values

v = (intercepts' .+ β .* x) #trail level drifts

A = 1.5
k = 0.5
τ = 0.0

Θ = hcat(v, repeat([A, k, τ]', 1000, 1))

num_rows, num_cols = size(Θ)

choices = Vector{Int64}(undef, num_rows)
rts = Vector{Float64}(undef, num_rows)

for i in 1:num_rows
    ν = Θ[i, 1:4]
    k = Θ[i, 5]
    A = Θ[i, 6]
    τ = Θ[i, 7]

    dist = RDM(;ν, k, A, τ)
    choice, rt = rand(dist, 1)

    choices[i] = choice[1]
    rts[i] = rt[1]
end

I can write this out along with a hierarchical example.

itsdfish commented 8 months ago

I think there are different ways to approach these types of models. Some time ago, Dominique made this example. Do you think there is an advantage of your approach?

itsdfish commented 8 months ago

FYI, you can simplify your code with choices[i], rts[i] = rand(dist).

DominiqueMakowski commented 8 months ago

That said, perhaps it would be a useful addition to have a data simulating function data_simulate(LBA; ...) to generate more complex data. This would make the package a one-stop solution for validation & testing. Such function should obviously be limited to a few options, but it could be at least stuff like:

n_conditions: this argument would generate groups of data (e.g., the effect of condition B vs. A as in the example), and would be associated with a vector of coefficients that represents the "effect"
- data_simulate(LBA; v=[1, 2], n_conditions=2, condition_effect=[[-1, 0.5]]): would generate a dataframe with RT and choice and a Condition column with 0 and 1; the drift parameters of condition 0 would be 1, 2 and that of condition 1 would be 0, 2.5 (+ the effect)
n_groups: this argument would generate groups of data (e.g., multiple participants) and would be associated with a parameters that drives the standard deviation between groups for the parameters
- data_simulate(LBA; v=[1, 2], n_groups=20, groups_deviation=[[0.1, 0.3]])

I think with such 2 options one could already cover a lot of scenarios and allow for easier constructions of even more complex designs

itsdfish commented 8 months ago

I understand the desire for such a function, but one problem with your proposal is that it is too restrictive for a package. From what I can tell, the function has the following limitations: (1) observations per condition are fixed to the same number, (2) the effect only applies to the drift rate, (3) the distributions over the parameters are hard coded, (4) the user is forced to use DataFrames, (5) it's not clear to me how it is composable (i.e. forms the basis for more complex scenarios). However, making a more general function would be much more challenging. I've considered numerous options in the past, but none have emerged as a satisfactory solution. My recommendation for the time being would be to define a function in the documentation example. It is quite useful in that narrow use case.

itsdfish commented 8 months ago

General code for hierarchical/multilevel data generation:

function sample_parms(group_dists, parm_names, cond_idx)
    parm_vals = map(d -> rand(d[cond_idx]), values(group_dists))
    return NamedTuple{parm_names}(parm_vals)
end

function sample_condition_parms(group_dists, parm_names, n_subjects, cond_idx)
    return [sample_parms(group_dists, parm_names, cond_idx) for s ∈ 1:n_subjects]
end

function sample_subject_parms(
    model::Type{<:SSM2D}, 
    group_dists, 
    n_subjects, 
)
    parm_names = keys(group_dists)

    return [sample_condition_parms(group_dists, parm_names, n_subjects, c) for c ∈ CartesianIndices(group_dists[1])]
end

function simulate_multilevel(
    model::Type{<:SSM2D}, 
    subject_parms, 
    n_subjects,
    n_trials; 
    factor_names = Tuple(map(s -> Symbol("factor_$s"), 1:ndims(group_dists[1]))),
    subj_start_index = 1
)
    n_total = sum(n_subjects .* n_trials)
    parm_names = keys(group_dists)
    choice = fill(0, n_total)
    condition = fill(0, n_total, ndims(group_dists[1])) 
    subject_id = fill(0, n_total)
    condition_indices = CartesianIndices(group_dists[1])
    rt = fill(0.0, n_total)
    cond_idx = 1
    cnt = 1
    subj_end_index = n_subjects + subj_start_index - 1
    for c ∈ condition_indices
        for s ∈ subj_start_index:subj_end_index
            parms = subject_parms[cond_idx][s]
            dist = model(; parms...)
            for t ∈ 1:n_trials[cond_idx] 
                choice[cnt], rt[cnt] = rand(dist)
                condition[cnt,:] .= c.I
                subject_id[cnt] = s
                cnt += 1
            end
        end
        cond_idx += 1
    end
    n_factors = length(factor_names) + 1
    factors = (factor_names[i] => condition[:,n_factors - i] for i in 1:length(factor_names))
    return (;factors..., subject_id, choice, rt)
end

function simulate_multilevel(
    model::Type{<:SSM2D}, 
    group_dists::NamedTuple, 
    n_subjects,
    n_trials; 
    factor_names = Tuple(map(s -> Symbol("factor_$s"), 1:ndims(group_dists[1]))),
    subj_start_index = 1
)
    subject_parms = sample_subject_parms(model, group_dists, n_subjects) 
    return simulate_multilevel(model, subject_parms, n_subjects, n_trials; factor_names, subj_start_index)
end

Example usage:

n_subjects = 50
n_trials = fill(1000, 4)
n_cond = (2,2)
group_dists = (
    ν = [
        MvNormal([1,1], [.2 0; 0 .2]) MvNormal([3,1], [.2 0; 0 .2]) 
        MvNormal([2,1], [.2 0; 0 .2]) MvNormal([4,1], [.2 0; 0 .2])  
    ],
    A = fill(Normal(.8, .2), n_cond),
    k = fill(Normal(.2, .1), n_cond),
    τ = fill(truncated(Normal(.2, .1), 0, Inf), n_cond)
)
data = simulate_multilevel(LBA, group_dists, n_subjects, n_trials)
# convert to dataframe if you would like 
df = DataFrame(data)

Example output:

20000×5 DataFrame
   Row │ factor_1  factor_2  subject_id  choice  rt       
       │ Int64     Int64     Int64       Int64   Float64  
───────┼──────────────────────────────────────────────────
     1 │        1         1           1       2  0.353357
     2 │        1         1           1       1  0.323123
     3 │        1         1           1       1  0.303511
     4 │        1         1           1       1  0.512396
   ⋮   │    ⋮         ⋮          ⋮         ⋮        ⋮
 19998 │        2         2          50       1  0.321626
 19999 │        2         2          50       1  0.528021
 20000 │        2         2          50       1  0.376962
                                        19993 rows omitted

combine(groupby(df, [:factor_1,:factor_2]), :choice => x -> mean(x .== 1))

Accuracies are correctly ordered:

4×3 DataFrame
 Row │ factor_1  factor_2  choice_function 
     │ Int64     Int64     Float64         
─────┼─────────────────────────────────────
   1 │        1         1          0.52608
   2 │        1         2          0.75488
   3 │        2         1          0.85032
   4 │        2         2          0.91176

TODO:

~~1. Consider separating generation of parameters from data generation, or also output parameters as separate NamedTuple.~~ ~~2. Generalize indexing of conditions for factorial designs.~~

itsdfish commented 8 months ago

@DominiqueMakowski, I think the code above checks most of the boxes. The example is a within subjects 2X2 factorial design which affects one of the drift rates. The setup works for any arbitrary within subjects design affecting any parameter. If you want to add a between subjects condition, you can call simulate_multilevel again and pass the appropriate inputs along with the keyword subj_start_index set to the next subject index. The output can be transformed to a DataFrame and the between subject conditions can be concatenated. You can optionally define names for the factors with the keyword factor_names. If you want the sampled parameters, you can generate those outside of simulate_multilevel with sample_subject_parms and pass the subject parameters to simulate_multilevel .

The main unknown is how the data works with Turing. Iterating over a flat data structure is convenient, but it is not always the most efficient. One option is to restructure the data in cases where speed is critical.

I need to make some tweaks to the design. There are a few snags to work through.

DominiqueMakowski commented 8 months ago

As always, you took a rough thought and you're turning that into something well-thought and designed, *hats off*

itsdfish commented 8 months ago

I appreciate the vote of confidence. I fixed one issue and realized there is another one. Basically, I need to figure how to handle conditions for vector parameters. For example, if nu = [2,1], how do I select only one of the elements to vary?

DominiqueMakowski commented 8 months ago

I'm not sure I understand, your question. In my mind if nu = [2, 1], you could set nu_group [1, 0] to get only the first accumulator to vary but in your example above if I understand both groups must be specified in absolute terms? i.e., not as a relative difference as in a regression-like approach

itsdfish commented 8 months ago

The code above works well for between subject designs, but adapting to for within subject designs is more challenging. S

The example above is a 2X2 between subject design where the distribution for the first drift rate is affected by the experimental manipulation. If I give you the parameters for the first individual in each group we have:

4-element Vector{@NamedTuple{ν::Vector{Float64}, A::Float64, k::Float64, τ::Float64}}:
 (ν = [0.8199582755225441, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
 (ν = [1.1185011250253114, 1.4099358614685493], A = 1.248192968252697, k = 0.3924284083579383, τ = 0.07018350969172629)
 (ν = [2.0154737885699694, 0.24569280592047238], A = 0.8143113936695744, k = 0.29121205673366396, τ = 0.08830768540311051)
 (ν = [3.5445628997504466, 1.2792388125567997], A = 1.1797017598956194, k = 0.07595608876702825, τ = 0.10324520656785674)

These are four independent individuals from the four different groups. The first drift rate is sampled from different distributions, but the other parameters come from the same distribution. In a within subjects, design, the first individual would like this:

(ν = [0.8199582755225441, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
(ν = [1.1185011250253114, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
(ν = [2.0154737885699694, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)
(ν = [3.5445628997504466, 1.2307055679022265], A = 0.709053974596606, k = 0.3823176118334306, τ = 0.30371883390220333)

In other words, some parameters need to be fixed, and others need to vary. The challenge is doing it for scalars such as A and for vectors such as ν. I think it will be easy to support the between subject only design, but adding within subject design is more difficult, especially when dealing with mixed parameter types.

itsdfish / SequentialSamplingModels.jl

Simulating trial-level and hierarchical data #58