kgoldfeld / simstudy

simstudy: Illuminating research methods through data generation
https://kgoldfeld.github.io/simstudy/
GNU General Public License v3.0

Best practice for testing probabilistic functions (genMarkov testing) #157

Closed: elinsooon closed this issue 2 years ago

elinsooon commented 2 years ago

What is the best practice for testing whether a function that relies on probability works? Obviously I can set a seed and ensure that rows and columns match an identical data.table, but that doesn't seem like it's actually testing how the function works.

I'm asking specifically with regard to genMarkov. What might be a good game plan for testing that this function works?

kgoldfeld commented 2 years ago

This is a good question - I'll need to think about it.

kgoldfeld commented 2 years ago

One important property of a Markov chain (one without an absorbing state, such as death, that the process cannot leave) is that the long-run proportion of time spent in each state can be determined directly from the transition matrix (call it P). In particular, the state probabilities are given by the rows of $P^k$, where k is a very large number. (Here's a presentation I found online that may or may not be helpful.)

So one idea for a possible check would be to generate a few long chains, look at the distribution of each with respect to the states, and compare that to the theoretical probabilities.
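The long-chain comparison described above could be sketched roughly as follows. This is plain Python using only the standard library, not simstudy's genMarkov; the helper names (`mat_power`, `simulate_chain`), the example matrix, the chain length, and the seed are all assumptions chosen for illustration.

```python
import random

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(p, k):
    """Compute P^k by repeated squaring."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in p]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result

def simulate_chain(p, length, start, rng):
    """Simulate one chain of the given length (states are 0-based indices)."""
    states = list(range(len(p)))
    chain = [start]
    for _ in range(length - 1):
        chain.append(rng.choices(states, weights=p[chain[-1]])[0])
    return chain

# Hypothetical 3x3 transition matrix with no absorbing state.
P = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]

rng = random.Random(2024)  # fixed seed so the check is reproducible
chain = simulate_chain(P, 50_000, start=0, rng=rng)

# Observed share of time in each state vs. a row of P^k for large k.
empirical = [chain.count(s) / len(chain) for s in range(len(P))]
theoretical = mat_power(P, 64)[0]
max_gap = max(abs(e - t) for e, t in zip(empirical, theoretical))

print("empirical:  ", [round(x, 3) for x in empirical])
print("theoretical:", [round(x, 3) for x in theoretical])
print("largest gap:", round(max_gap, 4))
```

A test would then assert that `max_gap` stays below a tolerance; the tolerance has to be loose enough to absorb sampling noise, so it trades off against chain length.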

Other possibilities include (1) ensuring that the correct number of events is generated, and (2) that the number of categories matches the dimensions of the transition matrix.

If we could implement these three tests, I think that would take care of it.

elinsooon commented 2 years ago

I think I did the $P^k$ check, as well as (1). Could you clarify what you mean by (2)? My next step is to run a check on non-wide tables using the same matrices, to ensure the data in those is also correct, extending the tests done on the wide genMarkov output.

kgoldfeld commented 2 years ago

For (2), the number of distinct states observed in the data should equal the number of states implied by the transition matrix. So, if we have a 3x3 transition matrix implying 3 states, then the number of states actually observed in the data should also be 3.
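Checks (1) and (2) could look something like the sketch below. It uses a hypothetical pure-Python stand-in (`simulate_chains`) rather than genMarkov itself, and the matrix, number of chains, chain length, and seed are assumptions.

```python
import random

def simulate_chains(p, n_chains, chain_length, rng):
    """Hypothetical stand-in for a Markov generator; states are labeled 1..n."""
    states = list(range(1, len(p) + 1))
    chains = []
    for _ in range(n_chains):
        current = rng.choice(states)  # arbitrary starting state
        chain = [current]
        for _ in range(chain_length - 1):
            current = rng.choices(states, weights=p[current - 1])[0]
            chain.append(current)
        chains.append(chain)
    return chains

P = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]

n_chains, chain_length = 20, 250
chains = simulate_chains(P, n_chains, chain_length, random.Random(157))

# (1) The correct number of events is generated.
total_events = sum(len(c) for c in chains)
assert total_events == n_chains * chain_length

# (2) The number of distinct observed states matches the matrix dimension.
observed_states = {s for c in chains for s in c}
assert len(observed_states) == len(P)
```

Check (1) is deterministic. Check (2) is only probabilistic, so it relies on a well-behaved matrix and enough simulated steps.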

elinsooon commented 2 years ago

Ok got it. This logic implies that all potential states must be represented in the data, is that true? A matrix could be 4x4 but one of the states could be unreachable due to the probabilities in the matrix, so only 3 states would be observed. Is this too niche a case to consider?

kgoldfeld commented 2 years ago

Yes - that is a good observation. But if the transition matrix is well behaved (i.e. all the possible states are realistically attainable, say with each steady-state probability exceeding 10%) and the chain is long enough (say, 250), then the probability that a particular state is never reached is vanishingly small. So the key in the test is to select a reasonable transition matrix and make sure that the chain is long enough, that there are enough individual chains, or both.
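To put a number on "vanishingly small", the chance that a chain never visits a given state can be computed exactly: drop that state's row and column from the transition matrix and propagate the probability of avoiding it step by step. The sketch below assumes a hypothetical 3x3 matrix and a uniform starting distribution; the function name is illustrative and not part of simstudy.

```python
def prob_state_never_visited(p, target, chain_length, start_dist):
    """Exact probability that a chain of the given length never visits `target`.

    `p` is the transition matrix (list of rows), `target` is a 0-based state
    index, and `start_dist` holds the initial state probabilities.
    """
    keep = [i for i in range(len(p)) if i != target]
    # survive[a] = P(avoid target over the remaining transitions | now in keep[a])
    survive = [1.0] * len(keep)
    for _ in range(chain_length - 1):
        survive = [sum(p[i][j] * survive[b] for b, j in enumerate(keep))
                   for i in keep]
    return sum(start_dist[i] * survive[a] for a, i in enumerate(keep))

# Hypothetical matrix; the last state is the least-visited one in the long run.
P = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
uniform_start = [1 / 3, 1 / 3, 1 / 3]

p_miss = prob_state_never_visited(P, target=2, chain_length=250,
                                  start_dist=uniform_start)
print(f"P(state never visited in 250 steps) = {p_miss:.3e}")
```

For a state that truly cannot be reached from the starting state, as in the 4x4 case raised earlier in the thread, the same function returns 1, which gives a way to flag matrices that are unsuitable for the state-count test.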