Understanding the Model Fitting Process in Hesim

swaheera commented 1 year ago

Hello Dr. Incerti,

I was reading about your R package 'hesim' and thought it was really interesting!

In particular, I was reading the following link (https://hesim-dev.github.io/hesim/articles/mlogit.html).

I am trying to understand, procedurally, how multinomial logistic regression models are used to estimate the transition probabilities of discrete-time Markov chains.

Suppose there is a dataset with 3 States (State A, State B, State C). Each row in this dataset is an individual medical patient and contains some information on their covariates (e.g. height, weight, blood pressure, etc.), the state they were in and the state they transitioned to - my understanding is the following:

1. Isolate the subset of rows where the initial state is State A.
2. Fit a multinomial logistic regression to this subset. The inputs are the covariates and the response is the state the patient transitioned to ("A", "B" or "C"). This gives general equations for the probability that a patient currently in State A moves to each of the 3 states, given their covariate vector. It also shows the effect of each covariate on the transition probabilities and whether those effects are statistically significant.
3. Repeat these two steps for the other two states (i.e. isolate the subset where the initial state is State B, and so on) and fit a multinomial logistic regression to each.
4. In the end, you have a 3 x 3 transition matrix, with equations (as described above) that estimate each transition probability for a given vector of covariates.
5. Based on these transition probabilities, you can perform the standard Markov chain calculations. For example, given an initial probability distribution vector, what is the probability that the chain is in State B after "k" iterations? We can also run simulations to estimate these probabilities, their "spreads", and the time taken to "absorption" (e.g. if State C is an absorbing state such as "Death").
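
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the hesim API: the coefficients are made up rather than fitted, the two covariates (age and systolic blood pressure) are hypothetical, and State C is treated as absorbing so it needs no model of its own.

```python
import math

STATES = ["A", "B", "C"]

# Hypothetical (not fitted) coefficients, one multinomial logit per origin state.
# COEFS[origin][destination] = [intercept, beta_age, beta_sbp].
# Staying in the origin state is the reference category (all zeros).
COEFS = {
    "A": {"A": [0.0, 0.0, 0.0], "B": [-2.0, 0.02, 0.005], "C": [-4.0, 0.03, 0.01]},
    "B": {"A": [-1.5, -0.01, 0.0], "B": [0.0, 0.0, 0.0], "C": [-2.5, 0.03, 0.01]},
}


def transition_row(origin, x):
    """Softmax probabilities of moving from `origin` to each state, given covariates x."""
    if origin not in COEFS:
        # Assumption: a state with no model is absorbing (stays put with probability 1).
        return [1.0 if s == origin else 0.0 for s in STATES]
    design = [1.0] + list(x)  # prepend the intercept term
    scores = [sum(b * v for b, v in zip(COEFS[origin][dest], design)) for dest in STATES]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(x):
    """Stack one row per origin state into a 3 x 3 transition matrix."""
    return [transition_row(origin, x) for origin in STATES]


def propagate(dist, P, k):
    """State distribution after k steps: dist multiplied by P, k times."""
    n = len(P)
    for _ in range(k):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


# A hypothetical patient: age 60, systolic blood pressure 130, starting in State A.
P = transition_matrix([60.0, 130.0])
dist_k = propagate([1.0, 0.0, 0.0], P, 5)
print(dict(zip(STATES, dist_k)))
```

Because the matrix depends on the covariate vector, each patient (or cohort) gets its own transition matrix, and the k-step distribution is just repeated multiplication by that matrix.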

Is my understanding of the above correct?

Your Help Is Greatly Appreciated, Thanks, S

dincerti commented 1 year ago

Hi @swaheera , apologies for the delayed response. Everything you wrote above is correct.

swaheera commented 1 year ago

@dincerti: thank you for your reply! I have been sick lately and have not been able to reply :(

I was just wondering - have you heard of the "msm" package in R? Would you say that your approach is similar to such approaches? I am interested in learning about how covariates can be used to model the transition probabilities of Discrete Markov Chains.

Thank you so much!

dincerti commented 1 year ago

@swaheera the msm package is great. We show how to use it to parameterize a model simulated with hesim in our preprint (see section 4.3 here).

swaheera commented 1 year ago

@dincerti : Thank you for your reply! I think the work you are doing is really great. A lot of people in the world are working with data that could really benefit from these kinds of models, but the documentation/software is either not available or far too technical for the average user.

As an example, here is a problem that I am working on:

Suppose there is a system in which individuals are measured only at discrete time points (e.g. blood pressure and weight are measured once every year) and these individuals can transition back and forth between multiple states (e.g. disease free, disease stage 1, disease stage 2, death by disease of interest, death by comorbidity, did not show up to the hospital that year, etc.) until one of the absorbing states is reached or the period of study is over. The patients can also re-enter the system if they have been absent for some time. The goal of the analysis is to understand how different cohorts of patients and patient characteristics contribute to the transitions between these states.

I am trying to understand what types of models are generally used for this type of problem.

At first I thought that perhaps the competing risk model (with time-varying covariates) might be suitable, seeing as there are "competing absorption states" (e.g. death by disease of interest vs death by comorbidity), but I am not sure if R packages like "discSurv" currently allow for this. This approach typically does not allow for re-entry and assumes all non-initial states are absorbing. I thought perhaps I could modify the research question and only study transitions for patients who begin the study in the healthy state and see which absorbing state they eventually end up in.

I also thought of simply using the MSM R package (i.e. something similar to Cox-PH) and assume that my discrete times are continuous - but I am not sure if this is a good idea. For example, does the concept of a Q(t) rate matrix make sense when you have a discrete time Markov Chain? This will likely produce a "stepwise" hazard function - but I am not sure if this is mathematically logical.
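
One way to see how a rate matrix relates to yearly observations: under a time-homogeneous continuous-time Markov model, the transition probability matrix over an interval of length t is P(t) = exp(Qt), so a process observed once a year still implies an ordinary one-year transition matrix P(1). The sketch below uses a made-up 3-state Q (its rows sum to zero and the last state is absorbing) and a simple scaling-and-squaring matrix exponential written for illustration. It is not how msm computes things internally.

```python
def mat_mul(A, B):
    """Plain matrix product for square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(Q, t=1.0, squarings=10, terms=12):
    """exp(Q * t) via a truncated Taylor series plus scaling and squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term becomes M^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        # undo the scaling: exp(M)^(2^squarings) = exp(Q * t)
        result = mat_mul(result, result)
    return result


# Hypothetical yearly transition rates (made up for illustration).
Q = [
    [-0.30, 0.25, 0.05],
    [0.10, -0.30, 0.20],
    [0.00, 0.00, 0.00],  # absorbing state
]

P1 = mat_exp(Q, t=1.0)  # one-year transition probabilities
P2 = mat_exp(Q, t=2.0)  # two-year transition probabilities
print(P1)
```

Under this model P(2) equals P(1) multiplied by itself, which is the same k-step rule as a discrete-time chain. The continuous-time model additionally allows several unobserved jumps between two yearly visits.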

Another approach that I have been considering is using several multinomial logistic regressions (e.g. as described in the hesim R package). For example, if there are "n" states and "k" absorbing states (i.e. "n - k" non-absorbing states):

1. Isolate all rows of data in "state 1" and fit a multinomial logistic regression model where the outcomes are "state 1, state 2, state 3... state n". Remember to include "time spent in state" as a covariate in the regression model.
2. Repeat this process for each non-absorbing state, giving "n - k" multinomial logistic regression models in total.
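
The data preparation behind these two steps might look like the sketch below. Everything in it is hypothetical: the records, the state names, and the `build_transition_rows` helper are made up for illustration, and it assumes one observation per patient per year with no gaps. Each output row pairs an origin state with the next observed state and carries a "time spent in state" counter, and the rows are grouped by origin state so that each group can feed one multinomial logistic regression.

```python
from collections import defaultdict

# Hypothetical yearly panel data: (patient_id, year, observed_state).
records = [
    (1, 0, "healthy"), (1, 1, "healthy"), (1, 2, "stage1"), (1, 3, "stage1"), (1, 4, "death"),
    (2, 0, "healthy"), (2, 1, "stage1"), (2, 2, "healthy"), (2, 3, "healthy"),
]
ABSORBING = {"death"}


def build_transition_rows(records):
    """Group one-step transitions by origin state, adding a time-in-state covariate."""
    by_patient = defaultdict(list)
    for pid, year, state in records:
        by_patient[pid].append((year, state))

    subsets = defaultdict(list)
    for pid, visits in by_patient.items():
        visits.sort()  # chronological order within each patient
        time_in_state = 0
        for (year, state), (_, next_state) in zip(visits, visits[1:]):
            if state in ABSORBING:
                break  # no transitions out of an absorbing state
            time_in_state += 1  # consecutive yearly observations in this state so far
            subsets[state].append(
                {"id": pid, "year": year, "time_in_state": time_in_state, "to": next_state}
            )
            if next_state != state:
                time_in_state = 0  # the counter restarts on entering a new state
    return subsets


subsets = build_transition_rows(records)
for origin, rows in subsets.items():
    print(origin, rows)
```

Only non-absorbing states appear as keys, which matches the count of "n - k" models. Handling missed visits or re-entry would need extra rules that this sketch does not attempt.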

Thus, as a recap:

Approach 1: Modify the research question and use Discrete Competing Risks Model with Time Varying Covariates (not sure if an R package exists for this)
Approach 2: MSM with the assumption of Discrete Times being treated as Continuous Time (not sure if this assumption is a good idea)
Approach 3: Multinomial Logistic Regression (note: I also realize that this approach might be more flexible, since if we treat "censoring" as a "state" within the Markov Chain, we don't really have to handle censoring separately. However, this approach will not have a Hazard Function). This approach seems to handle discrete times well?

I think the approach described within "hesim" (Approach 3) is a suitable approach for my problem - do you think this might be reasonable?

Thank you so much for everything!

Note: I had an idea to create an additional time covariate of "total time spent in system" and include this in the multinomial logistic regression (alongside "time spent in a specific state") - but I am not sure if doing this will violate some assumptions.

hesim-dev / hesim

Understanding the Model Fitting Process in Hesim #104
