utterances-bot commented 2 years ago

What To Do (And Not to Do) with Modeling Proportions/Fractional Outcomes | Robert Kubinec

Introduction: Limited dependent variables, or continuous variables with lower and upper bounds, are quite common in the social sciences but do not fit easily with existing statistical models. In this Rmarkdown document, I show why these issues are important to consider when modeling your data, discuss existing R packages useful for fitting these models, and also present ordbetareg, an R package with a new variant of Beta regression that builds on and simplifies existing approaches (see paper here that is forthcoming in Political Analysis).

http://www.robertkubinec.com/post/limited_dvs/

briatte commented 2 years ago

This looks very promising, and the documented examples here and in the preprint are very helpful. Thanks a lot!

I have a question, though: what about the approach developed by Ramalho in his frm, frmhet and frmpd packages, which seem to have been developed specifically to apply fractional regression to panel data?

It does not help that those packages were recently removed from CRAN, but they can still be accessed from the archive. Michael Clark's post on fractional regression also mentions the frm package and cites the accompanying papers.

saudiwin commented 2 years ago

Hi @briatte -

The "fractional regression" literature is identical to the "fractional logit" model I critique in the blog post. For example, if you look at the frm package (which, worryingly, has been taken off CRAN), you'll find the following in the help file for the frm function:

```r
N <- 250
u <- rnorm(N)

X <- cbind(rnorm(N),rnorm(N))
dimnames(X)[[2]] <- c("X1","X2")

ym <- exp(X[,1]+X[,2]+u)/(1+exp(X[,1]+X[,2]+u))
y <- rbeta(N,ym*20,20*(1-ym))
y[y > 0.9] <- 1
```

This is the simulation that the package uses to create sample data to show how frm should be used. What should be clear is that the data here are actually being simulated from the beta distribution using the rbeta function (with values above 0.9 then set to 1). Why is this? Because fractional logit is not a statistical distribution, so you cannot draw data from it.
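To make this concrete, here is a minimal base R sketch that reruns the help-file simulation and then fits a fractional logit to it. The seed, the data frame name `sim`, and the use of `glm()` with `family = quasibinomial` as a stand-in for `frm` are my assumptions for illustration; this is not the package's own code path.

```r
set.seed(123)  # assumed seed, only so the sketch is reproducible

N <- 250
u <- rnorm(N)
X <- cbind(X1 = rnorm(N), X2 = rnorm(N))

# Mean on the (0, 1) scale; plogis(x) is exp(x) / (1 + exp(x))
ym <- plogis(X[, "X1"] + X[, "X2"] + u)

# The outcome is drawn from a beta distribution with precision 20 ...
y <- rbeta(N, ym * 20, 20 * (1 - ym))
# ... and the top of the scale is then pushed onto the boundary
y[y > 0.9] <- 1

sim <- data.frame(y = y, X)

# Share of observations sitting exactly at the upper bound
mean(sim$y == 1)

# Fractional logit: a logit mean model fit by quasi-likelihood,
# with no distribution assumed for the outcome
frac_logit <- glm(y ~ X1 + X2,
                  family = quasibinomial(link = "logit"),
                  data = sim)
coef(frac_logit)
```

The only random draws for the outcome come from `rbeta()`; the fractional logit enters only at the fitting step.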

If a package is simulating data from a different distribution, it makes much more sense to use that actual distribution to model the data as well, i.e., using (ordered) beta regression as I discuss in my post. Otherwise mis-specification is literally built right into the estimates. I don't think fractional logit is particularly relevant given the options we have today for beta regression; my ordered beta regression model deals with one of the last remaining issues (observations that are at the boundary).
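As a rough illustration of what "using that actual distribution" means, the sketch below simulates beta outcomes and estimates a beta regression by maximizing the beta likelihood directly in base R. Everything in it is an assumption for illustration: the seed, the true coefficients (0, 1, 1), the precision of 20, the helper name `neg_loglik`, and the choice of `optim()` with Nelder-Mead. It is not the ordbetareg implementation, and it only covers outcomes strictly inside (0, 1), which is exactly the gap that boundary observations create.

```r
set.seed(123)  # assumed seed, for reproducibility only

N <- 250
X1 <- rnorm(N)
X2 <- rnorm(N)

# Assumed data-generating process: logit link for the mean, precision 20
mu_true <- plogis(X1 + X2)
y <- rbeta(N, mu_true * 20, (1 - mu_true) * 20)

# Plain beta regression only covers the open interval (0, 1),
# so any draw that lands exactly on a boundary is dropped here
keep <- y > 0 & y < 1

# Negative log-likelihood in the mean/precision parameterization:
# shape1 = mu * phi, shape2 = (1 - mu) * phi
neg_loglik <- function(par, y, X1, X2) {
  mu  <- plogis(par[1] + par[2] * X1 + par[3] * X2)
  phi <- exp(par[4])  # log scale keeps the precision positive
  -sum(dbeta(y, mu * phi, (1 - mu) * phi, log = TRUE))
}

fit <- optim(par = c(0, 0, 0, 0), fn = neg_loglik,
             y = y[keep], X1 = X1[keep], X2 = X2[keep],
             method = "Nelder-Mead",
             control = list(maxit = 5000))

est <- c(intercept = fit$par[1], X1 = fit$par[2], X2 = fit$par[3],
         phi = exp(fit$par[4]))
est
```

Because the same distribution generates and models the data, the estimates should land near the assumed values. Once observations sit exactly at 0 or 1, this likelihood no longer applies to them.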
