jhelvy / logitr

Fast estimation of multinomial (MNL) and mixed logit (MXL) models in R with "Preference" space or "Willingness-to-pay" (WTP) space utility parameterizations in R
https://jhelvy.github.io/logitr/
Other
42 stars 15 forks source link

Unexpected error when displaying summary #29

Closed Ales-G closed 2 years ago

Ales-G commented 2 years ago

Hello, i am running a standard multinomial model of this king.

mnl_pref <- logitr(
  data           = data,
  outcome        = "choice",
  obsID          = "tskID",
  pars           = c("wg","ls20","ls44","hs60","hs75"),
  modelSpace     = "pref",
  clusterID      = "wrkid",
  panelID        = "wrkid",
  numMultiStarts = 100)

the model is calculated correctly, but when I ask to summarize results I get the following error function

summary(mnl_pref)
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

I am not sure what may be causing it. Any suggestions to work around the issue?

jhelvy commented 2 years ago

This type of message usually means there is something wrong with the obsID variable. That variable must be a continuously increasing numeric variable identifying the rows of each choice observation, e.g., 1,1,1,2,2,2,3,3,3, etc. Anything else will cause errors, e.g., 1,1,1,2,2,2,1,1,1,3,3,3 will cause an error because the 1 is repeated later. Can you double check your tskID variable?

Ales-G commented 2 years ago

So this is what the data looks like.

Screenshot 2022-05-19 at 11 41 26

I checked whether there is some error (more or less than two observations per choice task), but it appears to be correct.

> dat_transformed %>% 
+   group_by(tskID) %>% 
+   filter(n()>2) %>%summarise(n=n())
# A tibble: 0 × 2
# … with 2 variables: tskID <dbl>, n <int>

Another hint that may help understand if I am making an error, or if there is a problem with the command:

if I estimate the model with vcov=FALSE the model is correctly computed (though not displayed)

mnl <- logitr(
+   data           = dat_transformed,
+   outcome        = "choice",
+   obsID          = "tskID",
+   pars           = c("wg",myvars),
+   modelSpace     = "pref",
+   clusterID      = "wrkid",
+   panelID        = "wrkid",
+   numMultiStarts = 100)
Setting robust to TRUE since clusters are being used
Running multistart...
  Iterations: 100
  Cores: 7
Done!
> summary(mnl)
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

However, if I try to include the vcov matrix in the model it gives an error while calculating the model.

> mnl_vcov <- logitr(
+   data           = dat_transformed,
+   outcome        = "choice",
+   obsID          = "tskID",
+   pars           = c("wg",myvars),
+   modelSpace     = "pref",
+   clusterID      = "wrkid",
+   panelID        = "wrkid",
+   vcov           = T,
+   numMultiStarts = 100)
Setting robust to TRUE since clusters are being used
Running multistart...
  Iterations: 100
  Cores: 7
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

Do you still think is something that is related to the variable identifying the rows of each choice observation?

jhelvy commented 2 years ago

Okay so the vcov part makes sense in terms of identifying the source of the error. It must be in computing the variance-covariance matrix, which also makes sense that the error happens when you call summary() because that calls vcov() internally to compute the standard errors.

Can you do one more check - comment out the clusterID part and see if it still errors. This will help me understand if the error is coming from the computation of clustered standard errors or if it's something else.

Ales-G commented 2 years ago

Indeed you are correct. If I remove the clustered standard errors the command works fine! It is clearly there the issue!

jhelvy commented 2 years ago

Okay so that means this has to do with how the clustered errors are being computed. You should be able to estimate the model (with clustered errors) without any errors if you have vcov = FALSE (the default). Then if you call vcov(model) you'll get the error. Can you do that and post the message?

Ales-G commented 2 years ago

Apologies for the delaied reply. Here is the results and the error

> mnl_vcov <- logitr(
+    data           = dat_transformed,
+      outcome        = "choice",
+      obsID          = "tskID",
+      pars           = c("wg",myvars),
+      modelSpace     = "pref",
+      clusterID      = "wrkid",
+      panelID        = "wrkid",
+      numMultiStarts = 100)
Setting robust to TRUE since clusters are being used
Running multistart...
  Iterations: 100
  Cores: 7
Done!
> vcov(mnl_vcov)
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

To not however that I have run the model using the robust standard errors and I get the same error.

Is this of any help?

jhelvy commented 2 years ago

Any chance this is also related to the data errors in #30? It looks like it might be based on the errors you're getting.

Ales-G commented 2 years ago

Unfortunately, I don't think so, I believe it has something to do with the SEs, because on this data the model works fine until I do not include robust or clustered standard errors. So I am not sure what to make of it. Any suggestion of things I could try out?

jhelvy commented 2 years ago

The issue is from computing robust standard errors. When doing this, the gradient is computed multiple times according to each cluster defined by the clusterID variable (in your example, wrkid). Perhaps there is something strange about that variable?

jhelvy commented 2 years ago

I believe this issue may also be related to the same bug in #30. This (I think) is resolved with the latest fixes in v0.7.1.

Ales-G commented 2 years ago

Dear jhelvy, thank you very much for all your work and for the numerous improvements to the package you made. I just wanted to let you know that even with the v0.7.1 I have this problem. If I estimate a model thus and I call for the summary I get the following error: Error in rowsum.default(expV, group = obsID, reorder = FALSE) : incorrect length for 'group'

mnl_pref <- logitr::logitr(
  data           = df,
  outcome        = "choice",
  obsID          = "tskID",
  pars           = c("cost",myvars),
  panelID        = "wrkid",
  robust         = T,
  numMultiStarts = 100)

Setting clusterID to 'wrkid' since robust == TRUE and a panelID is provided
Running multistart...
  Iterations: 100
  Cores: 7
Done!

summary(mnl_pref)

Error in rowsum.default(expV, group = obsID, reorder = FALSE) :  incorrect length for 'group'

Conversely, if I comment out the robust option errors the summary function works well and I obtain the results. I am not sure what is happening but I still believe there is an issue with the clustered standard errors.

In any case thanks a lot for the great progress.

jhelvy commented 2 years ago

Thanks for noting this - it's probably still a small bug somewhere in how I wrote the summary function when using clustered errors.

Could you try to make a similar example that creates this error using one of the package data sets, like the yogurt data? That would help a lot in identifying the source of the issue.

Ales-G commented 2 years ago

I tried playing around with it but I was not able to produce the error I have with the real data Compared to yogurt, my data is different because the obsID has gaps, and because not all of the ids have more than a task assigned.

data("yogurt")
random_binary <- function(n, p){
  # p is the proportion of 1s
  x <- c(rep(1, times=n * p), rep(0, times=n * (1 - p)))
  x[sample(length(x))] # or sample(x)
}
yogurt<- yogurt %>% mutate(ran1=random_binary(nrow(yogurt),0.5),
                           ran2=random_binary(nrow(yogurt),0.5),
                           ran3=random_binary(nrow(yogurt),0.5),
                           ran4=random_binary(nrow(yogurt),0.5),
                           ran5=random_binary(nrow(yogurt),0.5))

y1 <- yogurt %>% group_by(id) %>% slice(1:4)
y2 <- yogurt %>% group_by(id) %>% slice(5:8) %>% ungroup() %>% slice(1:200)
yog <- full_join(y1,y2) %>% arrange(obsID)

head(yog,15) # this is what my data looks like 
# A tibble: 15 × 12
# Groups:   id [2]
id obsID   alt choice price  feat brand    ran1  ran2  ran3  ran4  ran5
 1     1     1     1      0  8.1      0 dannon      1     0     1     1     1
 2     1     1     2      0  6.10     0 hiland      1     0     0     1     1
 3     1     1     3      1  7.90     0 weight      1     1     0     0     0
 4     1     1     4      0 10.8      0 yoplait     1     1     0     1     1
 5     1     2     1      1  9.80     0 dannon      1     0     0     0     0
 6     1     2     2      0  6.40     0 hiland      1     0     1     1     1
 7     1     2     3      0  7.50     0 weight      0     1     0     0     1
 8     1     2     4      0 10.8      0 yoplait     1     0     1     1     1
 9     2     9     1      0  9.80     0 dannon      1     0     0     1     1
10     2     9     2      0  5.00     0 hiland      0     0     1     1     0
11     2     9     3      0  7.90     0 weight      0     1     0     0     1
12     2     9     4      1 10.8      0 yoplait     0     0     1     0     1
13     2    10     1      0  9.80     0 dannon      0     1     1     0     0
14     2    10     2      0  5.00     0 hiland      0     1     0     0     0
15     2    10     3      0  7.90     0 weight      1     0     0     0     1

However estimating a model with robust standard errors on this data does not give errors.

mnl_pref <- logitr::logitr(
  data           = yog,
  outcome        = "choice",
  obsID          = "obsID",
  pars           = c("price", "feat", "brand","ran1","ran2","ran3","ran4","ran5"),
  panelID        = "id",
  robust         = T,
  clusterID      = "id",
  numMultiStarts = 100)
summary(mnl_pref)

Note that I tried to re-create the obsID variable for my real data to make it increase incrementally without gaps and I still experience the same error.

Note also that I obtain the same error in the estimation phase if I include vcov=T option

Ales-G commented 2 years ago

So I figured out how to create a small dataset that reproduces the same error

First create the following data

data <- structure(list(id = structure(c(1, 1, 8, 8, 12, 12, 13, 13, 
                                         13, 13, 14, 14, 16, 16, 42, 42, 47, 47, 51, 51, 53, 53, 55, 55, 
                                         57, 57, 62, 62, 63, 63, 65, 65, 69, 69, 71, 71, 77, 77, 78, 78, 
                                         79, 79, 83, 83, 84, 84, 90, 90, 95, 95, 95, 95, 96, 96, 100, 
                                         100, 102, 102, 104, 104, 107, 107, 108, 108, 108, 108, 112, 112, 
                                         113, 113, 114, 114, 118, 118, 130, 130, 132, 132, 139, 139, 139, 
                                         139, 141, 141, 146, 146, 147, 147, 147, 147, 153, 153, 167, 167, 
                                         169, 169, 170, 170, 176, 176), format.stata = "%12.0g"), tskID = structure(c(2, 
                                                                                                                      2, 23, 23, 35, 35, 37, 37, 39, 39, 40, 40, 48, 48, 126, 126, 
                                                                                                                      139, 139, 153, 153, 157, 157, 163, 163, 171, 171, 186, 186, 188, 
                                                                                                                      188, 193, 193, 205, 205, 213, 213, 230, 230, 233, 233, 235, 235, 
                                                                                                                      248, 248, 251, 251, 268, 268, 284, 284, 285, 285, 287, 287, 298, 
                                                                                                                      298, 306, 306, 311, 311, 320, 320, 322, 322, 323, 323, 335, 335, 
                                                                                                                      339, 339, 341, 341, 352, 352, 388, 388, 395, 395, 415, 415, 416, 
                                                                                                                      416, 421, 421, 438, 438, 439, 439, 440, 440, 459, 459, 501, 501, 
                                                                                                                      506, 506, 508, 508, 528, 528), format.stata = "%10.0g"), choice = structure(c(1, 
                                                                                                                                                                                                    0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 
                                                                                                                                                                                                    1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 
                                                                                                                                                                                                    1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 
                                                                                                                                                                                                    1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 
                                                                                                                                                                                                    0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0), format.stata = "%10.0g"), 
                     cost = structure(c(16.3, 15.6, 15.4, 15.1, 16.5, 16.4, 16.7, 
                                      15.6, 15.8, 16.5, 15.5, 15.9, 15.7, 16.4, 15.1, 15.9, 16.6, 
                                      15, 16.1, 16.3, 15.7, 17.9, 16.2, 15.3, 16.4, 16.5, 17.1, 
                                      17.3, 15.2, 16.1, 16.5, 16.9, 15, 16.8, 16.5, 17.4, 15.2, 
                                      15, 16.4, 15.4, 16.8, 16.1, 16.1, 15.4, 16.8, 17.8, 16.6, 
                                      17.6, 15.6, 15.7, 16.3, 16.3, 16, 17.3, 17.4, 17.9, 15, 17.6, 
                                      15.4, 16.2, 15.6, 17.4, 17.8, 15.7, 17.2, 17.1, 15.9, 17.5, 
                                      17, 16.4, 15.2, 16.3, 15.3, 15.2, 16.3, 15.6, 15.2, 15.7, 
                                      15.5, 15.7, 16.2, 15.6, 17.1, 16.5, 15.1, 16.8, 16.6, 16.3, 
                                      15.3, 15.9, 16.1, 16.3, 17.9, 15.9, 15.4, 16.1, 16.4, 15, 
                                      17.7, 15.8)), 
                     var1 = structure(c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 
                                            1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 
                                            0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 
                                            0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 
                                            1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 
                                            1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1)), 
                     var2 = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                           0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 
                                           0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 
                                           0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 
                                           1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 
                                           0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1)), 
                     var3 = structure(c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 
                                           1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 
                                           0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 
                                           0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 
                                           0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 
                                           0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0))), row.names = c(NA, 
                                                                                                                                                      -100L), class = c("tbl_df", "tbl", "data.frame"))

Second try to estimate the a MNL with robust SEs.

mnl_pref <- logitr::logitr(
  data           = data,
  outcome        = "choice",
  obsID          = "tskID",
  pars           = c("cost","var1","var2","var3"),
  panelID        = "id",
  robust         = T,
  numMultiStarts = 100)
Setting clusterID to 'id' since robust == TRUE and a panelID is provided
Running multistart...
  Iterations: 100
  Cores: 7
Done!
summary(mnl_pref)
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

The command correctly estimates it, but it encounters problems with displaying the summary. Note that if you comment out the robust option the summary function works correctly!

Is there something I can do to overcome the problem?

jhelvy commented 2 years ago

I am almost certain that the issue is in how the ID variables are set. As it is currently written, the package needs IDs that are in sequentially increasing order, like 1,1,2,2,3,3, etc. The gaps in the IDs are probably the issue. You could test this by taking your demo dataset and over-writing the IDs with a sequentially ID, then see if it works. If that works, then we at least know the source of the problem.

This shouldn't be something that happens though in terms of how I would want the package to work. As long as the IDs are unique, being in a sequential order shouldn't matter, e.g. 1,1,8,8,13,13, should work just the same. So if that is indeed the problem, then I'll try to add a fix to make the package work even if the IDs have gaps.

Ales-G commented 2 years ago

I tried to follow your suggestion, but unfortunately it still gives the same error

data<- data %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(group_id = cur_group_id())

mnl_pref <- logitr::logitr(
  data           = data,
  outcome        = "choice",
  obsID          = "tskID",
  pars           = c("cost","var1","var2","var3"),
  panelID        = "group_id",
  robust         = T,
  numMultiStarts = 100)
summary(mnl_pref)
Error in rowsum.default(expV, group = obsID, reorder = FALSE) : 
  incorrect length for 'group'

Is it possible that it also relates to the fact that some individuals perform a different number of tasks?

jhelvy commented 2 years ago

I believe I have found the source of the issue. I am working to add better checks to all of the variable IDs while also addressing this bug.

jhelvy commented 2 years ago

Okay, I think this is now corrected. Can you install the latest version (0.7.2) from Github and test it for me?

# install.packages("remotes")
remotes::install_github("jhelvy/logitr")

The issue has persisted this whole time and it simply had to do with the particular case where a cluster had only one observation. In these instances, when computing the clustered standard errors the data had only one row and was no longer being stored as a matrix, so I had to coerce it back to a matrix for the math operations to correctly execute.

Ales-G commented 2 years ago

Thanks you very much. This is great! It works perfectly now!