Error in if (inputs$clusterID == inputs$panelID) { : argument is of length zero

jonah-allen commented 2 weeks ago

First of all, thank you for developing a fantastic package! I am having an issue implementing clusters. Could you please provide insight into why this error occurs or suggest any adjustments to handle clustering properly? Thank you for your support!

Description

I encountered an issue when running the logitr() function with the clusterID parameter specified. While the model runs successfully without clusterID, adding it results in an error. The error message is:

Error in if (inputs$clusterID == inputs$panelID) { :
  argument is of length zero

I created a clusterID column specifically to ensure the data type in the version column was not the issue. (The scenario columns are not currently in use but may be in the future).

Reproducible Example

Here is a sample structure of my dataset (wtp_risk):

# A tibble: 180 × 12
   version  risk loss_value id         original_choice scenario_a scenario_b scenario_c alt   choice obsID clusterID
   <fct>   <dbl>      <dbl> <chr>      <fct>                <dbl>      <dbl>      <dbl> <fct>  <dbl> <int>    <int>
 1 CO1         8         0  R_bf8TOsp… c                        2          3          1 a          0    53        1
 2 CO1         4       113. R_bf8TOsp… c                        2          3          1 b          0    53        1
 3 CO1         4       113. R_bf8TOsp… c                        2          3          1 c          1    53        1
 4 CO1         8         0  R_2tnw7mp… a                        1          2          2 a          1    27        2
 5 CO1         4       113. R_2tnw7mp… a                        1          2          2 b          0    27        2
 6 CO1         4       113. R_2tnw7mp… a                        1          2          2 c          0    27        2

Code to Reproduce

# Run without clusterID (successful)
mnl_pref <- logitr(
  data = wtp_risk,
  outcome = "choice",
  obsID = "obsID",
  pars = c("loss_value", "risk")
)

# Run with clusterID (error)
mnl_pref <- logitr(
  data = wtp_risk,
  outcome = "choice",
  obsID = "obsID",
  pars = c("loss_value", "risk"),
  clusterID = "clusterID"
)

Observations

The wtp_risk data frame does not contain NA values in obsID, clusterID, or version.
The clusterID was created to ensure correct data type handling.
The error seems related to how panelID is internally processed in the logitr function, even when panelID is not specified. (This is not panel data).
I have tried setting panelID = NULL specifically and the error still occurs.

Environment

R version: 4.2.3
logitr version: 1.1.2
macOS: 15.0.1

jhelvy commented 2 weeks ago

Could you send a small portion of the data so I can replicate this error? Would only need the relevant variables used in the example: "choice", "obsID", "loss_value", "risk", "clusterID"

jonah-allen commented 2 weeks ago

sample_wtp_risk_data.csv

jonah-allen commented 2 weeks ago

Just realizing that only has cluster groups 1 and 2 included -- there are 25 cluster groups in my data....I think you can manually change those for replication but let me know if you need a different sample!

jhelvy commented 2 weeks ago

Okay I just ran this and I can replicate the error. It is perhaps a bug in the code, but I'm not sure if it should occur because I'm questioning the use of clusters here. Usually clustering is suggested when you have panel data. In your case, you have different versions. Is that just different versions of a choice experiment? If so then I'm not sure why you would want to cluster your errors around the version. Basically, I don't think clustering is needed.

If you do want to use clusters, then as a workaround you can also set panelID = "clusterID" and it will work. With an MNL model there is no difference in the calculation of the log-likelihood with or without a panelID specified, so this will get you what you want without error. You just need to specify both clusterID and panelID like this:

m2 <- logitr(
  data      = wtp_risk,
  outcome   = "choice",
  obsID     = "obsID",
  pars      = c("loss_value", "risk"),
  clusterID = "clusterID",
  panelID   = "clusterID"
)

jonah-allen commented 2 weeks ago

Thanks very much, that fixed the issue!

Interesting -- my understanding was that clustering by the survey version is best practice because I have significant variation across survey versions; parameters vary (percent risk & percent profit loss) across three options, and the "source of risk" varies across half the surveys (half are viewed as "climate" and the other half as "policy", without going into too much detail). I know that description is very general...but any resources you might be able to share on clustering in this case would be much appreciated!

jhelvy commented 2 weeks ago

I suppose that's a reasonable assumption. This is in general though an issue that I'll have to deal with because you should be able to use clusters without defining a panelID. This is a workaround for now, but I'll patch this.
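
For context, here is a minimal base-R sketch of why a check like the one in the error message fails when panelID is left unset. The `inputs` list below is a hypothetical stand-in, not the package's actual internal object, and the null-safe guard is only one possible way to write such a check, not the patch itself. Comparing a string against NULL returns a zero-length logical, and if() cannot branch on that:

```r
# Hypothetical stand-in for the inputs list named in the error message;
# panelID is NULL because it was never supplied.
inputs <- list(clusterID = "clusterID", panelID = NULL)

# A comparison against NULL yields logical(0), not TRUE or FALSE.
cmp <- inputs$clusterID == inputs$panelID
length(cmp)

# if() needs a length-one condition, so the original check errors.
msg <- tryCatch(
  if (inputs$clusterID == inputs$panelID) "same" else "different",
  error = function(e) conditionMessage(e)
)
msg

# One null-safe way to write the check (an assumption, not the actual fix):
# short-circuit on is.null() before comparing.
same_id <- !is.null(inputs$panelID) && inputs$clusterID == inputs$panelID
same_id

# identical() is another option; it always returns a single TRUE or FALSE.
identical(inputs$clusterID, inputs$panelID)
```

This also explains why passing panelID = NULL explicitly does not help: the value is NULL either way, so the comparison still has length zero.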
