insongkim / PanelMatch

111 stars 34 forks source link

Potential bug of Moderator in PanelMatch #129

Closed Xueyun-Han closed 10 months ago

Xueyun-Han commented 11 months ago

Thanks for authors' work and genius idea for the amazing PanelMatch package.

The scenario in my research is a little complex. Firstly, every member in the treatment group got the policy change at different time. This made me must find some stagger DID method. Secondly, in my field most research also need to match control group members for a treatment group member using propensity score. When considering the lag history, a lot of research just use mean value to describe it. But in my case, an increasing trend and a decreasing trend with the same mean value represent totally different history and trend. The simple calculation of mean value would make totally wrong match in my case. Thirdly, although most robust check want a parallel trend test, I got some comments from conference presentation suggesting an alternative placebo test.

To solve the above problems, I tried almost all Stata packages mentioned anywhere for stagger DID. All these staggered DID packages cannot deal with the PSM challenge with different trend at the same time. Thanks for the PanelMatch to save my research. Different treatment time, PSM with trend consideration and Placebo test perfect support my research and imply a promising future. I used PanelMatch quite smoothly to do my data analysis. It worked perfectly in most cases. When I induced some moderator consideration, it works quite normal. However, when I classified the data in more specific setting, although it still works without moderator, when I set moderator variable in some specific setting, I encountered an error:

Error in lsets[lsets != 0] * (2): non-numeric argument to binary operator Traceback:

  1. PanelEstimate(sets = PM.results, data = data, df.adjustment = FALSE, . confidence.level = 0.95, moderator = moderator)
  2. lapply(set.list, FUN = panel_estimate, se.method = se.method, . number.iterations = number.iterations, df.adjustment = df.adjustment, . confidence.level = confidence.level, data = data, pooled = pooled)
  3. FUN(X[[i]], ...)
  4. prepareData(data.in = data, lead = lead, sets.att = sets.att, . sets.atc = sets.atc, continuous.treatment = continuous.treatment, . qoi.in = qoi, dependent.variable = dependent)
  5. getWits(lead = j, data = data.in, matched_sets = sets.att, continuous.treatment = continuous.treatment)
  6. pcs(matched_sets, lead, continuous.treatment = continuous.treatment)

After a thorough examination of potential error-inducing areas, I eventually identified the root cause by inspecting the return outcomes. The error stemmed from 'lsets' being either NA or NULL, which was related to the matched.set when the data was classified by moderator values into subsets. In my specific case, the 'ret.obj' within 'return(ret.obj)' in the 'handle_moderating_variable' function looked like this when corresponding to a moderator value:

$4 $att user_patient_id date matched.set.size n_row.......

$5 $att user_patient_id date matched.set.size n_row.......

$6 $att user_patient_id date matched.set.size 1 13672 151 0 2 13949 235 0 3 13995 385 0

For moderator value '6', all IDs had a matched.set.size of 0, leading to a cascade of errors as I struggled to pinpoint the issue. For moderator values '4' and '5', even when 'ret.obj' contained only one row, the code worked as expected, although it couldn't provide an estimate of ATT by period.

To address this issue, I introduced some conditional checks within the 'handle_moderating_variable' function in 'R/PE_lower_level.R':

The last two rows in the 'Source code' of 'handle_moderating_variable' are: names(ret.obj) <- as.character(as.vector(na.omit(moderating.values))) return(ret.obj)

I insert some conditional checks between them, they are:

for (i in valid.moderating.values.vector) {

for (i in moderating.values.vector) { character.i <- as.character(i) ret.obj.i <- ret.obj[[character.i]][[qoi.in]]

    summary.i <- summary(ret.obj.i)
    number.of.treated.units <- summary.i$number.of.treated.units
    num.units.empty.set <- summary.i$num.units.empty.set

    # Check whether all rows of matched.set.size are 0. 
    # If it is true, then drop the obj corresponding to that moderator value.
    if (number.of.treated.units == num.units.empty.set) {
        ret.obj[[character.i]]<- NULL
    }

}

After I modified this part of the code, all my left PanelMatch modules can work as usual.

With these modifications, all remaining PanelMatch modules resumed working as usual. Being more proficient in Python than in R, I might have misunderstood the source code's logic. While this solution worked for my specific case, I cannot guarantee its applicability to other scenarios.

Upon delving into the logic of the tscs paper (Imai, Kim, and Wang, 2021), I initially entertained the idea of creating a Python package for it. However, after thoroughly examining all the modules and their source code, I realized the complexity of the task and the extensive work required.

adamrauh commented 11 months ago

Thanks for raising this as well as for your thorough description of the issue! I will look into this soon. I know that at least one similar issue has been raised before...

adamrauh commented 10 months ago

Hi @Xueyun-Han , you were pretty much correct about the nature of the issue. Thanks so much for finding this. I patched the code in a slightly different manner than what you proposed, but I believe the package should behave more appropriately. Rather than throw an error, it simply removes the problematic moderator value from the list of returned results. The patches were applied in this commit. Please let us know if you have more problems/questions/etc.