shabbychef / ohenery

Modeling of Ordinal Random Variables via Softmax
GNU Lesser General Public License v3.0

Allow observation weights per group #4

Open BenoitLondon opened 9 months ago

BenoitLondon commented 9 months ago

For example, to discount old races, we could weight the likelihood contribution of each group/race, as the weights argument does in lm/glm etc.

(It is different from the "censoring" weights you already have)

shabbychef commented 9 months ago

If I remember correctly, the weights input to harsm and hensm can be used as importance weights as well. The only caveat is that the weight for the last-place participant in a group has no bearing on the outcome. At the very least, the following code runs:

library(ohenery)
library(dplyr)     # mutate, filter, arrange, bind_rows
library(magrittr)  # %<>%
data(best_picture)
best_picture %<>%
  mutate(place=ifelse(winner,1,2)) %>%
  mutate(weight=ifelse(winner,1,0)) %>%
  mutate(down_weight1=weight * ifelse(year < 1960,0.5,1)) %>%
  mutate(down_weight2=weight * ifelse(year < 1960,0,1))

fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + nominated_for_BestActress + nominated_for_BestFilmEditing + Drama + Romance + Comedy

mod0 <- harsm(fmla,data=best_picture,group=year,weights=weight) 
mod1 <- harsm(fmla,data=best_picture,group=year,weights=down_weight1) 
mod2 <- harsm(fmla,data=best_picture,group=year,weights=down_weight2)

(Checking this for semantic correctness...)
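
To see why the last-place weight drops out, here is the weighted Harville (softmax) log-likelihood for a single group, written as a sketch. It assumes the package uses the textbook form with no ties; the symbols are notation introduced here, not taken from the package: $\eta_{(i)}$ is the score of the participant finishing in place $i$, $w_{(i)}$ is that participant's weight, and $n$ is the group size.

```latex
\ell(\eta; w) = \sum_{i=1}^{n} w_{(i)}
  \left[ \eta_{(i)} - \log \sum_{j=i}^{n} \exp\left(\eta_{(j)}\right) \right]
```

For $i = n$ the bracket is $\eta_{(n)} - \log \exp(\eta_{(n)}) = 0$, so $w_{(n)}$ multiplies zero and cannot affect the fit. Setting every weight in a group to zero removes that group's contribution entirely.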

shabbychef commented 9 months ago

At the very least, using zero weights for pre-1960 Oscar awards gives the same results as not including that data in the fit:

# check if 0 weights are akin to missing the whole group
library(ohenery)
library(dplyr)     # mutate, filter
library(magrittr)  # %<>%
data(best_picture)
best_picture %<>%
  mutate(place=ifelse(winner,1,2)) %>%
  mutate(weight=ifelse(winner,1,0)) %>%
  mutate(cutoff=ifelse(year < 1960,0,1)) %>%
  mutate(down_weight=weight * cutoff)

fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + nominated_for_BestActress + nominated_for_BestFilmEditing + Drama + Romance + Comedy

# include the data but zero weight.
mod1 <- harsm(fmla,data=best_picture,group=year,weights=down_weight) 
# do not include the data.
mod2 <- harsm(fmla,data=best_picture %>% filter(cutoff > 0),group=year,weights=weight)
print(mod1)
print(mod2)

I get the same summaries for the two fits.

shabbychef commented 9 months ago

Hmm, I am not able to demonstrate that the weights really act as replication weights, although they are close:

# check if weights are really replication weights.
# give weight 1 to pre-1960 and weight 2 to 1960 and forward
# check if that is the same as including the post 1960 data twice.
library(ohenery)
library(dplyr)     # mutate, filter, arrange, bind_rows
library(magrittr)  # %<>%
data(best_picture)
best_picture %<>%
  mutate(place=ifelse(winner,1,2)) %>%
  mutate(weight=ifelse(winner,1,0)) %>%
  mutate(multiplier=ifelse(year < 1960,1,2)) %>%
  mutate(down_weight=weight * multiplier) %>%
  arrange(year,place)

# dupe it out;
bp <- bind_rows(best_picture,
    best_picture %>% 
    filter(multiplier > 1) %>%
    mutate(year=year+200)) %>%   # get the grouping distinct!
    arrange(year,place)

fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + nominated_for_BestActress + nominated_for_BestFilmEditing + Drama + Romance + Comedy

# include the data once, with weight 2 on 1960 and later
mod1 <- harsm(fmla,data=best_picture,group=year,weights=down_weight) 
# duplicate the data.
mod2 <- harsm(fmla,data=bp,group=year,weights=weight)
print(mod1)
print(mod2)

These give slightly different results, which is annoying.
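
As a sanity check on what to expect, here is a minimal base R sketch of the textbook weighted Harville log-likelihood for one group without ties. It is not ohenery's code: `harville_loglik` is a hypothetical helper and the scores are made up. Under this form, giving a group weight 2 contributes exactly what entering the group twice does, so if the package's objective matches it, the small discrepancy would have to come from elsewhere (optimizer convergence, or any rescaling of the weights, if there is one).

```r
# Weighted Harville log-likelihood for one group, assuming no ties.
# eta: scores listed in finishing order (winner first).
# wt:  non-negative weight for each place, same order as eta.
harville_loglik <- function(eta, wt = rep(1, length(eta))) {
  # for each place i, log(sum(exp(eta[i:n]))) via a reversed cumulative sum
  log_denom <- log(rev(cumsum(rev(exp(eta)))))
  sum(wt * (eta - log_denom))
}

eta <- c(0.3, -0.1, 0.8, 0.0)  # made-up scores, in finishing order

# weight 2 on every place vs. the same group counted twice
once_weighted    <- harville_loglik(eta, wt = rep(2, length(eta)))
twice_unweighted <- 2 * harville_loglik(eta)
print(all.equal(once_weighted, twice_unweighted))

# changing only the last-place weight should not move the value
print(all.equal(harville_loglik(eta, wt = c(1, 1, 1, 1)),
                harville_loglik(eta, wt = c(1, 1, 1, 50))))
```

If both checks pass here but the two harsm fits still disagree, comparing the fitted log-likelihoods and the optimizer's convergence status of the two models would be a reasonable next step.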

BenoitLondon commented 9 months ago

Oh OK, thank you. I didn't realize they could be used like that as well. Maybe the difference is due to scaling (if any)?