harrelfe / Hmisc

Harrell Miscellaneous
Other
205 stars 81 forks source link

wtd.quantile #97

Open Stefan2015-5 opened 5 years ago

Stefan2015-5 commented 5 years ago

I noticed a very surprising result of the wtd.quantile function. If the weights are not integers, there seems to be an issue:

> require(Hmisc) > data <- c(1,2,3) > weights<- c(0.1, 0.1, 0.1) > wtd.quantile(data, weights=weights, probs=0.5) 50% 3 > weights<- c(0.1, 0.1, 0.1)* 10 > wtd.quantile(data, weights=weights, probs=0.5) 50% 2

harrelfe commented 5 years ago

This is definitely an error. Code fix welcomed.

Stefan2015-5 commented 5 years ago

Checking the function, I see it is assuming weights are integers, like nr of observations. I saw the function that is currently implemented can not handle (i) any non-integer numbers in weights correctly and (ii) not remove NAs despite the na.rm parameter. I think the easiest solution is to scale the weights up, a quick fix is here. I marked new lines with ##### new


 wtd.quantile<- function (x, weights = NULL, probs = c(0, 0.25, 0.5, 0.75, 1), 
           type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"), normwt = FALSE, 
           na.rm = TRUE)  {
   if (!length(weights))      return(quantile(x, probs = probs, na.rm = na.rm))
   type <- match.arg(type)
   if (any(probs < 0 | probs > 1))      stop("Probabilities must be between 0 and 1 inclusive")
   nams <- paste(format(round(probs * 100, if (length(probs) > 
                                               1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "")

   if(na.rm & any(is.na(weights))){   ###### new
     i<- is.na(weights)
     x <- x[!i]
     weights <- weights[!i]
   }
   i <- weights == 0
   if (any(i)) {
     x <- x[!i]
     weights <- weights[!i]
   }
   if (type == "quantile") {
     if(sum(weights) < 1000000 ) {weights<- weights*1000000/sum(weights)}  ##### new 
     w <- wtd.table(x, weights, na.rm = na.rm, normwt = normwt, 
                    type = "list")
     x <- w$x
     wts <- w$sum.of.weights
     n <- sum(wts)
     order <- 1 + (n - 1) * probs
     low <- pmax(floor(order), 1)
     high <- pmin(low + 1, n)
     order <- order%%1
     allq <- approx(cumsum(wts), x, xout = c(low, high), method = "constant", 
                    f = 1, rule = 2)$y
     k <- length(probs)
     quantiles <- (1 - order) * allq[1:k] + order * allq[-(1:k)]
     names(quantiles) <- nams
     return(quantiles)
   }
   w <- wtd.Ecdf(x, weights, na.rm = na.rm, type = type, normwt = normwt)
   structure(approx(w$ecdf, w$x, xout = probs, rule = 2)$y, 
             names = nams)
 }
harrelfe commented 5 years ago

I apologize for not studying those more before my first reply. I think that in your original example you need to specify normwt=TRUE when weights do not add up to n. When you do that you get the proper median of 2.0. wtd.table handles fractional weights properly, so I don't think your changes are needed other than the more accurate way of dealing with na.rm.

Stefan2015-5 commented 5 years ago

No, even with normwt=TRUE the function produces incorrect results. (On top of that I can't imagine a useful interpretation of the result when normwt=FALSE)

 require(Hmisc)
 data <- 1:3
 weights<- c(0.1, 0.1, 0.8)
 probs=0.09

 wtd.quantile(x=data, weights=weights, probs=probs, normwt=TRUE) #=3 ; incorrect. 
 wtd.quantile2(x=data, weights=weights, probs=probs)  #=1; correct. this uses the modified function I posted above. 

The issue is the one I outlined above. I suggest to drop the parameter normwt, or keep it just for backward compatability without use.

harrelfe commented 5 years ago

normwt=FALSE is very seldom used but there is a legitimate use when weights represent actual case weights, i.e., an observation with a weight of 3 actually was sampled 3 times.

If wtd.quantile fails, then either wtd.table or wtd.ecdf must be the culprit. Scaling weights up is not the ultimate solution and is not scalable to massive problems.

Stefan2015-5 commented 5 years ago

If normwt=FALSE the code produces the very same results as normwt=TRUE except in some cases where either normwt=FALSE fails or both fail. To convince me otherwise please provide a counter example. But even if: "very seldom" is already making it clear: it should certainly not be the standard value.

"If wtd.quantile fails, then either wtd.table or wtd.ecdf must be the culprit" What makes you think that? From what I can see wtd.table behaves just fine and wtd.ecdf is not even called in the incorrect examples. Again: the calcs of the inside function objects order, low and high are assuming weights to be positive integers (under normwt=FALSE and normwt= TRUE). I agree that "scaling up is not the ultimate solution" but if you do not want the whole function changed, it is the only I am seeing.

BTW: the function also does not handle negative weights (neither gives a warning, error or provides a meaningful result). I think the best would be to drop these together with zero weights.

harrelfe commented 5 years ago

I agree re: negative weights. And I agree in principle about changing the default for normwt, but it's too late to do that now.

wtd.table nowhere assumes integer weights, as 'frequencies' are sums of weights. I suspect the problem is more related to the rules that are used for computing quantiles when the quantile being requested is a a value where there are ties in the data, or the wtd.table method is not interpolated correctly to give the requested quantile.

iii-org-tw commented 4 years ago

Any update on this?

I find the following case to have unexpected result.

>>> x <- c(1,2)
>>> weights <- c(1,9)
>>> wtd.quantile(x, weights=weights, normwt=TRUE, prob=0)
0%
2

I believe I can help fixing.

Stefan2015-5 commented 4 years ago

The functions are broken and the maintainer undiscerning (see thread). To me the package is dead.  Use my functions instead, where I fixed the issues. 

wtd.quantile<- function (x, weights = NULL, probs = c(0, 0.25, 0.5, 0.75, 1),
                         type = c("quantile", "(i-1)/(n-1)", "i/(n+1)", "i/n"),
                         na.rm = TRUE)  {
  # Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
  normwt = FALSE
  if (!length(weights))      return(quantile(x, probs = probs, na.rm = na.rm))
  type <- match.arg(type)
  if (any(probs < 0 | probs > 1))      stop("Probabilities must be between 0 and 1 inclusive")
  nams <- paste(format(round(probs * 100, if (length(probs) >
                                              1) 2 - log10(diff(range(probs))) else 2)), "%", sep = "")

  if(na.rm & any(is.na(weights))){   ###### new
    i<- is.na(weights)
    x <- x[!i]
    weights <- weights[!i]
  }
  i <- weights <= 0         # nwe: kill negative and zero weights and associated data
  if (any(i)) {
    x <- x[!i]
    weights <- weights[!i]
  }
  if (type == "quantile") {
    if(sum(weights) < 1000000 ) {weights<- weights*1000000/sum(weights)}  ##### new
    w <- wtd.table(x, weights, na.rm = na.rm, normwt = normwt,
                   type = "list")
    x <- w$x
    wts <- w$sum.of.weights
    n <- sum(wts)
    order <- 1 + (n - 1) * probs
    low <- pmax(floor(order), 1)
    high <- pmin(low + 1, n)
    order <- order%%1
    allq <- approx(cumsum(wts), x, xout = c(low, high), method = "constant",
                   f = 1, rule = 2)$y
    k <- length(probs)
    quantiles <- (1 - order) * allq[1:k] + order * allq[-(1:k)]
    names(quantiles) <- nams
    return(quantiles)
  }
  w <- wtd.Ecdf(x, weights, na.rm = na.rm, type = type, normwt = normwt)
  structure(approx(w$ecdf, w$x, xout = probs, rule = 2)$y,
            names = nams)
}

wtd.table<- function (x, weights = NULL, type = c("list", "table"), normwt = FALSE,
                      na.rm = TRUE) {
  # Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
  p_load(timeDate)
  type <- match.arg(type)
  if (!length(weights))
    weights <- rep(1, length(x))
  isdate <- ( class(x)[1]=="Date"  | class(x)[1]=="POSIXct") ### MODIFIED
  ax <- attributes(x)
  ax$names <- NULL
  if (is.character(x))
    x <- as.factor(x)
  lev <- levels(x)
  x <- unclass(x)
  if (na.rm) {
    s <- !is.na(x + weights)
    x <- x[s, drop = FALSE]
    weights <- weights[s]
  }
  n <- length(x)
  if (normwt)
    weights <- weights * length(x)/sum(weights)
  i <- order(x)
  x <- x[i]
  weights <- weights[i]
  if (anyDuplicated(x)) {
    weights <- tapply(weights, x, sum)
    if (length(lev)) {
      levused <- lev[sort(unique(x))]
      if ((length(weights) > length(levused)) && any(is.na(weights)))
        weights <- weights[!is.na(weights)]
      if (length(weights) != length(levused))
        stop("program logic error")
      names(weights) <- levused
    }
    if (!length(names(weights)))
      stop("program logic error")
    if (type == "table")
      return(weights)
    #x <-  all.is.numeric(names(weights), "vector")#  modified: commented out. function checked whether all are numeric and if yes returned the weights
    if (isdate)      attributes(x) <- c(attributes(x), ax)
    x_out<- as.numeric(names(weights))
    names(weights) <- NULL
    return(list(x = x_out, sum.of.weights = weights))
  }
  xx <- x
  if (isdate)
    attributes(xx) <- c(attributes(xx), ax)
  if (type == "list")
    list(x = if (length(lev)) lev[x] else xx, sum.of.weights = weights)
  else {
    names(weights) <- if (length(lev))
      lev[x]
    else xx
    weights
  }
}

wtd.mean <- function (x, weights = NULL, normwt = "ignored", na.rm = TRUE) {
  # Function taken from HMISC, but issue solved which is documented here: https://github.com/harrelfe/Hmisc/issues/97#issuecomment-429634634
  if (!length(weights))
    return(mean(x, na.rm = na.rm))
  if (na.rm) {
    s <- !is.na(x + weights)
    x <- x[s]
    weights <- weights[s]
  }
  sum(weights * x)/sum(weights)
}
dylanbeaudette commented 3 years ago

@harrelfe Has this issue been addressed in the latest version of Hmisc? The wtd.quantile function is pretty important to the work we do. Thanks for this package and rms!

lrestrepo0001 commented 3 years ago

I'm using Hmisc to generate weighted boxplots with specific quantiles (.05, .25,.5,.75,.95), and I can't get it to to match with the output of STATA's fweight feature. Do you have any suggestions for creating a comparable function?

Paul-SO commented 3 years ago

@Stefan2015-5 @harrelfe Hello :) any news about this?

Stefan2015-5 commented 3 years ago

as already stated: "The functions are broken and the maintainer undiscerning (see thread). To me the package is dead." You are welcome to use my functions instead, where I fixed the issues.