tdhock / aum

1 stars 2 forks source link

vignette about asymptotic complexity of line search #5

Open tdhock opened 1 year ago

tdhock commented 1 year ago

after going through a few iterations of the first for loop in https://github.com/tdhock/aum/blob/main/vignettes/line-search.Rmd I executed this code

  N.seq <- as.integer(10^seq(2,log10(max(diff.list$subtrain$example)),l=10))
  N.seq <- as.integer(10^seq(log10(100),log10(1600),l=10))
  atime.list <- atime::atime(
    N=N.seq,
    setup={
      maxIterations <- N*(N-1)/2
      X.subtrain <- X.keep[index.list$subtrain,]
      X.sub <- X.subtrain[1:N,]
      diff.sub <- diff.list$subtrain[example %in% seq(0,N-1)]
    },
    result=TRUE,
    seconds.limit=1,
    times=1,
    aum_line_search={
      nb.weight.search <- aum::aum_line_search(
        diff.sub,
        maxIterations=maxIterations,
        feature.mat=X.sub,
        weight.vec=weight.vec)
      list(
        total.iterations=nrow(nb.weight.search$line_search_result),
        iterations.to.min=nb.weight.search$line_search_result[,which.min(aum)])
    })
  atime.list$measurements[, total.iterations := sapply(
    result, function(L)L$total.iterations)]
  atime.list$measurements[, iterations.to.min := sapply(
    result, function(L)L$iterations.to.min)]

  best.list <- atime::references_best(atime.list, unit.col.vec=c("total.iterations","iterations.to.min"))
best.ref <- best.list$ref[each.sign.rank==1]
  library(ggplot2)
  gg <- ggplot()+
    theme_bw()+
    facet_grid(unit ~ ., scales="free")+
    geom_line(aes(
      N, empirical, color=expr.name),
      data=best.list$meas)+
    scale_x_log10()+
    scale_y_log10("median line, min/max band")
  gg.show <- gg+
     directlabels::geom_dl(aes(
       N, empirical, color=expr.name, label=expr.name),
       method="right.polygons",
       data=best.list$meas)+
     theme(legend.position="none")+
     coord_cartesian(xlim=c(min(N.seq),max(best.list$meas$N)*5))
  gg.ref <- gg.show+
    geom_line(aes(
      N, reference, group=paste(fun.name, expr.name)),
      color="grey",
      data=best.ref)
    gg.ref+
      directlabels::geom_dl(aes(
        N, reference,
        label.group=paste(fun.name, expr.name),
        label=fun.name),
        data=best.ref,
        color="grey",
        method="left.polygons")

and I got this plot image which suggests that the number of iterations in line search is quadratic, and so is the number of iterations to get to the min. @phase Would be nice to have a vignette that explores this more systematically,

tdhock commented 1 year ago

I did an analysis of all the neuroblastoma-data using the new code in #6 (keep doing more line search iterations until AUM increases) and I observed that the number of iterations that takes is quadratic in the number of input breakpoints/lines. So probably too slow for a vignette on CRAN, closing. image source: https://github.com/tdhock/max-generalized-auc/blob/master/figure-line-search-complexity.R

tdhock commented 1 year ago

previous plot was "keep doing more iterations of line search while subtrain aum is decreasing." what would the plot look like if we did validation aum instead of subtrain?

tdhock commented 1 year ago

data for "keep doing more iterations of line search while subtrain aum is decreasing" here https://github.com/tdhock/max-generalized-auc/blob/master/figure-line-search-complexity.csv can we add the total number of iterations of approx/constant line search? (rather than keep going line search) it should be linear (smaller slope)

tdhock commented 1 year ago

actually, even the approx line search (exactL, linear number of iterations of exact line search algorithm) does a quadratic number of iterations, same as min.aum (keep doing more iterations while subtrain aum is decreasing), see below: image To explain the result above, we can examine the number of steps of gradient descent, which is larger for exactL and smaller for exactQ (quadratic number of iterations, full exact line search algorithm), and smaller for min.aum, see below: image The overall timings (including overhead of R memory allocation etc) are shown below, and suggest that the aum.min method is slightly faster, but all three methods are about the same, image image source code: https://github.com/tdhock/max-generalized-auc/blob/9574892ed8204771cef360d06756a5aacecd5e99/figure-line-search-complexity-compare.R Also max validation AUC is about the same between methods, see below, image There is a slight increase of AUC for min.aum/exactQ over exactL.

tdhock commented 1 year ago

On this data set, init=zero gets larger valid AUC than init=IRCV. And for IRCV we see that maxIterations=min.aum is consistently better than grid search. image

phase commented 1 year ago

Here are some graphs from my tests. I think I've reproduced exactL taking a large amount of steps of gradient descent. image This makes me wonder if doing the full quadratic amount of iterations and then checking a few grid points would improve hybrid.

image image

tdhock commented 1 year ago

THanks for sharing, those results look consistent. "This makes me wonder if doing the full quadratic amount of iterations and then checking a few grid points would improve hybrid." -> checking a few grid points would not help quadratic because the quadratic already checks all possible step sizes.

phase commented 1 year ago

oh yes - not sure what I was thinking!

I ran some more tests with different hybrid variants and got similar results

image

image image