Dirichlet sampling within rdirmultinom

I could be wrong, but should the rdirichlet part here in the rdirmultinom function be pulled out of the loop? By sampling from the Dirichlet on every observation, the average works out to the correct proportions (even with low phi values) and you lose the degraded sample-size effect. Here's a graphical demo:

rdirichlet <- function(p, phi) {
  alpha <- p * phi
  obs <- rgamma(length(p), alpha, 1)
  obs / sum(obs)
}

rdirmultinom_wham <- function(N, p, phi) {
  obs <- rep(0, length(p))
  for (i in seq_len(N)) {
    dp <- rdirichlet(p, phi) # should be outside the loop?
    obs <- obs + rmultinom(n = 1, prob = dp, size = 1)
  }
  obs
}

rdirmultinom2 <- function(N, p, phi) {
  dp <- rdirichlet(p, phi)
  rmultinom(n = 1, prob = dp, size = N) # or add up 1 by 1 as above with dp outside the loop
}

sim_rdirmult <- function(N, phi, type = c("wham", "2"), p = c(0.1, 0.1, 0.4, 0.4)) {
  type <- match.arg(type)
  out <- matrix(ncol = length(p), nrow = 100, data = 0)
  if (type == "wham") {
    for (i in seq_len(nrow(out))) out[i, ] <- rdirmultinom_wham(N, p, phi)
  } else {
    for (i in seq_len(nrow(out))) out[i, ] <- rdirmultinom2(N, p, phi)
  }
  plot(1, ylim = c(1, ncol(out)), xlim = c(1, nrow(out)), type = "n")
  out <- out / N
  for (i in seq(1, nrow(out))) {
    symbols(rep(i, ncol(out)), seq_along(out[1, ]),
      circles = out[i, ] * 4, inches = FALSE, add = TRUE
    )
  }
}

# look nearly identical because the Dirichlet is resampled for every observation:
sim_rdirmult(100, 0.1)

sim_rdirmult(100, 1e3)

# with this version, they look different:
sim_rdirmult(100, 0.1, type = "2")

sim_rdirmult(100, 1e3, type = "2")

get_neff <- function(N, phi) {
  (N + N * phi) / (N + phi)
}

# these should be about the same but aren't here:
sim_rdirmult(300, 40)

sim_rdirmult(get_neff(300, 40), 1e9)

# with this version they approximately match:
sim_rdirmult(300, 40, type = "2")

sim_rdirmult(get_neff(300, 40), 1e9, type = "2")

^{Created on 2021-09-09 by the reprex package (v2.0.0)}

Sean, yes, this is a coding error. Sums of beta-binomials or dirichlet-multinomials do not also have the same distribution. Thank you for finding it. We'll add it to the list. Fortunately, this simulation feature hasn't really been used yet as far as I know.

On Thu, Sep 9, 2021 at 2:11 PM Sean Anderson @.***> wrote:

I could be wrong, but should the rdirichlet part here https://github.com/timjmiller/wham/blob/0c53be0043ec67963b1ae1b1519ac5004a68b8b2/src/helper_functions.hpp#L144 in the rdirmultinom function be pulled out of the loop? By sampling from the Dirichlet on every observation, the average works out to the correct proportions (even with low phi values) and you lose the degraded sample-size effect. Here's a graphical demo:

rdirichlet <- function(p, phi) { alpha <- p phi obs <- rgamma(length(p), alpha, 1) obs / sum(obs) } rdirmultinom_wham <- function(N, p, phi) { obs <- rep(0, length(p)) for (i in seq_len(N)) { dp <- rdirichlet(p, phi) # should be outside the loop? obs <- obs + rmultinom(n = 1, prob = dp, size = 1) } obs } rdirmultinom2 <- function(N, p, phi) { dp <- rdirichlet(p, phi) rmultinom(n = 1, prob = dp, size = N) # or add up 1 by 1 as above with dp outside the loop } sim_rdirmult <- function(N, phi, type = c("wham", "2"), p = c(0.1, 0.1, 0.4, 0.4)) { type <- match.arg(type) out <- matrix(ncol = length(p), nrow = 100, data = 0) if (type == "wham") { for (i in seq_len(nrow(out))) out[i, ] <- rdirmultinom_wham(N, p, phi) } else { for (i in seq_len(nrow(out))) out[i, ] <- rdirmultinom2(N, p, phi) } plot(1, ylim = c(1, ncol(out)), xlim = c(1, nrow(out)), type = "n") out <- out / N for (i in seq(1, nrow(out))) { symbols(rep(i, ncol(out)), seq_along(out[1, ]), circles = out[i, ] 4, inches = FALSE, add = TRUE ) } }

look nearly identical because the Dirichlet is resampled for every observation:

sim_rdirmult(100, 0.1)

https://camo.githubusercontent.com/a5b24345767227e41218b474c2e10bc32248fff73840ad87e1eedd569a72c9b8/68747470733a2f2f692e696d6775722e636f6d2f356f61346d4a492e706e67

sim_rdirmult(100, 1e3)

https://camo.githubusercontent.com/c047a4b23ff8329dd14c4a4e3dd365c795e43655e15048461620d5b5f8a9153b/68747470733a2f2f692e696d6775722e636f6d2f5264376d7478762e706e67

with this version, they look different:

sim_rdirmult(100, 0.1, type = "2")

https://camo.githubusercontent.com/efd6bcc87296abb12517497787dcc08f7c4d68d70e8a2a9d8e364e5c4c0cb928/68747470733a2f2f692e696d6775722e636f6d2f565347523356652e706e67

sim_rdirmult(100, 1e3, type = "2")

https://camo.githubusercontent.com/8447d76a0973e65185719e8c84d756ddadb7a209c04dc2f35f5068b6e13d6f4e/68747470733a2f2f692e696d6775722e636f6d2f7a616a457643582e706e67

get_neff <- function(N, phi) { (N + N * phi) / (N + phi) }

these should be about the same but aren't here:

sim_rdirmult(300, 40)

https://camo.githubusercontent.com/06121de3e536435c1593588a0b8bdc9603c10b30596f23925d4a203e20e2cdde/68747470733a2f2f692e696d6775722e636f6d2f7342324d6867462e706e67

sim_rdirmult(get_neff(300, 40), 1e9)

https://camo.githubusercontent.com/e6fa4b0a5763b7cab95a9dbc7b70dfde0091a21e678dbccc343d5a65fcf8ac8c/68747470733a2f2f692e696d6775722e636f6d2f5478794a3754582e706e67

with this version they approximately match:

sim_rdirmult(300, 40, type = "2")

https://camo.githubusercontent.com/87dbbfd5fa879d1863262a8d162ae927ffeb1e6c4222e292eab6da464274ef66/68747470733a2f2f692e696d6775722e636f6d2f6e33346769376c2e706e67

sim_rdirmult(get_neff(300, 40), 1e9, type = "2")

https://camo.githubusercontent.com/9f7cfbc68736ccec45dce798505d85e0fb52199662380d21978d41889fe23598/68747470733a2f2f692e696d6775722e636f6d2f473754516249552e706e67

Created on 2021-09-09 by the reprex package https://reprex.tidyverse.org (v2.0.0)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/timjmiller/wham/issues/49, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIGN7DDAMZRFOVCVKQHE3TUBD2FDANCNFSM5DXWGAGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Timothy J. Miller, PhD (he, him, his) Research Fishery Biologist NOAA, Northeast Fisheries Science Center Woods Hole, MA 508-495-2365

timjmiller / wham

Dirichlet sampling within rdirmultinom #49

look nearly identical because the Dirichlet is resampled for every observation:

with this version, they look different:

these should be about the same but aren't here:

with this version they approximately match: