steve-the-bayesian / BOOM

A C++ library for Bayesian modeling, mainly through Markov chain Monte Carlo, but with a few other methods supported. BOOM = "Bayesian Object Oriented Modeling". It is also the sound your computer makes when it crashes.
GNU Lesser General Public License v2.1
35 stars 14 forks source link

bsts hangs with MonthlyAnnualCycle and short data #45

Closed davidtedfordholt closed 4 years ago

davidtedfordholt commented 4 years ago

I've found a couple of series that can be modeled with only a monthly-annual cycle component and as little as 34 days, but a series with too few days seems to get stuck in a loop. The following example hangs when trying to model 60 days.

library(bsts)

data <- zoo(runif(100, 0, 10), seq.Date(from = as.Date("2020-01-01"), by = 1, length.out = 100))

for (i in length(data):1) {
  print(i)
  state <- AddMonthlyAnnualCycle(state.specification = list(), y = data[1:i])
  bsts(data[1:i], state.specification = state, niter = 10, ping = 0)
}

It could be as simple as a test for length of y in the AddMonthlyAnnualCycle() function, but because I've found spots where it didn't hang until there were even fewer observations, I'm unsure. I assume it might be a matter of needing to have the entirety of at least one full calendar month, with a day on either side? If so, the test would need to be for that.

davidtedfordholt commented 4 years ago

If the right answer is to deal with it in the interface, I'm happy to write it, if you'd like. If it's in the C++, I'm far less useful.

davidtedfordholt commented 4 years ago

After testing, it looks like it does require a full calendar month and one day before and after to that calendar month. The following runs:

library(bsts)
data <- zoo(runif(100, 0, 10), seq.Date(from = as.Date("2019-12-31"), by = 1, length.out = 33))
state <- AddMonthlyAnnualCycle(state.specification = list(), y = data)
bsts(data, state.specification = state, niter = 10, ping = 0)

Shortening the data by one observation on either end will cause it to hang. The following logical statement tests this on a tibble of data (with the time index column index):

!tbl_data %>%
  dplyr::count(month = lubridate::floor_date(index, unit = "month")) %>%
  dplyr::mutate(
    days_in_month = lubridate::days_in_month(month),
    test = ifelse(n == days_in_month & dply::lag(n) > 0 & dply::lead(n) > 0, TRUE, FALSE)
  ) %>%
  dplyr::pull(test) %>%
  any(na.rm = TRUE)

(I realize that it relies on dplyr, lubridate, and tibble, but since those are all already loaded in fable.bsts it isn't additional overhead. I simply haven't thought through a method using base R or just lubridate)

steve-the-bayesian commented 4 years ago

Hi David, If you can screen for this in the R code and present an appropriate warning then please send me updated code and I'll put it in the next release. Otherwise I'll add this to the pile and get to it when I can. That's less and less frequent these days, I'm afraid. :( Thanks for the bug report in any case. Steve

On Tue, Jan 28, 2020 at 7:37 AM David notifications@github.com wrote:

After testing, it looks like it does require a full calendar month and one day before and after to that calendar month. The following runs:

library(bsts) data <- zoo(runif(100, 0, 10), seq.Date(from = as.Date("2019-12-31"), by = 1, length.out = 33)) state <- AddMonthlyAnnualCycle(state.specification = list(), y = data) bsts(data, state.specification = state, niter = 10, ping = 0)

Shortening the data by one observation on either end will cause it to hang. The following logical statement tests this on a tibble of data:

!tbl_data %>% dplyr::count(month = lubridate::floor_date(index, unit = "month")) %>% dplyr::mutate( days_in_month = lubridate::days_in_month(month), test = ifelse(n == days_in_month & dply::lag(n) > 0 & dply::lead(n) > 0, TRUE, FALSE) ) %>% dplyr::pull(test) %>% any(na.rm = TRUE)

(I realize that it relies on dplyr, lubridate, and tibble, but since those are all already loaded in fable.bsts it isn't additional overhead. I simply haven't thought through a method using base R or just lubridate)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/steve-the-bayesian/BOOM/issues/45?email_source=notifications&email_token=ABMVDVNFZPHJKQ6LUM4CUI3RABGLPA5CNFSM4KMTGVNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKDX4NI#issuecomment-579305013, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMVDVMZUOGCOC2LNVB4CEDRABGLPANCNFSM4KMTGVNA .

davidtedfordholt commented 4 years ago

No problem, Steve. I'll get on it this weekend. Happy to help! Would you prefer that I fork and submit a PR, or just send you replacement code for AddMonthlyAnnualCycle()? (I'm happy to do the former if it doesn't burden, as I'm slowly forcing myself to accept good CS practices) David

steve-the-bayesian commented 4 years ago

I should probably also adopt good SWE practices when it comes to the boom repo. I'm the sole contributor. I could add you as a contributor as well. Please make a testthat unit test case for any changes you make (ideally one that would have surfaced the current error). Any test cases must be small because they are run by CRAN under some crowded conditions.

Thanks for pitching in!

On Wed, Jan 29, 2020 at 10:39 AM David notifications@github.com wrote:

No problem, Steve. I'll get on it this weekend. Happy to help! Would you prefer that I fork and submit a PR, or just send you replacement code for AddMonthlyAnnualCycle()? (I'm happy to do the former if it doesn't burden, as I'm slowly forcing myself to accept good CS practices) David

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/steve-the-bayesian/BOOM/issues/45?email_source=notifications&email_token=ABMVDVOEVNT67QX7WL24JKLRAHEP5A5CNFSM4KMTGVNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKIJCRY#issuecomment-579899719, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMVDVJK2W25E2AALC2XHI3RAHEP5ANCNFSM4KMTGVNA .

davidtedfordholt commented 4 years ago

I can just set.seed() on the example above for a test case, with a low niter. I'll get in together this weekend. Thanks for all the help you've given me!

steve-the-bayesian commented 4 years ago

Cool. Think about what you might want to 'expect' in the testthat framework.

On Wed, Jan 29, 2020 at 12:08 PM David notifications@github.com wrote:

I can just set.seed() on the example above for a test case, with a low niter. I'll get in together this weekend. Thanks for all the help you've given me!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/steve-the-bayesian/BOOM/issues/45?email_source=notifications&email_token=ABMVDVOE2PIIQZ74PXNOXCLRAHO5FA5CNFSM4KMTGVNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKISFHQ#issuecomment-579936926, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMVDVIHMGOMBLV32USFGODRAHO5FANCNFSM4KMTGVNA .

davidtedfordholt commented 4 years ago

I just want to make sure that my read on this is correct: The AddMonthlyAnnualCycle function will (functionally) create dummy variables for a slower frequency than that of the data. When I feed it half-hourly data, I get coefficients for each calendar day and it hangs when the sample size drops below between 33 and 62 observations. When I feed it daily, daily with odd days dropped, or weekly data, I get coefficients for each calendar month, with hangs below ~33 observations, ~ 60 observations, and ~35 observations, respectively. When I feed it monthly data, it hangs no matter how much data I give it.

It looks like the function is well-built to find monthly patterns in faster-than-monthly but no slower than daily data, given more than one observation within a given month (weekly works), as well as an observation in a preceding and following month. It creates coefficients for the months for which it has observations, and constructs those for months in which it doesn't have data as a single coefficient based on the observations in the preceding and following months.

I can constrain the frequencies to being faster than monthly and slower than or equal to daily, then apply the rule for months. That is relatively easy. I can define constraints for sub-daily data, as well, but that seems problematic, as the behavior (defining a daily average for each day of the year) doesn't seem like the desired behavior.

steve-the-bayesian commented 4 years ago

Hi David, I'm finally going to be able to work on some bsts bugs. This one wound up being an error in the slice sampler used to draw from the truncated gamma distribution. Tracking it down now. Steve

On Mon, Feb 3, 2020 at 8:37 AM David notifications@github.com wrote:

I just want to make sure that my read on this is correct: The AddMonthlyAnnualCycle function will (functionally) create dummy variables for a slower frequency than that of the data. When I feed it half-hourly data, I get coefficients for each calendar day and it hangs when the sample size drops below between 33 and 62 observations. When I feed it daily, daily with odd days dropped, or weekly data, I get coefficients for each calendar month, with hangs below ~33 observations, ~ 60 observations, and ~35 observations, respectively. When I feed it monthly data, it hangs no matter how much data I give it.

It looks like the function is well-built to find monthly patterns in faster-than-monthly but no slower than daily data, given more than one observation within a given month (weekly works), as well as an observation in a preceding and following month. It creates coefficients for the months for which it has observations, and constructs those for months in which it doesn't have data as a single coefficient based on the observations in the preceding and following months.

I can constrain the frequencies to being faster than monthly and slower than or equal to daily, then apply the rule for months. That is relatively easy. I can define constraints for sub-daily data, as well, but that seems problematic, as the behavior (defining a daily average for each day of the year) doesn't seem like the desired behavior.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/steve-the-bayesian/BOOM/issues/45?email_source=notifications&email_token=ABMVDVO3NGTY7PJADAOQ2VDRBBB6HA5CNFSM4KMTGVNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKUQQUI#issuecomment-581503057, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMVDVOPEADISS7VK2BPD3DRBBB6HANCNFSM4KMTGVNA .

steve-the-bayesian commented 4 years ago

fixed on the dev branch of the repo. will be part of the next release. The test case was super helpful, thanks!

On Sun, Mar 15, 2020 at 6:24 PM Steven Scott steve.the.bayesian@gmail.com wrote:

Hi David, I'm finally going to be able to work on some bsts bugs. This one wound up being an error in the slice sampler used to draw from the truncated gamma distribution. Tracking it down now. Steve

On Mon, Feb 3, 2020 at 8:37 AM David notifications@github.com wrote:

I just want to make sure that my read on this is correct: The AddMonthlyAnnualCycle function will (functionally) create dummy variables for a slower frequency than that of the data. When I feed it half-hourly data, I get coefficients for each calendar day and it hangs when the sample size drops below between 33 and 62 observations. When I feed it daily, daily with odd days dropped, or weekly data, I get coefficients for each calendar month, with hangs below ~33 observations, ~ 60 observations, and ~35 observations, respectively. When I feed it monthly data, it hangs no matter how much data I give it.

It looks like the function is well-built to find monthly patterns in faster-than-monthly but no slower than daily data, given more than one observation within a given month (weekly works), as well as an observation in a preceding and following month. It creates coefficients for the months for which it has observations, and constructs those for months in which it doesn't have data as a single coefficient based on the observations in the preceding and following months.

I can constrain the frequencies to being faster than monthly and slower than or equal to daily, then apply the rule for months. That is relatively easy. I can define constraints for sub-daily data, as well, but that seems problematic, as the behavior (defining a daily average for each day of the year) doesn't seem like the desired behavior.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/steve-the-bayesian/BOOM/issues/45?email_source=notifications&email_token=ABMVDVO3NGTY7PJADAOQ2VDRBBB6HA5CNFSM4KMTGVNKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKUQQUI#issuecomment-581503057, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMVDVOPEADISS7VK2BPD3DRBBB6HANCNFSM4KMTGVNA .