hongyuanjia / epluspar

Conduct parametric analysis on EnergyPlus models in R
https://hongyuanjia.github.io/epluspar
Other
9 stars 0 forks source link

Efficient Bayesian Calibration code #1

Open adChong opened 5 years ago

adChong commented 5 years ago

Evaluate improvement in code efficiency for the following

hongyuanjia commented 4 years ago

Test for chiller COP and capacity calibration, using the optimized bcWithPred.stan in bc-stan repo.

Since it seems to never complete using 7-days hourly data, below I change it to 3-days 6-hourly data.

library(eplusr)
library(epScan)

# use an example file from EnergyPlus v8.8 for demonstration here
path_idf <- file.path(eplus_config(8.8)$dir, "ExampleFiles", "RefBldgLargeOfficeNew2004_Chicago.idf")
path_epw <- file.path(eplus_config(8.8)$dir, "WeatherData", "USA_CA_San.Francisco.Intl.AP.724940_TMY3.epw")
bc <- bayes_job(path_idf, path_epw)

# set parameters
bc$input("CoolSys1 Chiller 1", paste("chiller evaporator", c("inlet temperature", "outlet temperature", "mass flow rate")), "hourly")
bc$output("CoolSys1 Chiller 1", "chiller electric power", "hourly")
bc$param(
    `CoolSys1 Chiller 1` = list(reference_cop = c(4, 6), reference_capacity = c(2.5e6, 3.0e6)),
    .names = c("cop1", "cap1"), .num_sim = 5
)

# check sample values
bc$samples()

# run simulations
bc$eplus_run(dir = tempdir(), run_period = list("example", 7, 1, 7, 3), echo = FALSE)

# set simulation data
bc$data_sim("6 hour")

# use the seed model to get field data
## clone the seed model
seed <- bc$seed()$clone()
## remove existing RunPeriod objects
seed$RunPeriod <- NULL
## set run period as the same as in `$eplus_run()`
seed$add(RunPeriod = list("test", 7, 1, 7, 3))
seed$SimulationControl$set(
    `Run Simulation for Sizing Periods` = "No",
    `Run Simulation for Weather File Run Periods` = "Yes"
)
## save the model to tempdir
seed$save(tempfile(fileext = ".idf"))
## run
job <- seed$run(bc$weather(), echo = FALSE)
## get data
fan_power <- epScan:::report_dt_aggregate(job$report_data(name = bc$output()$variable_name, all = TRUE), "6 hour")
fan_power <- eplusr:::report_dt_to_wide(fan_power)

# set field data
bc$data_field(fan_power[, -"Date/Time"])

# check stan input
stan_data <- bc$data_bc()

system.time(res <- bc$stan_run(iter = 500, chains = 2))
hongyuanjia commented 4 years ago

It completes after more than half hour. Definitely not a good results.

Warnings

# Warning messages:
# 1: There were 500 transitions after warmup that exceeded the maximum tree
# depth. Increase max_treedepth above 10. See
# http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
# 2: There were 1 chains where the estimated Bayesian Fraction of Missing I
# nformation was low. See
# http://mc-stan.org/misc/warnings.html#bfmi-low
# 3: Examine the pairs() plot to diagnose sampling problems
# 4: The largest R-hat is 2.99, indicating chains have not mixed.
# Running the chains for more iterations may help. See
# http://mc-stan.org/misc/warnings.html#r-hat
# 5: Bulk Effective Samples Size (ESS) is too low, indicating posterior mea
# ns and medians may be unreliable.
# Running the chains for more iterations may help. See
# http://mc-stan.org/misc/warnings.html#bulk-ess
# 6: Tail Effective Samples Size (ESS) is too low, indicating posterior var
# iances and tail quantiles may be unreliable.
# Running the chains for more iterations may help. See
# http://mc-stan.org/misc/warnings.html#tail-ess

trace plot

rstan::stan_trace(res$fit)

trace

histogram

rstan::stan_hist(res$fit)

hist

hongyuanjia commented 4 years ago
rstan::check_hmc_diagnostics(res$fit)
# Divergences:
# 0 of 500 iterations ended with a divergence.
#
# Tree depth:
# 500 of 500 iterations saturated the maximum tree depth of 10 (100%).
# Try increasing 'max_treedepth' to avoid saturation.
#
# Energy:
# E-BFMI indicated possible pathological behavior:
#   Chain 2: E-BFMI = 0.189
# E-BFMI below 0.2 indicates you may need to reparameterize your model.
hongyuanjia commented 4 years ago

FYI: https://mc-stan.org/misc/warnings.html

Maximum treedepth exceeded

Warnings about hitting the maximum treedepth are not as serious as warnings about divergent transitions. While divergent transitions are a validity concern, hitting the maximum treedepth is an efficiency concern. Configuring the No-U-Turn-Sampler (the variant of HMC used by Stan) involves putting a cap on the depth of the trees that it evaluates during each iteration (for details on this see the Hamiltonian Monte Carlo Sampling chapter in the Stan manual). This is controlled through a maximum depth parameter max_treedepth. When the maximum allowed tree depth is reached it indicates that NUTS is terminating prematurely to avoid excessively long execution time.

Transitions that hit the maximum treedepth are plotted in yellow in the pairs() plot.

Recommendations:

Increase the maximum allowed treedepth. In RStan, max_treedepth is one of the parameters that you can include in the optional control list passed to the stan or sampling functions. For example, to set max_treedepth to 15 (the default is 10) you would do this:

stan(..., control = list(max_treedepth = 15))

BFMI low

You may see a warning that says some number of chains had an estimated Bayesian Fraction of Missing Information (BFMI) that was too low. This implies that the adaptation phase of the Markov Chains did not turn out well and those chains likely did not explore the posterior distribution efficiently. For more details on this diagnostic, see https://arxiv.org/abs/1604.00695.

Recommendations:

Look at the pairs plot to see which primitive parameters are correlated with the energy margin. There should be a negative relationship between lp and energy in the pairs plot, which is not a concern because lp is the logarithm of the posterior kernel rather than a primitive parameter. Reparameterize your model. The primitive parameters that are correlated with the energy__ margin in the pairs plot are a good place to start thinking about reparameterizations. You might try setting a higher value for the iter and / or warmup arguments. By default warmup is half of iter and iter is 2000 by default.

adChong commented 4 years ago

It completes after more than half hour. Definitely not a good results.

Warnings

# Warning messages:
# 1: There were 500 transitions after warmup that exceeded the maximum tree
# depth. Increase max_treedepth above 10. See
# http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
# 2: There were 1 chains where the estimated Bayesian Fraction of Missing I
# nformation was low. See
# http://mc-stan.org/misc/warnings.html#bfmi-low
# 3: Examine the pairs() plot to diagnose sampling problems
# 4: The largest R-hat is 2.99, indicating chains have not mixed.
# Running the chains for more iterations may help. See
# http://mc-stan.org/misc/warnings.html#r-hat
# 5: Bulk Effective Samples Size (ESS) is too low, indicating posterior mea
# ns and medians may be unreliable.
# Running the chains for more iterations may help. See
# http://mc-stan.org/misc/warnings.html#bulk-ess
# 6: Tail Effective Samples Size (ESS) is too low, indicating posterior var
# iances and tail quantiles may be unreliable.
# Running the chains for more iterations may help. See
# http://mc-stan.org/misc/warnings.html#tail-ess

trace plot

rstan::stan_trace(res$fit)

trace

histogram

rstan::stan_hist(res$fit)

hist

I was wondering whether it is due to the way the problem is formulated. The "field data used now" is the simulation output. However, in the Stan code, we model a discrepancy term and an observation error term that does not really exist in reality since we are using simulated data. Perhaps we need to add discrepancy and noise to the "field data"

hongyuanjia commented 4 years ago

So it means that there is no "observation error" term in this case. What kind of noise you would suggest to add to the simulated data?

adChong commented 4 years ago

So it means that there is no "observation error" term in this case. What kind of noise you would suggest to add to the simulated data?

I'm thinking to add Gaussian noise. Discrepancy error is the one that is difficult to simulate. Might want to consider using the test dataset in bc-stan instead as an illustration of Bayesian calibration

hongyuanjia commented 4 years ago

OK. Will use the test dataset in bc-stan to see if things get better.

At the mean time, I added a Gaussian noise with 5% mean and sd for testing.

field_data <- fan_power[, -"Date/Time"][
    , lapply(.SD, function (x) x + rnorm(length(x), mean = mean(x) * 0.05, sd = 0.05 * sd(x)))][
    , lapply(.SD, function (x) {x[x < 0] <- 0; x})
]
adChong commented 4 years ago

OK. Will use the test dataset in bc-stan to see if things get better.

At the mean time, I added a Gaussian noise with 5% mean and sd for testing.

field_data <- fan_power[, -"Date/Time"][
    , lapply(.SD, function (x) x + rnorm(length(x), mean = mean(x) * 0.05, sd = 0.05 * sd(x)))][
    , lapply(.SD, function (x) {x[x < 0] <- 0; x})
]

mean = 0 instead of mean = mean(x) * 0.05?

hongyuanjia commented 4 years ago

Oh yes. You are right.

hongyuanjia commented 4 years ago

Actually, this wired trace plot was caused by a bug in epScan, which did not supply standardized yf to Stan. This bug has been fix via 0770626a3ab4920bfe8e57b5352b589a82c08b49.

library(eplusr)
library(epScan)

# use an example file from EnergyPlus v8.8 for demonstration here
path_idf <- file.path(eplus_config(8.8)$dir, "ExampleFiles", "RefBldgLargeOfficeNew2004_Chicago.idf")
path_epw <- file.path(eplus_config(8.8)$dir, "WeatherData", "USA_CA_San.Francisco.Intl.AP.724940_TMY3.epw")
bc <- bayes_job(path_idf, path_epw)

# set parameters
bc$input("CoolSys1 Chiller 1", paste("chiller evaporator", c("inlet temperature", "outlet temperature", "mass flow rate")), "hourly")
bc$output("CoolSys1 Chiller 1", "chiller electric power", "hourly")
bc$param(
    `CoolSys1 Chiller 1` = list(reference_cop = c(4, 6), reference_capacity = c(2.5e6, 3.0e6)),
    .names = c("cop1", "cap1"), .num_sim = 5
)

# check sample values
bc$samples()

# run simulations
bc$eplus_run(dir = tempdir(), run_period = list("example", 7, 1, 7, 3), echo = FALSE)

# set simulation data
bc$data_sim("6 hour")

# use the seed model to get field data
## clone the seed model
seed <- bc$seed()$clone()
## remove existing RunPeriod objects
seed$RunPeriod <- NULL
## set run period as the same as in `$eplus_run()`
seed$add(RunPeriod = list("test", 7, 1, 7, 3))
seed$SimulationControl$set(
    `Run Simulation for Sizing Periods` = "No",
    `Run Simulation for Weather File Run Periods` = "Yes"
)
## save the model to tempdir
seed$save(tempfile(fileext = ".idf"))
## run
job <- seed$run(bc$weather(), echo = FALSE)
## get data
fan_power <- epScan:::report_dt_aggregate(job$report_data(name = bc$output()$variable_name, all = TRUE), "6 hour")
fan_power <- eplusr:::report_dt_to_wide(fan_power)

# add Gaussian noice
fan_power <- fan_power[, -"Date/Time"][
    , lapply(.SD, function (x) x + rnorm(length(x), sd = 0.05 * sd(x)))][
    , lapply(.SD, function (x) {x[x < 0] <- 0; x})
]

# set field data
data_field <- bc$data_field(fan_power)

# check stan input
stan_data <- bc$data_bc()

# run stan
res <- bc$stan_run(iter = 300, chains = 3)

trace plot

rstan::stan_trace(res$fit)

trace_chiller

histogram

rstan::stan_trace(res$fit)

hist_chiller

prediction

## get yf
yf <- data_field$output[, .SD, .SDcols = -c("case", "Date/Time")][, xf := .I]
data.table::setnames(yf, c("yf", "xf"))

## get predictive inference y_pred and convert back to original scale
y_pred <- data.table::melt.data.table(res$y_pred,
    variable.name = "xf", value.name = "pred", variable.factor = FALSE
)
y_pred[, xf := as.integer(gsub("V", "", xf))]

## combine
plot_data <- yf[y_pred, on = "xf"]

## plot
ggplot(data = plot_data, aes(y = pred, x = xf, group = xf)) +
    geom_boxplot(outlier.size = 0.2) +
    geom_point(data = yf, aes(x = xf, y = yf), color = "#D55E00", size = 0.8)

pred_chiller