`thin` and `pb_update` combinations result in `tensorflow` errors

Ilia-Kosenkov commented 4 years ago

This issues is, apprently, related to #241 (also #284). I believe, the root cause of #241 is that (according to docs) when pb_update <= thin, pb_update is selected to be thin + 1. This is extremely poor decision, because, as human beings, users tend to select thin divisable by e.g. 5, 10, therefore pb_udpate + 1 can be a prime number, and not a divisor of n_sample or warmup in the mcmc method.

Now, different values of pb_update cause greta to fail at the end of sampling with some python TensorArray-zero-size error. I ran some tests for different combinations of pb_update and thin to check, what values are causing those crashes.

The best thing is that mcmc(verbose = FALSE) appears to solve the problem, so greta crashes are actually caused by UI behaviour. I was unable to locate the source of the problem, but I would say it happens at the last sampling step (maybe, here). Changes, introduced in #284, might be causing the problem.

In the following table, look at #⁠4 and #⁠7, which update by 10x + 1. This appears to almost always crash, and it is exactly what happens if pb_update <= thin.

reticulate::use_condaenv("r-tensorflow-gpu", required = TRUE)
library(greta, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(vctrs, warn.conflicts = FALSE)

run_model <- function(n_warm, n_spin, n_thin, n_pb_update) {
    x <- normal(0, 1)
    model <- model(x)
    result <-
        suppressWarnings(
            suppressMessages(
                mcmc(model, n_samples = n_spin, warmup = n_warm,
                    thin = n_thin, pb_update = n_pb_update,
                    chains = 4, n_cores = 8)))

    summary(result)$statistics
}

wrap_call <- function(n_warm, n_spin, n_thin, n_pb_update) {
    pmap(list(n_warm, n_spin, n_thin, n_pb_update),
          function(n_w, n_s, n_t, n_pb) {
              result <- tryCatch(run_model(n_w, n_s, n_t, n_pb),
                                 error = function(e) return(e))
              if (inherits(result, "error"))
                  return(list(Warmup = n_w, Spin = n_s, Thin = n_t, Pb = n_pb,
                              Mean = NA_real_, SD = NA_real_, Status = FALSE,
                              Message = result$message))
              else
                  return(list(Warmup = n_w, Spin = n_s, Thin = n_t, Pb = n_pb,
                              Mean = result["Mean"], SD = result["SD"], Status = TRUE,
                              Message = "OK"))
        }) %>% bind_rows
}

wrap_call(
        n_warm = 1000L,
        n_spin = 1000L,
        n_thin = vec_c(1L, 25L, 30L, 50L, 10L, 10L, 10L, 10L),
        n_pb_update = vec_c(50L, 50L, 50L, 50L, 20L, 23L, 31L, 99L)
    ) ->> temp

print(temp)
#> # A tibble: 8 x 8
#>   Warmup  Spin  Thin    Pb     Mean     SD Status Message                  
#>    <int> <int> <int> <int>    <dbl>  <dbl> <lgl>  <chr>                    
#> 1   1000  1000     1    50  0.00881  1.00  TRUE   OK                       
#> 2   1000  1000    25    50 -0.122    1.11  TRUE   OK                       
#> 3   1000  1000    30    50 -0.245    1.05  TRUE   OK                       
#> 4   1000  1000    50    50 NA       NA     FALSE  "greta hit a tensorflow ~
#> 5   1000  1000    10    20  0.0183   0.987 TRUE   OK                       
#> 6   1000  1000    10    23 -0.0226   0.971 TRUE   OK                       
#> 7   1000  1000    10    31 NA       NA     FALSE  "greta hit a tensorflow ~
#> 8   1000  1000    10    99  0.0791   0.942 TRUE   OK

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> .....

reticulate::py_config()
#> python:         C:\PROGRA~3\MINICO~1\envs\R-TENS~1\python.exe
#> libpython:      C:/PROGRA~3/MINICO~1/envs/R-TENS~1/python37.dll
#> pythonhome:     C:\PROGRA~3\MINICO~1\envs\R-TENS~1
#> version:        3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
#> Architecture:   64bit
#> numpy:          C:\PROGRA~3\MINICO~1\envs\R-TENS~1\lib\site-packages\numpy
#> numpy_version:  1.17.0
#> 
#> python versions found: 
#>  C:\PROGRA~3\MINICO~1\envs\R-TENS~1\python.exe
#>  C:\PROGRA~3\MINICO~1\python.exe
#>  C:\ProgramData\Miniconda3\python.exe
#>  C:\ProgramData\Miniconda3\envs\r-tensorflow-gpu\python.exe

reticulate::use_condaenv("r-tensorflow-gpu", required = TRUE)
tensorflow::tf_config()
#> TensorFlow v1.14.0 (C:\PROGRA~3\MINICO~1\envs\R-TENS~1\lib\site-packages\tensorflow\__init__.p)
#> Python v3.7 (C:\ProgramData\Miniconda3\envs\r-tensorflow-gpu\python.exe)
tensorflow::tf_gpu_configured()
#> TensorFlow built with CUDA:  TRUE 
#> GPU device name:  /device:GPU:0
#> [1] TRUE

packageDescription("greta")
#> Package: greta
#> Type: Package
#> Title: Simple and Scalable Statistical Modelling in R
#> Version: 0.3.1.9004
#> Date: 2019-08-25
#> ... 
#> Encoding: UTF-8
#> LazyData: true
#> Depends: R (>= 3.0)
#> Collate: 'package.R' 'utils.R' 'tf_functions.R' 'overloaded.R'
#> .....
#> Imports: R6, tensorflow (>= 1.13.0), reticulate, progress (>=
#>        1.2.0), future, coda, methods
#> Suggests: knitr, rmarkdown, DiagrammeR, bayesplot, lattice,
#>        testthat, mvtnorm, MCMCpack, rmutil, extraDistr, truncdist,
#>        tidyverse, fields, MASS, abind, spelling
#> VignetteBuilder: knitr
#> RoxygenNote: 6.1.1
#> Language: en-GB
#> RemoteType: github
#> RemoteHost: api.github.com
#> RemoteRepo: greta
#> RemoteUsername: greta-dev
#> RemoteRef: master
#> RemoteSha: 4d08524a7e13bc759a1c66fd1657455803c2ebf5
#> GithubRepo: greta
#> GithubUsername: greta-dev
#> GithubRef: master
#> GithubSHA1: 4d08524a7e13bc759a1c66fd1657455803c2ebf5
#> NeedsCompilation: no
#> .....
#> Built: R 3.6.1; ; 2019-09-06 12:11:50 UTC; windows

^{Created on 2019-09-26 by the reprex package (v0.3.0)}

greta hit a tensorflow error:

Error in py_call_impl(callable, dots$args, dots$keywords): UnimplementedError: 2 root error(s) found.
  (0) Unimplemented: TensorArray has size zero, but element shape [?] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
     [[node mcmc_sample_chain/trace_scan/TensorArrayStack_2/TensorArrayGatherV3 (defined at \tensorflow_probability\python\mcmc\internal\util.py:372) ]]
     [[mcmc_sample_chain/trace_scan/while/smart_for_loop/while/Exit_5/_75]]
  (1) Unimplemented: TensorArray has size zero, but element shape [?] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
     [[node mcmc_sample_chain/trace_scan/TensorArrayStack_2/TensorArrayGatherV3 (defined at \tensorflow_probability\python\mcmc\internal\util.py:372) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node mcmc_sample_chain/trace_scan/TensorArrayStack_2/TensorArrayGatherV3:
 mcmc_sample_chain/trace_scan/while/Exit_13 (defined at \tensorflow_probability\python\mcmc\internal\util.py:370)   
 mcmc_sample_chain/trace_scan/TensorArray_3 (defined at \tensorflow_probability\python\mcmc\internal\util.py:355)

Input Source operations connected to node mcmc_sample_chain/trace_scan/TensorArrayStack_2/TensorArrayGatherV3:
 mcmc_sample_chain/trace_scan/while/Exit_13 (defined at \tensorflow_probability\python\mcmc\internal\util.py:370)   
 mcmc_sample_chain/trace_scan/TensorArray_3 (defined at \tensorflow_probability\python\mcmc\internal\util.py:355)

Original stack trace for 'mcmc_sample_chain/trace_scan/TensorArrayStack_2/TensorArrayGatherV3':
  File "\tensorflow_probability\python\mcmc\sample.py", line 361, in sample_chain
    parallel_iterations=parallel_iterations)
  File "\tensorflow_probability\python\mcmc\internal\util.py", line 372, in trace_scan
    stacked_trace = tf.nest.map_structure(lambda x: x.stack(), trace_arrays)
  File "\tensorflow\python\util\nest.py", line 515, in map_structure
    structure[0], [func(*x) for x in entries],
  File "\tensorflow\python\util\nest.py", line 515, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "\tensorflow_probability\python\mcmc\internal\util.py", line 372, in <lambda>
    stacked_trace = tf.nest.map_structure(lambda x: x.stack(), trace_arrays)
  File "\tensorflow\python\ops\tensor_array_ops.py", line 1205, in stack
    return self._implementation.stack(name=name)
  File "\tensorflow\python\ops\tensor_array_ops.py", line 309, in stack
    return self.gather(math_ops.range(0, self.size()), name=name)
  File "\tensorflow\python\ops\tensor_array_ops.py", line 323, in gather
    element_shape=element_shape)
  File "\tensorflow\python\ops\gen_data_flow_ops.py", line 7167, in tensor_array_gather_v3
    element_shape=element_shape, name=name)
  File "\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Detailed traceback: 
  File "C:\PROGRA~3\MINICO~1\envs\R-TENS~1\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "C:\PROGRA~3\MINICO~1\envs\R-TENS~1\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\PROGRA~3\MINICO~1\envs\R-TENS~1\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "C:\PROGRA~3\MINICO~1\envs\R-TENS~1\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)

goldingn commented 4 years ago

Thanks @Ilia-Kosenkov. Yeah, this is a still an issue, and the patch (to increment thin) was a bit of a hack. I'm planning a bigger edit to the samplers soon that will do thinning and sampler tuning on the TensorFlow side, rather than in R. That should resolve this issue and make the warmup period faster too.

goldingn commented 4 years ago

It's worth pointing out that if you are using HMC, it's very rarely worth thinning. Instead, I'd recommend increasing the number of leapfrog steps in the HMC sampler, ie. something like mcmc(model, sampler = hmc(Lmin = 10, Lmax = 15)). Increasing the number of leapfrog steps has the same proportional impact on run time as increasing the number of samples, and thinning by the same ratio (ie. doubling the number of leapfrog steps takes as long to evaluate as doubling the number of samples and thinning by 2), but with more leapfrog steps, the sampler is better able to explore the posterior. So you'll generally do better changing that parameter rather than the thinning.

Automatic tuning of the number of leapfrog steps should be in the minor release, as should the new NUTS implementation, which adapts the number of leapfrog steps during sampling.

Ilia-Kosenkov commented 4 years ago

Thanks for the advice, @goldingn! I understand that thinning is not worth tuning, but you can still get a crash if you tune pb_update (I think), that's why I looked into what's happening at different thin/pb_update. Not sure how different warmup phase implementation is from the sampling, but it is indeed slower, maybe by a factor of 2-3, in my setups, and appears to be less CPU-efficient when running multiple chains/multiple cores. Any improvements to that will be greatly appreciated.

goldingn commented 4 years ago

Right. Warmup is currently slow because we only do a few (1-3) iterations at a time in tensorflow before having to pass things back to R to tune the parameters. That incurs some overhead each time, which adds up, especially when it's a small model. TF Probability now has a tuning interface, so when we switch over to using that we'll save a lot of overhead.

I think I might even be able to iterate the progress bar without leaving tensorflow too, which would speed up sampling a little too.

greta-dev / greta

`thin` and `pb_update` combinations result in `tensorflow` errors #318