rocker-org / ml

experimental machine learning container
GNU General Public License v2.0

greta is not working #18

Open ignacio82 opened 5 years ago

ignacio82 commented 5 years ago

I was trying to play with greta using this container but I'm getting an error. This is what I am doing:

nvidia-docker run -it rocker/ml-gpu:latest bash

root@7dc3309926d4:/# nvidia-smi
Fri Apr 19 12:25:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
| 45%   42C    P0    27W / 120W |   1382MiB /  6076MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

root@7dc3309926d4:/# R

R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> x <- iris$Petal.Length
> y <- iris$Sepal.Length
> library(greta)

Attaching package: 'greta'

The following objects are masked from 'package:stats':

    binomial, poisson

The following objects are masked from 'package:base':

    %*%, backsolve, beta, colMeans, colSums, diag, forwardsolve, gamma,
    rowMeans, rowSums, sweep

> int <- normal(0, 5)
> coef <- normal(0, 3)
> sd <- lognormal(0, 3)
> mean <- int + coef * x
> distribution(y) <- normal(mean, sd)
> m <- model(int, coef, sd)
> draws <- mcmc(m, n_samples = 1000)

/usr/local/lib/python3.5/dist-packages/numpy/lib/type_check.py:546: DeprecationWarning: np.asscalar(a) is deprecated since NumPy v1.16, use a.item() instead
  'a.item() instead', DeprecationWarning, stacklevel=1)
[the same DeprecationWarning is printed 11 more times]
Error: greta hit a tensorflow error:

Error in py_call_impl(callable, dots$args, dots$keywords): NotFoundError: ./libdevice.compute_30.10.bc not found
     [[{{node cluster_0_1/xla_compile}} = _XlaCompile[Nresources=0, Targs=[DT_DOUBLE, DT_DOUBLE, DT_DOUBLE, DT_DOUBLE, DT_DOUBLE, DT_DOUBLE, DT_DOUBLE, DT_DOUBLE, DT_DOUBLE], Tconstants=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], function=cluster_0[_XlaCompiledKernel=true, _XlaNumConstantArgs=6, _XlaNumResourceArgs=0], _device="/job:localhost/replica:0/task:0/device:GPU:0"](Const, Tile_3/multiples/1, Reshape/shape, strided_slice_3/stack, strided_slice_3/stack_1, Sum_1/reduction_indices, _arg_Placeholder_0_0/_3, _arg_Placeholder_1_0_1/_5, _arg_Placeholder_2_0_2/_7, _arg_Placeholder_3_0_3/_9, _arg_Placeholder_4_0_4/_11, _arg_Placeholder_5_0_5/_13, _arg_Placeholder_6_0_6/_15, _arg_Placeholder_7_0_7/_17, _arg_Placeholder_8_0_8/_19)]]
     [[{{node cluster_0_1/xla_run/_1}} = _Recv[client_terminated=false, recv_device="/job:localh

cboettig commented 5 years ago

thanks for the report, I'll take a look.

cboettig commented 5 years ago

hmm... we can solve the errors such as NotFoundError: ./libdevice.compute_30.10.bc not found by copying /usr/local/cuda-9.0 from the rocker/cuda-dev image, but then I seem to be running up against https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc#L485-L489 instead.

It's not exactly clear to me how to cherry-pick ptxas 9.2.88, though.

Bumping all of CUDA to 9.2.88 seems to break tensorflow, as it looks like the binaries installed by pip (for 0.12.0) are built only for CUDA 9.0.

A second error I encounter, e.g. via the virtualenv install route or when building on tensorflow/tensorflow:1.13.1-gpu-py3, is ValueError: Tensor conversion requested dtype int64 for Tensor with dtype int32. Longer trace below.

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Tensor conversion requested dtype int64 for Tensor with dtype int32: 'Tensor("Placeholder_13:0", dtype=int32)'

Detailed traceback: 
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_probability/python/mcmc/sample.py", line 216, in sample_chain
    name="num_steps_between_results")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1039, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1097, in convert_to_tensor_v2
    as_ref=False)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 977, in _TensorTensorConversionFunction
    (dtype.name, t.dtype.name, str(t)))

still digging...

goldingn commented 5 years ago

I think that last error just means you have the CRAN release of greta, but need the current GitHub version.

Something changed in the most recent Tensorflow Probability release, and the greta-side patch hasn't yet made its way to CRAN.

cboettig commented 5 years ago

@goldingn thanks Nick, that's the ticket!

@ignacio82 Once rocker/tensorflow-gpu builds (probably by tomorrow, or just docker build locally), you should be able to run remotes::install_github("greta-dev/greta"), and then GPU-accelerated greta should work.

Thanks again for the bug report. I hadn't gotten around to testing greta; it's still somewhat early days for these ML images.
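
For reference, the install step mentioned above could be sketched in R roughly as follows. This is a minimal sketch rather than a tested recipe for these images: the helper name install_dev_greta is hypothetical (it does not come from the thread), and it assumes the container has network access and a CRAN mirror configured so that remotes can be installed if it is missing.

```r
# Hypothetical helper (the name is not from the thread): install the
# development version of greta from GitHub and report what got installed.
install_dev_greta <- function(repo = "greta-dev/greta") {
  # Assumption: 'remotes' may be missing from the image, so install it on demand.
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github(repo)
  # Read the version of the freshly installed package from the library.
  utils::packageVersion("greta")
}

# Not run automatically. Call it inside the container, then restart R
# before library(greta) so the newly installed version is the one loaded.
# install_dev_greta()
```

Restarting R after the install matters because an already-loaded greta namespace would otherwise stay at the old version for the rest of the session.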

ignacio82 commented 5 years ago

Thanks! A couple of questions:

  1. You said to use rocker/tensorflow-gpu, but I think I should use rocker/ml-gpu:latest. With the former I got a message saying that I needed to install TensorFlow Probability. Is that right, or should I use rocker/tensorflow-gpu?
  2. Although greta seems to be working, I am getting the following message:
/usr/local/lib/python3.5/dist-packages/numpy/lib/type_check.py:546: DeprecationWarning: np.asscalar(a) is deprecated since NumPy v1.16, use a.item() instead
  'a.item() instead', DeprecationWarning, stacklevel=1)

Is this a problem that the greta developers need to fix?

cboettig commented 5 years ago

@ignacio82 Right, I moved tensorflow-probability into the tensorflow image now since it seemed more logical to keep those together, but the latest rocker/tensorflow-gpu instance hasn't finished building. We're still figuring out the right organizational modularity.

Re the DeprecationWarning: yeah, I see that too. @goldingn can probably give us more insight on that, but I don't think it's much of a problem.
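
If the repeated warning is only noise, one possible way to hide it is to register a Python warning filter through reticulate before sampling. This is a sketch under assumptions, not something confirmed in this thread: silence_py_deprecations is a hypothetical name, the filter hides every Python DeprecationWarning (not just the NumPy one), and it only affects the Python session that reticulate is bound to.

```r
# Hypothetical helper: ask the embedded Python session to ignore
# DeprecationWarning. Returns TRUE if the filter was installed, FALSE otherwise.
silence_py_deprecations <- function() {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(invisible(FALSE))
  }
  # Start (or reuse) the embedded Python session; give up quietly if none exists.
  if (!reticulate::py_available(initialize = TRUE)) {
    return(invisible(FALSE))
  }
  # Standard-library call: warnings.filterwarnings(action, category=...)
  reticulate::py_run_string(
    "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)"
  )
  invisible(TRUE)
}
```

It would need to be called after the Python environment has been selected (for example with use_virtualenv()) and before mcmc(), since initializing Python fixes which interpreter reticulate uses for the session.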

pbhogale commented 4 years ago

Not sure whether this ought to be a separate issue, but I get a strange error when trying greta with the ml-gpu container.

remotes::install_github("greta-dev/greta")
rm(list=ls())
library(reticulate)
py_discover_config()
use_python("/opt/virtualenvs/r-tensorflow/bin/python")
use_virtualenv("/opt/virtualenvs/r-tensorflow/", required = TRUE)
library(greta)
library(DiagrammeR)
library(bayesplot)
library(tidyverse)

length_of_data <- 100
sd_eps <- pi^exp(1)
intercept <- -5.0
slope <- pi
x <- seq(-10*pi, 10*pi, length.out = length_of_data)
y <- intercept + slope*x + rnorm(n = length_of_data, mean = 0, sd = sd_eps)
data <- tibble(y = y, x = x)

intercept_p <- uniform(-10, 10)
sd_eps_p <- uniform(0, 50)
slope_p <- uniform(0, 10)

mean_y <- intercept_p+slope_p*x
distribution(y) <- normal(mean_y, sd_eps_p)
our_model <- model(intercept_p, slope_p, sd_eps_p)

num_samples <- 1000
param_draws <- mcmc(our_model, n_samples = num_samples, warmup = num_samples / 10)

That gives the error:

Error in py_call_impl(callable, dots$args, dots$keywords) :
 ValueError: Tensor conversion requested dtype int64 for Tensor with dtype int32: 
'Tensor("Placeholder_13:0", dtype=int32)'

cboettig commented 4 years ago

So greta requires pretty careful coordination between versions of CUDA, tensorflow, and greta itself. I think this particular error is due to using the most recent dev version of greta with an older tensorflow (see https://github.com/greta-dev/greta/issues/248).

We're still exploring the best way to help users triangulate these versions. (The current tensorflow-gpu image is, IIRC, still on CUDA 9.0, which is too old for tensorflow > 1.13, which is required for greta > 0.3.0 or so? Don't quote me on those versions.)

Can you try testing on rocker/ml:cuda-10.0? (Note that it should already have greta installed).
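
When reporting back, it may help to include the versions actually present in the container, since the errors above depend on how they line up. The following is a minimal sketch of such a report. The helper names r_pkg_version and report_py_versions are hypothetical, and the sketch assumes that tensorflow and tensorflow_probability expose a `__version__` attribute in the Python environment reticulate is bound to. It does not check CUDA itself.

```r
# Version of an installed R package as a string, or NA if it is not installed.
r_pkg_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# R side: a named character vector, one entry per package.
r_versions <- vapply(
  c("greta", "tensorflow", "reticulate"),
  r_pkg_version,
  character(1)
)
print(r_versions)

# Python side: print the versions of the modules greta relies on.
report_py_versions <- function(mods = c("tensorflow", "tensorflow_probability")) {
  if (!requireNamespace("reticulate", quietly = TRUE) ||
      !reticulate::py_available(initialize = TRUE)) {
    cat("no Python session available via reticulate\n")
    return(invisible(NULL))
  }
  for (mod in mods) {
    if (reticulate::py_module_available(mod)) {
      cat(mod, reticulate::import(mod)$`__version__`, "\n")
    } else {
      cat(mod, "not found\n")
    }
  }
  invisible(NULL)
}

# Run inside the container, after use_virtualenv() if one is needed:
# report_py_versions()
```

The CUDA and driver versions can be read from nvidia-smi, as in the first post of this thread.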