kusterlab / prosit

Prosit offers high-quality predicted MS2 spectra for any organism and protease, as well as iRT prediction. If Prosit is helpful for your research, please cite "Gessulat, Schmidt et al. 2019", DOI 10.1038/s41592-019-0426-7
https://www.proteomicsdb.org/prosit/
Apache License 2.0

runtime issues with different docker and nvidia-docker2 versions #4

Closed tobigithub closed 5 years ago

tobigithub commented 5 years ago

Hi, I had to upgrade/downgrade docker and nvidia-docker2 because of missing images and versions for Ubuntu 16.04 and the current CUDA 10. For the current docker versions on https://download.docker.com/linux/ubuntu/dists/xenial/pool/edge/amd64/ there was no matching nvidia-docker container package, so I pinned specific versions and installed Docker 18.03 and NVIDIA Docker 2.0.3.

### get available versions
apt-cache madison nvidia-docker2 nvidia-container-runtime
nvidia-docker2 | 2.0.3+docker18.03.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.12.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.12.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.09.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.09.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.06.2-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.03.2-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker1.13.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker1.12.6-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.2+docker17.12.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages

nvidia-container-runtime | 2.0.0+docker17.12.1-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.12.0-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.09.1-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.09.0-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.06.2-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
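As an aside, the manual version matching done above can be automated: the Docker engine release is embedded in each nvidia-docker2 package version after `+docker`. A small sketch (the `madison` string is a two-line excerpt of the `apt-cache madison` output above):

```python
# Sketch: pick nvidia-docker2 pins that match an installed Docker release
# by parsing the `+dockerXX.YY` tag in `apt-cache madison` output.
import re

madison = """\
nvidia-docker2 | 2.0.3+docker18.03.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.12.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
"""

def matching_pins(text, docker_release):
    """Return full package versions whose embedded Docker release matches."""
    pins = []
    for line in text.splitlines():
        # capture e.g. "2.0.3+docker18.03.0-1" and its "18.03" part
        m = re.search(r"\|\s*(\S+\+docker(\d+\.\d+)[^\s|]*)", line)
        if m and m.group(2) == docker_release:
            pins.append(m.group(1))
    return pins

print(matching_pins(madison, "18.03"))  # -> ['2.0.3+docker18.03.0-1']
```

The matching pin can then be passed to `apt-get install nvidia-docker2=<version>`.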

Docker and nvidia-docker2 now run fine:

Docker version 18.03.1-ce, build 9ee9f40

and

NVIDIA Docker: 2.0.3
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false

The installation runs fine; I can also use nvidia-smi inside Docker:

sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
Thu May 30 03:07:56 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0  On |                  N/A |
| 29%   36C    P8     3W / 250W |  10456MiB / 10988MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I have downloaded the prosit1 model with config.yml, model.yml and weight_32_0.10211.hdf5. When I run make server MODEL=/home/xxx/prosit/prosit1/ the server starts and greets me, but uploading a file with curl -F "peptides=@examples/peptidelist.csv" http://127.0.0.1:5000/predict/ breaks it:

sudo make server MODEL=/home/xxx/prosit/prosit1
nvidia-docker build -qf Dockerfile -t prosit .
sha256:d224c2ac898b32662b6265ae7a37dd1872dd98defe4cf86c6fd3acbe7f006c2e
nvidia-docker run -it \
    -v "/home/xxx/prosit/prosit1":/root/model/ \
    -e CUDA_VISIBLE_DEVICES=0 \
    -p 5000:5000 \
    prosit python3 -m prosit.server
Using TensorFlow backend.
/root/prosit/model.py:38: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
/usr/local/lib/python3.5/dist-packages/keras/engine/saving.py:349: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(yaml_string)
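The two YAMLLoadWarnings above are likely harmless here; since PyYAML 5.1, the fix is to pass an explicit loader. A minimal sketch, using an inline stand-in string rather than Prosit's actual config.yml:

```python
# Sketch of silencing PyYAML's YAMLLoadWarning (PyYAML >= 5.1).
# The config text below is a stand-in, not Prosit's real config.yml.
import yaml

config_text = "model_dir: /root/model\nbatch_size: 64\n"

# Deprecated form that triggers the warning:
#   config = yaml.load(config_text)
# Explicit, safe form:
config = yaml.safe_load(config_text)
print(config["batch_size"])  # -> 64
```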
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
[2019-05-30 03:00:20,819] ERROR in app: Exception on /predict/ [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/prosit/server.py", line 28, in predict
    result = prediction.predict(tensor, model, model_config)
  File "/root/prosit/prediction.py", line 14, in predict
    model.compile(optimizer="adam", loss="mse")
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 333, in compile
    sample_weight, mask)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training_utils.py", line 403, in weighted
    score_array = fn(y_true, y_pred)
  File "/usr/local/lib/python3.5/dist-packages/keras/losses.py", line 14, in mean_squared_error
    return K.mean(K.square(y_pred - y_true), axis=-1)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 848, in binary_op_wrapper
    with ops.name_scope(None, op_name, [x, y]) as name:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5770, in __enter__
    g = _get_graph_from_inputs(self._values)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5428, in _get_graph_from_inputs
    _assert_same_graph(original_graph_element, graph_element)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5364, in _assert_same_graph
    original_item))
ValueError: Tensor("out_target:0", shape=(?, ?), dtype=float32) must be from the same graph as Tensor("out/Reshape:0", shape=(?, ?), dtype=float32).

Not sure how to debug this; could it be a Keras or TF incompatibility with CUDA 10? However, I have Keras and TF running successfully with different versions outside Docker. Tobias

gessulat commented 5 years ago

Thank you for the detailed logs! Our environment uses CUDA 9.2. @tkschmidt and I are at a conference from today until mid-June. We will look into it when we are back. Sorry for the inconvenience! :(

tobigithub commented 5 years ago

Thank you.

gessulat commented 5 years ago

When thinking about this error, `ValueError: Tensor("out_target:0", shape=(?, ?), dtype=float32) must be from the same graph as Tensor("out/Reshape:0", shape=(?, ?), dtype=float32)`, it occurred to me that it may indicate the model did not load properly.

Please try `make jump MODEL=/path/to/the/model/`. You should get an interactive bash with a `~/model/` where you can find the model mounted. It could be that Docker is confused because there is no `/` at the end of the model directory path.

tobigithub commented 5 years ago

> Please try `make jump MODEL=/path/to/the/model/`. You should get an interactive bash with a `~/model/` where you can find the model mounted. It could be that Docker is confused because there is no `/` at the end of the model directory path.

Tried that, and it works: I can see all the files and the HDF5, but loading the server with or without the trailing "/" gives the same error. What is the next step for "jump": writing a little CSV reader to predict the file inside Docker?

tony-jy-zhao commented 5 years ago

I'm coming across the same issue. I was using CUDA 9.0.

gessulat commented 5 years ago

@tony-jy-zhao Please understand that we cannot troubleshoot individual CUDA versions and the errors they cause.

@tobigithub you can use ipython to interactively step through the script that would usually be called by make. No need to write a parser. `/examples/` has example CSV files that we tested and that worked. Specifically, run:

    from prosit import constants
    from prosit import model as model_lib
    model_dir = constants.MODEL_DIR
    global model
    global model_config
    model, model_config = model_lib.load(model_dir, trained=True)

The server script is here: https://github.com/kusterlab/prosit/blob/master/prosit/server.py

gessulat commented 5 years ago

As `make jump` is working, this seems to be unrelated to Docker versions. More likely a duplicate of #2.

tobigithub commented 5 years ago

@gessulat thanks, the jump config works with CUDA 10, but the server breaks. Will open a new issue.