kusterlab / prosit

Prosit offers high-quality predicted MS2 spectra for any organism and protease, as well as iRT prediction. If Prosit is helpful for your research, please cite "Gessulat, Schmidt et al. 2019", DOI 10.1038/s41592-019-0426-7
https://www.proteomicsdb.org/prosit/
Apache License 2.0

runtime issues with different docker and nvidia-docker2 versions #4

Closed tobigithub closed 5 years ago

tobigithub commented 5 years ago

Hi, I had to upgrade/downgrade docker and nvidia-docker2 because of missing images and versions for Ubuntu 16.04 and the current CUDA 10. For the current docker versions on https://download.docker.com/linux/ubuntu/dists/xenial/pool/edge/amd64/ there was no matching nvidia-docker container package, so I pinned specific versions and installed Docker 18.03 and NVIDIA Docker 2.0.3.

### get available versions
apt-cache madison nvidia-docker2 nvidia-container-runtime
nvidia-docker2 | 2.0.3+docker18.03.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.12.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.12.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.09.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.09.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.06.2-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.03.2-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker1.13.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker1.12.6-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.2+docker17.12.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages

nvidia-container-runtime | 2.0.0+docker17.12.1-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.12.0-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.09.1-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.09.0-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
nvidia-container-runtime | 2.0.0+docker17.06.2-1 | https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Packages
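As an aside, the manual version matching done above can be automated: the Docker engine release is embedded in each nvidia-docker2 package version after `+docker`. A small sketch (the `madison` string is a two-line excerpt of the `apt-cache madison` output above):

```python
# Sketch: pick nvidia-docker2 pins that match an installed Docker release
# by parsing the `+dockerXX.YY` tag in `apt-cache madison` output.
import re

madison = """\
nvidia-docker2 | 2.0.3+docker18.03.0-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
nvidia-docker2 | 2.0.3+docker17.12.1-1 | https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  Packages
"""

def matching_pins(text, docker_release):
    """Return full package versions whose embedded Docker release matches."""
    pins = []
    for line in text.splitlines():
        # capture e.g. "2.0.3+docker18.03.0-1" and its "18.03" part
        m = re.search(r"\|\s*(\S+\+docker(\d+\.\d+)[^\s|]*)", line)
        if m and m.group(2) == docker_release:
            pins.append(m.group(1))
    return pins

print(matching_pins(madison, "18.03"))  # -> ['2.0.3+docker18.03.0-1']
```

The matching pin can then be passed to `apt-get install nvidia-docker2=<version>`.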

Docker and nvidia-docker2 now run fine:

Docker version 18.03.1-ce, build 9ee9f40

and

NVIDIA Docker: 2.0.3
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false

The installation runs fine; I can also use nvidia-smi inside Docker:

sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
Thu May 30 03:07:56 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0  On |                  N/A |
| 29%   36C    P8     3W / 250W |  10456MiB / 10988MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I have downloaded the prosit1 model with config.yml, model.yml and weight_32_0.10211.hdf5. When I run make server MODEL=/home/xxx/prosit/prosit1/ the server starts and greets me, but uploading a file with curl -F "peptides=@examples/peptidelist.csv" http://127.0.0.1:5000/predict/ breaks it:

sudo make server MODEL=/home/xxx/prosit/prosit1
nvidia-docker build -qf Dockerfile -t prosit .
sha256:d224c2ac898b32662b6265ae7a37dd1872dd98defe4cf86c6fd3acbe7f006c2e
nvidia-docker run -it \
    -v "/home/xxx/prosit/prosit1":/root/model/ \
    -e CUDA_VISIBLE_DEVICES=0 \
    -p 5000:5000 \
    prosit python3 -m prosit.server
Using TensorFlow backend.
/root/prosit/model.py:38: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
/usr/local/lib/python3.5/dist-packages/keras/engine/saving.py:349: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(yaml_string)
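The two YAMLLoadWarnings above are likely harmless here; since PyYAML 5.1, the fix is to pass an explicit loader. A minimal sketch, using an inline stand-in string rather than Prosit's actual config.yml:

```python
# Sketch of silencing PyYAML's YAMLLoadWarning (PyYAML >= 5.1).
# The config text below is a stand-in, not Prosit's real config.yml.
import yaml

config_text = "model_dir: /root/model\nbatch_size: 64\n"

# Deprecated form that triggers the warning:
#   config = yaml.load(config_text)
# Explicit, safe form:
config = yaml.safe_load(config_text)
print(config["batch_size"])  # -> 64
```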
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
[2019-05-30 03:00:20,819] ERROR in app: Exception on /predict/ [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/prosit/server.py", line 28, in predict
    result = prediction.predict(tensor, model, model_config)
  File "/root/prosit/prediction.py", line 14, in predict
    model.compile(optimizer="adam", loss="mse")
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 333, in compile
    sample_weight, mask)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training_utils.py", line 403, in weighted
    score_array = fn(y_true, y_pred)
  File "/usr/local/lib/python3.5/dist-packages/keras/losses.py", line 14, in mean_squared_error
    return K.mean(K.square(y_pred - y_true), axis=-1)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 848, in binary_op_wrapper
    with ops.name_scope(None, op_name, [x, y]) as name:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5770, in __enter__
    g = _get_graph_from_inputs(self._values)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5428, in _get_graph_from_inputs
    _assert_same_graph(original_graph_element, graph_element)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 5364, in _assert_same_graph
    original_item))
ValueError: Tensor("out_target:0", shape=(?, ?), dtype=float32) must be from the same graph as Tensor("out/Reshape:0", shape=(?, ?), dtype=float32).

Not sure how to debug this; could it be a Keras or TF incompatibility with CUDA 10? However, I have Keras and TF running successfully with different versions outside Docker. Tobias

gessulat commented 5 years ago

Thank you for the detailed logs! Our environment uses CUDA 9.2. @tkschmidt and I are at a conference from today until mid-June. We will look into it when we are back. Sorry for the inconvenience! :(

tobigithub commented 5 years ago

Thank you.

gessulat commented 5 years ago

When thinking about this error, `ValueError: Tensor("out_target:0", shape=(?, ?), dtype=float32) must be from the same graph as Tensor("out/Reshape:0", shape=(?, ?), dtype=float32)`, it occurred to me that it may indicate the model did not load properly.

Please try `make jump MODEL=/path/to/the/model/`. You should get an interactive bash with a `~/model/` where you can find the model mounted. It could be that Docker is confused because there is no `/` at the end of the model directory path.

tobigithub commented 5 years ago

> Please try `make jump MODEL=/path/to/the/model/`. You should get an interactive bash with a `~/model/` where you can find the model mounted. It could be that Docker is confused because there is no `/` at the end of the model directory path.

Tried that, and it works: I can see all the files and the HDF5, but loading the server with or without the trailing "/" gives the same error. What is the next step for "jump": writing a little CSV reader to predict the file inside Docker?

tony-jy-zhao commented 5 years ago

I'm coming across the same issue. I was using CUDA 9.0.

gessulat commented 5 years ago

@tony-jy-zhao Please understand that we cannot troubleshoot individual CUDA versions and the errors they cause.

@tobigithub you can use ipython to interactively step through the script that would usually be called by make. No need to write a parser. `/examples/` has example CSV files that we tested and that worked. Specifically, run:

    from prosit import constants
    from prosit import model as model_lib
    model_dir = constants.MODEL_DIR
    global model
    global model_config
    model, model_config = model_lib.load(model_dir, trained=True)

The server script is here: https://github.com/kusterlab/prosit/blob/master/prosit/server.py

gessulat commented 5 years ago

As `make jump` is working, this seems to be unrelated to Docker versions. More likely a duplicate of #2.

tobigithub commented 5 years ago

@gessulat thanks, the jump config works with CUDA 10, but the server breaks. Will open a new issue.