nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0

examples/TfFeedForward.py does not run correctly #179

Open easyfan327 opened 4 years ago

easyfan327 commented 4 years ago
  1. Add TfFeedForward.py as a model.
  2. Start a new train job.
  3. The new train job is labeled STARTED but never proceeds to RUNNING.

P.S. I ran bash scripts/setup_node.sh to enable GPU support.

easyfan327 commented 4 years ago

Logs from the worker, for reference:

    Traceback (most recent call last):
      File "/root/rafiki/utils/service.py", line 50, in run_worker
        start_worker(service_id, service_type, container_id)
      File "scripts/start_worker.py", line 40, in start_worker
        worker.start()
      File "/root/rafiki/worker/train.py", line 56, in start
        self._monitor.pull_job_info()
      File "/root/rafiki/worker/train.py", line 257, in pull_job_info
        self.model_class = load_model_class(model.model_file_bytes, model.model_class)
      File "/root/rafiki/model/utils.py", line 51, in load_model_class
        raise InvalidModelClassError(e)
    rafiki.model.utils.InvalidModelClassError: Traceback (most recent call last):
      File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
        from tensorflow.python.pywrap_tensorflow_internal import *
      File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
        _pywrap_tensorflow_internal = swig_import_helper()
      File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
        _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
      File "/usr/local/envs/rafiki/lib/python3.6/imp.py", line 243, in load_module
        return load_dynamic(name, filename, file)
      File "/usr/local/envs/rafiki/lib/python3.6/imp.py", line 343, in load_dynamic
        return _load(spec)
    ImportError: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short
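The final ImportError ("file too short") usually means the dynamic loader found a file at that path that is empty or truncated, so it may be worth checking the library inside the worker container before anything else. The following is only a minimal diagnostic sketch, not part of Rafiki: `check_shared_library` is a hypothetical helper name, and the path is simply the one from the traceback above.

```python
import ctypes
import os


def check_shared_library(path):
    """Return a short diagnosis string for the shared library at `path`.

    Hypothetical helper for illustration; not part of Rafiki or TensorFlow.
    """
    if os.path.islink(path) and not os.path.exists(path):
        return "broken symlink"
    if not os.path.exists(path):
        return "missing"
    # Follow symlinks so the size reported is that of the real file.
    real_path = os.path.realpath(path)
    if os.path.getsize(real_path) == 0:
        return "empty file (0 bytes)"
    try:
        # Ask the dynamic loader to open the file, as an import would.
        ctypes.CDLL(real_path)
    except OSError as exc:
        return f"not loadable: {exc}"
    return "ok"


if __name__ == "__main__":
    # Path taken from the traceback; run this inside the worker container.
    print(check_shared_library("/usr/lib/x86_64-linux-gnu/libcuda.so.1"))
```

If this reports an empty or unloadable file, the problem most likely lies in how the GPU driver libraries are exposed to the container rather than in TfFeedForward.py itself.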

pinpom commented 4 years ago

Hi @easyfan327, since Rafiki has been upgraded to version 0.2.0, I recommend installing the latest version from nginyc/rafiki/master. Before installing the new version, please remember to delete any old Rafiki instances (including Docker images and containers) remaining on your machine.

When running Rafiki on GPU, also remember to add 'GPU_COUNT': 1 to budget when you create a train job (refer to the latest docs: https://nginyc.github.io/rafiki/docs/0.2.0/src/python/rafiki.client.html#rafiki.client.Client.create_train_job). For example:

    client.create_train_job(
        app='fashion_mnist_app',
        task='IMAGE_CLASSIFICATION',
        train_dataset_id='70efcbf6-b576-44d0-83b7-fd93e8ee03d3',
        val_dataset_id='9c28d97a-3d08-4903-b217-1169a13e5d6a',
        budget={
            'MODEL_TRIAL_COUNT': 5,
            'GPU_COUNT': 1
        },
        models=['b67f3017-8f37-45cc-a7c5-a3f8912ac72e']
    )

I have no problem running this model. For your reference, I have attached the code. Please try again and let me know if there are any issues.