sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

GRPC error on worker node if you sequentially submit multiple training commands #1630

Open xiaoyili opened 4 years ago

xiaoyili commented 4 years ago

2020-01-12 19:54:43.426092: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-01-12 19:54:43.432883: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-01-12 19:54:43.433201: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a60590 executing computations on platform Host. Devices:
2020-01-12 19:54:43.433221: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/elasticdl/python/worker/main.py", line 39, in <module>
    main()
  File "/elasticdl/python/worker/main.py", line 34, in main
    worker = Worker(args, channel=master_channel, ps_channels=ps_channels)
  File "/elasticdl/python/worker/worker.py", line 116, in __init__
    self._init_from_args(args)
  File "/elasticdl/python/worker/worker.py", line 156, in _init_from_args
    self.set_model(model_inst)
  File "/elasticdl/python/worker/worker.py", line 191, in set_model
    self._init_embeddings()
  File "/elasticdl/python/worker/worker.py", line 254, in _init_embeddings
    self.report_embedding_info()
  File "/elasticdl/python/worker/worker.py", line 405, in report_embedding_info
    self._ps_stubs[ps_id].push_embedding_info(model)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 824, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1578858883.548828938","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1578858883.548825519","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"

skydoorkai commented 4 years ago

Can you give more detailed repro steps? How do you sequentially submit multiple training commands?

xiaoyili commented 4 years ago

Let's take the piece from scripts/client_test.sh as an example:

elasticdl train \
  --image_base=elasticdl:ci \
  --model_zoo=model_zoo \
  --model_def=deepfm_functional_api.deepfm_functional_api.custom_model \
  --training_data=/data/frappe/train \
  --validation_data=/data/frappe/test \
  --num_epochs=1 \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --worker_resource_request="cpu=0.4,memory=2048Mi" \
  --worker_resource_limit="cpu=1,memory=3072Mi" \
  --ps_resource_request="cpu=0.2,memory=1024Mi" \
  --ps_resource_limit="cpu=1,memory=2048Mi" \
  --minibatch_size=64 \
  --num_minibatches_per_task=2 \
  --num_workers=$WORKER_NUM \
  --num_ps_pods=$PS_NUM \
  --checkpoint_steps=500 \
  --evaluation_steps=500 \
  --tensorboard_log_dir=/tmp/tensorboard-log \
  --grads_to_wait=1 \
  --use_async=True \
  --job_name=test-train \
  --log_level=INFO \
  --image_pull_policy=Never \
  --output=/saved_model/model_output \
  --volume="host_path=${PWD},mount_path=/saved_model"

I can paste this command into a shell script (say, test.sh), change the job_name, and run 'nohup test.sh &'. This will submit multiple training requests.
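For anyone trying to reproduce this, the steps described here could be scripted roughly as follows. This is only a sketch, not a script from the repository: the names `run`, `submit_job`, `DRY_RUN`, and `NUM_JOBS`, the `test-train-<n>` job names, and the default values for `WORKER_NUM` and `PS_NUM` are all assumptions made for illustration. The `elasticdl train` flags are copied unchanged from the command quoted in this thread. Unlike the `nohup test.sh &` approach, this runs each submission in the foreground, one after another.

```shell
#!/bin/sh
# Hypothetical repro helper: submit the same training job several times,
# one after another, each under a distinct --job_name.
set -eu

# Assumption: DRY_RUN=1 (the default here) only prints each command.
# Set DRY_RUN=0 to actually invoke the elasticdl client.
DRY_RUN="${DRY_RUN:-1}"
NUM_JOBS="${NUM_JOBS:-3}"      # assumption: number of submissions
WORKER_NUM="${WORKER_NUM:-2}"  # assumption: placeholder value
PS_NUM="${PS_NUM:-2}"          # assumption: placeholder value

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Submit one training job; only the job name differs between submissions.
submit_job() {
  run elasticdl train \
    --image_base=elasticdl:ci \
    --model_zoo=model_zoo \
    --model_def=deepfm_functional_api.deepfm_functional_api.custom_model \
    --training_data=/data/frappe/train \
    --validation_data=/data/frappe/test \
    --num_epochs=1 \
    --master_resource_request="cpu=0.2,memory=1024Mi" \
    --master_resource_limit="cpu=1,memory=2048Mi" \
    --worker_resource_request="cpu=0.4,memory=2048Mi" \
    --worker_resource_limit="cpu=1,memory=3072Mi" \
    --ps_resource_request="cpu=0.2,memory=1024Mi" \
    --ps_resource_limit="cpu=1,memory=2048Mi" \
    --minibatch_size=64 \
    --num_minibatches_per_task=2 \
    --num_workers="${WORKER_NUM}" \
    --num_ps_pods="${PS_NUM}" \
    --checkpoint_steps=500 \
    --evaluation_steps=500 \
    --tensorboard_log_dir=/tmp/tensorboard-log \
    --grads_to_wait=1 \
    --use_async=True \
    --job_name="$1" \
    --log_level=INFO \
    --image_pull_policy=Never \
    --output=/saved_model/model_output \
    --volume="host_path=${PWD},mount_path=/saved_model"
}

# Submit NUM_JOBS jobs sequentially: test-train-1, test-train-2, ...
i=1
while [ "${i}" -le "${NUM_JOBS}" ]; do
  submit_job "test-train-${i}"
  i=$((i + 1))
done
```

Running it as-is prints the commands that would be submitted; running it with `DRY_RUN=0` performs the submissions.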

skydoorkai commented 4 years ago

Can you still repro this issue with the current ElasticDL version?