tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

NotFoundError: No CPU devices are available in this process #509

Closed: ghost closed this issue 3 years ago

ghost commented 3 years ago

Hello! I have built TF 1.13.2 from source and I'm trying to run two-node distributed training on one machine (1 CPU, 1 GPU) using a CPU-only parameter server. However, an error occurred when I ran the following command:

nohup python tf_cnn_benchmarks.py --data_format=NHWC --num_gpus=0 --batch_size=8 --model=vgg16 --data_name=imagenet --variable_update=parameter_server --ps_hosts='localhost:2222' --worker_hosts='localhost:2223' --job_name='ps' --task_index=0 --local_parameter_device=cpu --device=cpu > ps.log &

Any help would be appreciated. Thanks!

Environment:
OS: Ubuntu 18.04
GCC version: 4.8
CUDA and NCCL version: CUDA 10.0
Framework version: TF 1.13

Here is my ps.log:


2021-01-02 14:56:01.077977: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:466] Not found: No CPU devices are available in this process
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 73, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 63, in main
    bench = benchmark_cnn.BenchmarkCNN(params)
  File "/home/cluster/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1454, in __init__
    params, create_config_proto(params))
  File "/home/cluster/benchmarks/scripts/tf_cnn_benchmarks/platforms/default/util.py", line 42, in get_cluster_manager
    return cnn_util.GrpcClusterManager(params, config_proto)
  File "/home/cluster/benchmarks/scripts/tf_cnn_benchmarks/cnn_util.py", line 246, in __init__
    protocol=params.server_protocol)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/server_lib.py", line 148, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.NotFoundError: No CPU devices are available in this process

reedwm commented 3 years ago

Unfortunately tf_cnn_benchmarks is unmaintained, so we do not plan on addressing issues. I recommend using the official models instead.

If you do want to keep using tf_cnn_benchmarks, try removing the --num_gpus=0 flag; IIRC it should not be set when running on CPUs, even if no GPUs are used. Also use the cnn_tf_v1.13_compatible branch if you aren't already doing so.
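
For anyone who wants to try that suggestion, here is a minimal sketch that rebuilds the parameter-server command from the original post with the --num_gpus flag dropped. The helper name drop_flag is hypothetical and not part of tf_cnn_benchmarks, and whether the resulting command actually avoids the NotFoundError was not confirmed in this thread.

```python
# Sketch only: rebuild the ps command from the issue without --num_gpus.
# `drop_flag` is a hypothetical helper, not part of tf_cnn_benchmarks.
import shlex

# Flags copied from the original post (shell quotes around host lists removed).
ORIGINAL_PS_COMMAND = (
    "python tf_cnn_benchmarks.py --data_format=NHWC --num_gpus=0 "
    "--batch_size=8 --model=vgg16 --data_name=imagenet "
    "--variable_update=parameter_server --ps_hosts=localhost:2222 "
    "--worker_hosts=localhost:2223 --job_name=ps --task_index=0 "
    "--local_parameter_device=cpu --device=cpu"
)


def drop_flag(argv, name):
    """Return argv without entries of the form `--name=value` or bare `--name`.

    The two-token form (`--name value`) is not handled here.
    """
    prefix = "--" + name
    return [a for a in argv if a != prefix and not a.startswith(prefix + "=")]


ps_argv = drop_flag(shlex.split(ORIGINAL_PS_COMMAND), "num_gpus")

# Print the command so it can be copied into a shell or passed to subprocess.
print(" ".join(ps_argv))
```

The printed command is meant to be run from scripts/tf_cnn_benchmarks in a checkout of the cnn_tf_v1.13_compatible branch, with output redirected to ps.log as in the original post.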