Closed ghost closed 3 years ago
Unfortunately tf_cnn_benchmarks is unmaintained, so we do not plan on addressing issues. I recommend using the official models instead.
If you do want to keep using tf_cnn_benchmarks, try removing the --num_gpus=0
parameter; as I recall, it should not be set when running on CPUs (even if no GPUs are used). Also use the cnn_tf_v1.13_compatible branch
if you aren't already doing so.
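Switching to that branch could look like the following, assuming a fresh clone of the tensorflow/benchmarks repository (the original thread does not show these steps):

```shell
# Clone the benchmarks repo and check out the TF 1.13-compatible branch
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks
git checkout cnn_tf_v1.13_compatible
cd scripts/tf_cnn_benchmarks
```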
Hello! I have built TF 1.13.2 from source and I'm trying to run two-node distributed training on one machine (1 CPU, 1 GPU) using a CPU-only parameter server. However, an error occurred after I ran the following command:
nohup python tf_cnn_benchmarks.py --data_format=NHWC --num_gpus=0 --batch_size=8 --model=vgg16 --data_name=imagenet --variable_update=parameter_server --ps_hosts='localhost:2222' --worker_hosts='localhost:2223' --job_name='ps' --task_index=0 --local_parameter_device=cpu --device=cpu > ps.log &
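The original post only shows the parameter-server process; for this setup to run, a matching worker process would also need to be launched. A sketch of what that companion command might look like, assuming the single GPU on the machine does the training (the worker's exact flags are not given in the thread):

```shell
# Hypothetical worker counterpart to the ps command above.
# --num_gpus=1 assumes the worker trains on the machine's one GPU;
# the cluster flags mirror the ps invocation so the two jobs can rendezvous.
nohup python tf_cnn_benchmarks.py --num_gpus=1 \
  --batch_size=8 --model=vgg16 --data_name=imagenet \
  --variable_update=parameter_server \
  --ps_hosts='localhost:2222' --worker_hosts='localhost:2223' \
  --job_name='worker' --task_index=0 \
  --local_parameter_device=cpu > worker.log &
```

Note that the ps and worker processes must agree on --ps_hosts and --worker_hosts, and each declares its own role via --job_name and --task_index.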
Any help would be appreciated. Thanks!
Environment:
- OS: Ubuntu 18.04
- GCC version: 4.8
- CUDA and NCCL version: CUDA 10.0
- Framework version: TF 1.13
Here is my ps.log: