tensorflow / benchmarks

A benchmark framework for Tensorflow
Apache License 2.0

Running tf_cnn_benchmarks.py #71

Closed fmoo7 closed 6 years ago

fmoo7 commented 6 years ago

Hello,

I have copied the benchmarks folder into the tensorflow directory.

(tensorflow) root@P50:/opt/DL/tensorflow# ls -all
total 28
drwxr-xr-x 6 root root 4096 oct 22 13:00 .
drwxr-xr-x 5 root root 4096 oct 22 16:53 ..
drwxr-xr-x 8 root root 4096 oct 22 13:00 benchmarks
drwxr-xr-x 2 root root 4096 oct 22 12:53 bin
drwxr-xr-x 2 root root 4096 oct 22 12:50 include
drwxr-xr-x 3 root root 4096 oct 22 12:50 lib
-rw-r--r-- 1 root root 60 oct 22 12:50 pip-selfcheck.json

When trying to run tf_cnn_benchmarks.py I get this error:

(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts# python3 tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=16 --model=inception3 --data_dir=/opt/DL/imagenet/datasets/ --variable_update=parameter_server --nodistortions
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 26, in <module>
    import benchmark_cnn
  File "/opt/DL/tensorflow/benchmarks/scripts/benchmark_cnn.py", line 41, in <module>
    import cnn_util
  File "/opt/DL/tensorflow/benchmarks/scripts/cnn_util.py", line 40
    print log
            ^
SyntaxError: Missing parentheses in call to 'print'
(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts#

Do I need to do something else before running the benchmark?

Thank you, Florin

tfboyd commented 6 years ago

Looks like a Python 3 issue based on the error message, specifically line 40. We are a little sloppy about testing this script for Python 3 compatibility. Someone might have already submitted a PR to fix this, but if not, just add parentheses to the print statement.
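For reference, a minimal sketch of the fix described above. The `log_fn` name and the `flush_stdout` argument are illustrative assumptions, not the repository's exact code. The point is only that the parenthesized call form of `print` is accepted by Python 3, and for a single argument it also behaves the same under Python 2.7.

```python
import sys


def log_fn(log, flush_stdout=False):
    """Print one log line using the call form of print.

    `print log` (statement form) is a SyntaxError on Python 3;
    `print(log)` with a single argument works on both 2.7 and 3.
    For multiple arguments on Python 2.7, add
    `from __future__ import print_function` at the top of the module.
    """
    print(log)
    if flush_stdout:
        sys.stdout.flush()


log_fn("Generating model")
```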

tfboyd commented 6 years ago

https://github.com/tensorflow/benchmarks/pull/72

tfboyd commented 6 years ago

Fixed by some kind person.

keightyfive commented 6 years ago

I get the same error even though I'm running it under Python 2.7. I checked line 40 in cnn_util.py and there are no parentheses in the print statement...

tfboyd commented 6 years ago

I am checking it right now. A change must have been accepted that tweaked something.

keightyfive commented 6 years ago

Sorry sorry sorry... it's the line underneath:

  File "/home/kklein/tf_cnn_benchmarks/cnn_util.py", line 41
    if FLAGS.flush_stdout:
    ^
IndentationError: unexpected indent

I'll let you know in a few minutes if I got it fixed...

keightyfive commented 6 years ago

As I assumed, fixing this leads to the next error:

  File "/home/kklein/tf_cnn_benchmarks/variable_mgr.py", line 29, in <module>
    from tensorflow.contrib.all_reduce.python import all_reduce
ImportError: No module named all_reduce.python

I'm not sure though if I should download the whole repo and then try again... will this work for python 2.7.5 without making too many changes?

tfboyd commented 6 years ago

I just tested under Python 2.7 on my local machine and verified I had a clean pull of master from the repo. For the latest benchmark code you need TF 1.4 due to the addition of all_reduce. I ran the command the original poster left and it ran fine under Python 2.7.

https://www.tensorflow.org/versions/r1.4/install/install_linux
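A quick way to check the installed version before running the latest benchmark code might look like the sketch below. The helper names `parse_version` and `supports_all_reduce` are hypothetical, and the (1, 4) minimum comes only from the statement above that the all_reduce import needs TF 1.4.

```python
def parse_version(version):
    """Turn a version string such as '1.4.0' into the tuple (1, 4)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def supports_all_reduce(version, minimum=(1, 4)):
    """Return True if the TensorFlow version is at least `minimum`."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        print(tf.__version__, supports_all_reduce(tf.__version__))
```

Comparing integer tuples rather than strings avoids treating a version like 1.10 as older than 1.4.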

tfboyd commented 6 years ago

If you sync to this commit (SHA d984e91), it will work with TF 1.3. I benchmarked that version with 1.3.

keightyfive commented 6 years ago

I installed TF 1.4 and ran the script, it gives me a huge error, apparently associated with Cuda and GPU initialisation, but also gives some error again regarding the all_reduce. It could also be something missing on the server though. Thanks a lot for your help though. Not quite sure what you mean with syncing to sha-hash:d984e91, but thanks.

srun --partition=amd-shortq python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --batch_size=64 --model=vgg16 --variable_update=replicated --use_nccl=True
2017-10-31 18:59:00.047556: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2017-10-31 18:59:00.053555: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-10-31 18:59:00.053646: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: gpu01
2017-10-31 18:59:00.053662: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: gpu01
2017-10-31 18:59:00.053767: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.81.0
2017-10-31 18:59:00.053821: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.81 Sat Sep 2 02:43:11 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
"""
2017-10-31 18:59:00.053858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.81.0
2017-10-31 18:59:00.053890: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.81.0
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
            64 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: nccl

Generating model
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 46, in <module>
    tf.app.run()
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 42, in main
    bench.run()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
    return self._benchmark_cnn()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1068, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels: device='GPU'

     [[Node: NcclAllReduce_30 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c30", _device="/device:GPU:0"](v0/tower_0/gradients/AddN_1)]]

Caused by op u'NcclAllReduce_30', defined at:
  File "tf_cnn_benchmarks.py", line 46, in <module>
    tf.app.run()
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 42, in main
    bench.run()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
    return self._benchmark_cnn()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 986, in _benchmark_cnn
    (image_producer_ops, enqueue_ops, fetches) = self._build_model()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1272, in _build_model
    all_top_5_ops, phase_train)
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1294, in _build_fetches
    self.variable_mgr.preprocess_device_grads(device_grads))
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 655, in preprocess_device_grads
    self._all_reduce_spec.shards, self.benchmark_cnn.gpu_indices)
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 1090, in sum_gradients_all_reduce
    num_shards))
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 1010, in sum_grad_and_var_all_reduce
    summed_grads = nccl.all_sum(scaled_grads)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
    return _apply_all_reduce('sum', tensors)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 208, in _apply_all_reduce
    shared_name=shared_name))
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/nccl/ops/gen_nccl_ops.py", line 54, in nccl_all_reduce
    num_devices=num_devices, shared_name=shared_name, name=name)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels: device='GPU'

     [[Node: NcclAllReduce_30 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c30", _device="/device:GPU:0"](v0/tower_0/gradients/AddN_1)]]

srun: error: gpu01: task 0: Exited with exit code 1
(tensorflow)[kklein@robotarium tf_cnn_benchmarks]$ module load cudnn/6.0
(tensorflow)[kklein@robotarium tf_cnn_benchmarks]$ module load cuda80/toolkit
(tensorflow)[kklein@robotarium tf_cnn_benchmarks]$ srun --partition=amd-shortq python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --batch_size=64 --model=vgg16 --variable_update=replicated --use_nccl=True
2017-10-31 19:00:50.870291: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2017-10-31 19:00:50.876461: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-10-31 19:00:50.876568: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: gpu01
2017-10-31 19:00:50.876584: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: gpu01
2017-10-31 19:00:50.876693: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.81.0
2017-10-31 19:00:50.876740: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.81 Sat Sep 2 02:43:11 PDT 2017
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
"""
2017-10-31 19:00:50.876777: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.81.0
2017-10-31 19:00:50.876791: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.81.0
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
            64 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: nccl

Generating model
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 46, in <module>
    tf.app.run()
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 42, in main
    bench.run()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
    return self._benchmark_cnn()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1068, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels: device='GPU'

     [[Node: NcclAllReduce_30 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c30", _device="/device:GPU:0"](v0/tower_0/gradients/AddN_1)]]

Caused by op u'NcclAllReduce_30', defined at:
  File "tf_cnn_benchmarks.py", line 46, in <module>
    tf.app.run()
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 42, in main
    bench.run()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
    return self._benchmark_cnn()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 986, in _benchmark_cnn
    (image_producer_ops, enqueue_ops, fetches) = self._build_model()
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1272, in _build_model
    all_top_5_ops, phase_train)
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1294, in _build_fetches
    self.variable_mgr.preprocess_device_grads(device_grads))
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 655, in preprocess_device_grads
    self._all_reduce_spec.shards, self.benchmark_cnn.gpu_indices)
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 1090, in sum_gradients_all_reduce
    num_shards))
  File "/home/kklein/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 1010, in sum_grad_and_var_all_reduce
    summed_grads = nccl.all_sum(scaled_grads)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
    return _apply_all_reduce('sum', tensors)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 208, in _apply_all_reduce
    shared_name=shared_name))
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/nccl/ops/gen_nccl_ops.py", line 54, in nccl_all_reduce
    num_devices=num_devices, shared_name=shared_name, name=name)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/kklein/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels: device='GPU'

     [[Node: NcclAllReduce_30 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c30", _device="/device:GPU:0"](v0/tower_0/gradients/AddN_1)]]

srun: error: gpu01: task 0: Exited with exit code 1

tfboyd commented 6 years ago

git reset --hard d984e91

This may seem an odd question, but what are you trying to accomplish? This script is only useful if you are an expert and looking to get the most performance out of your code. This script is not easy to follow for most people. It can also be useful for testing hardware.

edit: it does look like it picked up the GPU. It is really hard to read your gigantically formatted error message.

keightyfive commented 6 years ago

Thanks. I'm a relatively fresh research student (as you can tell) who just recently got started with TensorFlow. I'm looking into the behaviour of neural nets on various hardware architectures. That's why I want to run the benchmarks on a cluster that uses SLURM, with different nodes such as AMD, Intel and DGX-1. I'm literally trying to run the commands from https://www.tensorflow.org/performance/performance_models with parameters, but it seems as if I'm not loading the modules properly. I'm loading the cuDNN 6 and CUDA 8 toolkit modules, but it seems I'm still doing something wrong, as it fails to recognise the device, GPU, CUDA, etc.

tfboyd commented 6 years ago

I am not very good at these responses. There is not much I can do. The commands have changed a little in the latest script. We are using all_reduce_spec instead of use_nccl. It created problems for me as well. I would update the page, but part of me prefers people not use the script unless they are willing to suffer through some of the struggles. Some changes we put in it are bleeding edge and only work with TensorFlow head. I am not sure this is a good script for your workload. It is good for testing hardware architectures and checking whether TensorFlow is regressing, but it is not great for training models, as it is missing features to make that easy.
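As a rough illustration of the flag change mentioned above, the sketch below rewrites an older argument list. The helper `modernize_flags` is hypothetical, and the mapping of `--use_nccl=True` to `--all_reduce_spec=nccl` is an assumption about the newer script; check the flag definitions in the version of the script you actually have checked out.

```python
def modernize_flags(args):
    """Replace the older --use_nccl flag with --all_reduce_spec.

    Assumption: --use_nccl=True corresponds to --all_reduce_spec=nccl,
    and --use_nccl=False is simply dropped.
    """
    updated = []
    for arg in args:
        if arg.startswith("--use_nccl"):
            value = arg.partition("=")[2].lower()
            # A bare --use_nccl is treated as true, like other boolean flags.
            if value in ("", "true", "1"):
                updated.append("--all_reduce_spec=nccl")
            continue
        updated.append(arg)
    return updated


old_args = ["--num_gpus=1", "--variable_update=replicated", "--use_nccl=True"]
print(" ".join(modernize_flags(old_args)))
```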

I am working to get better, faster, standard models in the garden. Any work not on that seems like a waste of effort to me. I keep getting distracted because I want to help with these problems but really I am not sure if helping really helps.

Yeah, I am not sure what you are doing wrong. It is also really unfun to read your comment with the GIGANTIC TEXT. I also cannot guess what is wrong. CUDA 8 + cuDNN 6 should be fine.

As an FYI if you are testing intel or amd without GPUs you might want to look at compiling with MKL.

keightyfive commented 6 years ago

Your help is very much appreciated and you're doing a great job. I also feel stupid going on forums and asking newbie questions of which I know I'm going to smile at myself sometime later when I have more experience. I am getting a different error than above and Cuda seems to load, and yes it was super ugly and I honestly didn't expect you to read through the whole thing, but thanks again. I think it's just that the examples on the homepage are customised towards specific architectures like Google Compute Engine or DGX-1 and I will have to play around with the parameters until it works for me, as every architecture and environment is always different. I'm also not sure how the parameters that can be passed to SLURM get along with the parameters for the scripts themselves. I'll just have to play around with it. Thanks again.

tfboyd commented 6 years ago

It was not a desire to not help. With SLURM added it is hard for me to guess. I am also going through my own troubles and trying to find my way. Part of that is that I am done working with this script. I authored maybe a few lines (handful) but I have been running it everywhere for most of the year.

Your log confused me, although I do not doubt it. Normally when I see the driver version messages, things go bad and TensorFlow drops to CPU only. Maybe that is what happened, and then I had doubts. One easy way to check is to run with a single GPU, as you did, using resnet50 without a data dir, with variable_update set to parameter_server and local_parameter_device=cpu. That allows the run to work on CPU if the GPU fails. Clearly not what you want, but it confirms the problem: it should run, just slowly, because it will only be on CPU.
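The sanity check described above could be assembled like this. It is a sketch only: the function name `cpu_fallback_command` is made up for illustration, and the flags are the ones already used earlier in this thread, with `--data_dir` deliberately left out.

```python
import sys


def cpu_fallback_command(script="tf_cnn_benchmarks.py"):
    """Build the sanity-check command: one GPU requested, resnet50,
    no --data_dir, and variables kept on a CPU parameter server."""
    return [
        sys.executable, script,
        "--num_gpus=1",
        "--model=resnet50",
        "--variable_update=parameter_server",
        "--local_parameter_device=cpu",
    ]


if __name__ == "__main__":
    print(" ".join(cpu_fallback_command()))
    # To launch it from the scripts directory instead of printing it:
    # import subprocess; subprocess.call(cpu_fallback_command())
```

If this run completes but is very slow, that is consistent with TensorFlow not seeing the GPU at all.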

My hint that this happened is the last all_reduce NCCL error, which, while cryptic, said CPU in the message along with GPU. That made me think TF is telling us it has a GPU kernel for that op but, 'dude', you only have a CPU. :-)
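To make that reading of the error concrete, here is a small sketch that pulls the two device lists out of the message text. The function name `kernel_mismatch` and the regular expressions are illustrative assumptions based only on the message format shown in this thread.

```python
import re


def kernel_mismatch(message):
    """Return (registered_devices, kernel_devices) parsed from a
    'No OpKernel was registered' error message."""
    devices = re.search(r"Registered devices: \[([^\]]*)\]", message)
    kernels = re.findall(r"device='(\w+)'", message)
    available = [d.strip() for d in devices.group(1).split(",")] if devices else []
    return available, kernels


msg = ("No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. "
       "Registered devices: [CPU], Registered kernels: device='GPU'")
available, kernels = kernel_mismatch(msg)
if "GPU" in kernels and "GPU" not in available:
    print("The op has a GPU kernel, but TensorFlow only sees:", available)
```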

If you were local and not SLURM, I would

That almost always fixes my CUDA issues or gives me enough hints to put things back together. The reinstalling TF is weird and why it works sometimes blows my mind.

Truly, best of luck. I think I am done personally supporting the script, but feel free to ping me directly. You can email me: my first and last name, which I assume are on GitHub, no spaces, @google.com.


keightyfive commented 6 years ago

Hi, I have to apologise - it works now. As I already suspected, it was something I had missed with SLURM. I requested the number of GPUs that I wanted, but didn't specify which resources I needed specifically, which you have to do with additional parameters. So the system then fell back to the default CPU, which of course gave all the CUDA-related errors. The script itself seems to run fine, although the pointer to TF 1.4 was helpful as I was not aware of this. It probably works with TF 1.3 as well though... I might try that later. Sorry again for causing all the hassle and wasting your time, you can obviously close the thread now. Cheers
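The exact parameters are not given here, so the following is only a guess at what such a request can look like. It assumes the cluster exposes GPUs through SLURM's generic resource option `--gres=gpu:N`; other sites may need different or additional options. The helper `build_srun_command` is hypothetical.

```python
def build_srun_command(partition, num_gpus, benchmark_args):
    """Build an srun command line that explicitly requests GPU resources."""
    return [
        "srun",
        "--partition={}".format(partition),
        "--gres=gpu:{}".format(num_gpus),  # assumed resource request; site-specific
        "python", "tf_cnn_benchmarks.py",
        "--num_gpus={}".format(num_gpus),
    ] + list(benchmark_args)


cmd = build_srun_command("amd-shortq", 1, ["--batch_size=64", "--model=vgg16"])
print(" ".join(cmd))
```

Keeping the SLURM options before `python` and the benchmark flags after the script name makes it clear which tool consumes which parameter.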

tfboyd commented 6 years ago

I make silly mistakes many times per day.
