fmoo7 closed this issue 6 years ago.
Looks like a Python 3 issue, based on the error message and specifically on line 40. We have been a little sloppy about testing this script for Python 3 compatibility. Someone may have already submitted a PR to fix this, but if not, just add parentheses to the print statement.
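For anyone hitting this later: in Python 3, print is a function, so the Python 2 statement `print log` must become `print(log)`. A minimal sketch of patching that one line in place (run it from the directory containing your checkout's cnn_util.py; the literal text `print log` is assumed from the traceback):

```shell
# Rewrite the Python 2 print statement as a Python 3 function call.
# Assumes cnn_util.py contains the literal statement "print log".
sed -i 's/print log/print(log)/' cnn_util.py
```

The resulting `print(log)` is also valid Python 2, so the patched file keeps working under both interpreters.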
Fixed by some kind person.
I get the same error even though I'm running it under Python 2.7. I checked line 40 in cnn_util.py, and there are no parentheses in the print statement...
I am checking it right now. A change must have been accepted that tweaked something.
Sorry sorry sorry... it's the line underneath:

File "/home/kklein/tf_cnn_benchmarks/cnn_util.py", line 41
    if FLAGS.flush_stdout:
    ^
IndentationError: unexpected indent

I'll let you know in a few minutes if I got it fixed...
As I assumed, fixing this leads to the next error:
File "/home/kklein/tf_cnn_benchmarks/variable_mgr.py", line 29, in
I'm not sure, though, whether I should download the whole repo and then try again... will this work for Python 2.7.5 without too many changes?
I just tested under Python 2.7 on my local machine and verified I had a clean master pull from the repo. For the latest benchmark code you need TF 1.4, due to the addition of all_reduce. I ran the command the original poster left and it was fine under Python 2.7.
https://www.tensorflow.org/versions/r1.4/install/install_linux
If you sync to this SHA: d984e91, it will work with 1.3. I benchmarked that version with 1.3.
I installed TF 1.4 and ran the script. It gives me a huge error, apparently associated with CUDA and GPU initialisation, but also some error again regarding all_reduce. It could also be something missing on the server. I'm not quite sure what you mean by syncing to SHA d984e91, but thanks a lot for your help.
Generating model
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 46, in
[[Node: NcclAllReduce_30 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c30", _device="/device:GPU:0"](v0/tower_0/gradients/AddN_1)]]
Caused by op u'NcclAllReduce_30', defined at:
File "tf_cnn_benchmarks.py", line 46, in
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels: device='GPU'
[[Node: NcclAllReduce_30 = NcclAllReduce[T=DT_FLOAT, num_devices=1, reduction="sum", shared_name="c30", _device="/device:GPU:0"](v0/tower_0/gradients/AddN_1)]]
srun: error: gpu01: task 0: Exited with exit code 1
git reset --hard d984e91
This may seem an odd question, but what are you trying to accomplish? This script is only useful if you are an expert looking to get the most performance out of your code; it is not easy to follow for most people. It can also be useful for testing hardware.
edit: it does look like it picked up the GPU. It is really hard to read your gigantically formatted error message.
Thanks. I'm a relatively fresh research student (as you can tell) who just recently got started with TensorFlow. I'm looking into the behaviour of neural nets on various hardware architectures, which is why I want to run the benchmarks on a cluster that uses SLURM, with different nodes such as AMD, Intel, and DGX-1. I'm literally trying to run the commands from https://www.tensorflow.org/performance/performance_models with parameters, but it seems as if I'm not loading the modules properly. I'm loading cuDNN 6 and the CUDA 8 toolkit, but it seems I'm still doing something wrong, as it fails to recognise the device, GPU, CUDA, etc.
I am not very good at these responses, and there is not much I can do. The commands have changed a little in the latest script: we are using all_reduce_spec instead of use_nccl. That created problems for me as well. I would update the page, but part of me prefers that people not use the script unless they are willing to suffer through some of the struggles; some of the changes we put in are bleeding edge and only work with TensorFlow head. I am not convinced this is a good script for your workload. It is good for testing hardware architectures and checking whether TensorFlow is regressing, but it is not great for training models, as it is missing features that would make that easy.
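A sketch of the flag change described above, assuming a TF-1.4-era checkout of tf_cnn_benchmarks (verify the exact flag names against your checkout's --help; model, GPU count, and batch size here are illustrative):

```shell
# Older invocations selected NCCL with a boolean flag:
python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=2 \
  --variable_update=replicated --use_nccl=True

# Newer scripts replace it with an all-reduce spec:
python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=2 \
  --variable_update=replicated --all_reduce_spec=nccl
```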
I am working to get better, faster, standard models in the garden. Any work not on that seems like a waste of effort to me. I keep getting distracted because I want to help with these problems but really I am not sure if helping really helps.
Yeah, I am not sure what you are doing wrong. It is also really unfun to read your comment with the GIGANTIC TEXT, and I cannot guess what is wrong. CUDA 8 + cuDNN 6 should be fine.
As an FYI if you are testing intel or amd without GPUs you might want to look at compiling with MKL.
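For the CPU-only Intel/AMD nodes, here is a sketch of a source build with MKL enabled (it assumes a TensorFlow source tree whose Bazel setup includes the mkl config; exact flags and the wheel path vary by TF version):

```shell
# From the TensorFlow source root: configure, then build a pip package
# with Intel MKL kernels enabled.
./configure
bazel build --config=mkl --config=opt \
  //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tf_pkg
pip install /tmp/tf_pkg/tensorflow-*.whl
```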
Your help is very much appreciated and you're doing a great job. I also feel stupid going on forums and asking newbie questions of which I know I'm going to smile at myself sometime later when I have more experience. I am getting a different error than above and Cuda seems to load, and yes it was super ugly and I honestly didn't expect you to read through the whole thing, but thanks again. I think it's just that the examples on the homepage are customised towards specific architectures like Google Compute Engine or DGX-1 and I will have to play around with the parameters until it works for me, as every architecture and environment is always different. I'm also not sure how the parameters that can be passed to SLURM get along with the parameters for the scripts themselves. I'll just have to play around with it. Thanks again.
It was not a lack of desire to help. With SLURM added, it is hard for me to guess. I am also going through my own troubles and trying to find my way; part of that is that I am done working with this script. I authored maybe a handful of lines, but I have been running it everywhere for most of the year.
Your log confused me, although I do not doubt it. Normally when I see those driver version messages, things go bad and drop to CPU only. Maybe that is what happened, and then I had doubts. One thing that makes this easy to check: run single GPU, as you did, with resnet50, no data dir, variable_update set to parameter_server, and local_parameter_device=cpu. That allows it to fall back to CPU if the GPU fails. Clearly not what you want, but it confirms the problem: it should run, just slowly, since it will be on CPU only.
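The sanity check described above, as a concrete command (a sketch; the batch size is illustrative, and with no --data_dir the script runs on synthetic data):

```shell
# Single tower, parameters hosted on the CPU; if the GPU setup is broken
# this should still run, just slowly, on CPU only.
python tf_cnn_benchmarks.py \
  --model=resnet50 \
  --num_gpus=1 \
  --batch_size=32 \
  --variable_update=parameter_server \
  --local_parameter_device=cpu
```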
My hint that this happened was the last all_reduce NCCL error, which, while cryptic, said CPU in the message along with GPU. That made me think TF was telling us it has a GPU kernel for that op, but 'dude' you only have a CPU. :-)
If you were local and not on SLURM, I would suggest reinstalling CUDA and then TensorFlow. That almost always fixes my CUDA issues or gives me enough hints to put things back together. Reinstalling TF is the weird one; why it works sometimes blows my mind.
Truly, best of luck. I think I am done personally supporting the script, but feel free to ping me directly. You can email me: my first and last name (which I assume are on GitHub), no spaces, @google.com.
Hi, I have to apologise: it works now. As I assumed, it was something I missed with SLURM. I requested the number of GPUs I wanted but didn't specify which resources I needed, which you have to do with additional parameters, so the system fell back to the default CPU, which of course gave all the CUDA-related errors. The script itself seems to run fine. The update about TF 1.4 was nice, as I was not aware of it; it probably works with TF 1.3 as well, though... I might try that later. Sorry again for causing all the hassle and wasting your time; you can obviously close the thread now. Cheers
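For later readers hitting the same wall: the missing piece was telling SLURM which generic resources the job needs, not just requesting GPU nodes. A sketch (the partition name and gres label depend on the cluster's configuration; the benchmark flags are illustrative):

```shell
# Request one GPU explicitly; without --gres the job can land on an
# allocation with no GPUs visible, so TF silently falls back to CPU.
srun --partition=gpu --gres=gpu:1 \
  python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=1 --batch_size=32
```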
I make silly mistakes many times per day.
Hello,
I have copied the benchmarks folder under the tensorflow directory.
(tensorflow) root@P50:/opt/DL/tensorflow# ls -all
total 28
drwxr-xr-x 6 root root 4096 oct 22 13:00 .
drwxr-xr-x 5 root root 4096 oct 22 16:53 ..
drwxr-xr-x 8 root root 4096 oct 22 13:00 benchmarks
drwxr-xr-x 2 root root 4096 oct 22 12:53 bin
drwxr-xr-x 2 root root 4096 oct 22 12:50 include
drwxr-xr-x 3 root root 4096 oct 22 12:50 lib
-rw-r--r-- 1 root root 60 oct 22 12:50 pip-selfcheck.json
When trying to run tf_cnn_benchmarks I am getting this error:
(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts# python3 tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=16 --model=inception3 --data_dir=/opt/DL/imagenet/datasets/ --variable_update=parameter_server --nodistortions
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 26, in
    import benchmark_cnn
  File "/opt/DL/tensorflow/benchmarks/scripts/benchmark_cnn.py", line 41, in
    import cnn_util
  File "/opt/DL/tensorflow/benchmarks/scripts/cnn_util.py", line 40
    print log
            ^
SyntaxError: Missing parentheses in call to 'print'
(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts#
Do I need to do something else before running the benchmark?
Thank you, Florin