rstudio / tensorflow

TensorFlow for R
https://tensorflow.rstudio.com
Apache License 2.0

tf$device("/cpu0") or tf$device("/gpu0") result in "IndexError: list index out of range" #272

Closed: antspengy closed this issue 5 years ago

antspengy commented 5 years ago

Firstly, great package and documentation so thanks a lot for putting this all together.

Keras and TensorFlow in R have been working well for me in CPU and GPU mode on both a local Windows 10 installation and a remote Azure Linux Ubuntu 16.04 installation. I'm now at the point of experimenting with multi-GPU model building and have been following the instructions from the RStudio TensorFlow multi_gpu_model reference. However, I keep encountering an IndexError with every reference to tf$device in my code. For instance, the code:

library(tensorflow)
tf_config()
sess <- tf$Session()
hello <- tf$constant('Hello, TensorFlow!')
sess$run(hello)

with(tf$device("/cpu:0"), {
  const <- tf$constant(42)
})

results in the following console output:

[1] "Hello, TensorFlow!" TensorFlow v1.11.0 (~/.virtualenvs/r-reticulate/local/lib/python2.7/site-packages/keras)

Python v2.7 (~/.virtualenvs/r-reticulate/bin/python)

Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: list index out of range Detailed traceback: File "/usr/lib/python2.7/contextlib.py", line 17, in enter return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4243, in device self._add_device_to_stack(device_name_or_function, offset=2) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4182, in _add_device_to_stack self._device_function_stack.push_obj(spec, offset=total_offset) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/traceable_stack.py", line 106, in push_obj return traceable_obj.set_filename_and_line_from_caller(offset + 1) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/traceable_stack.py", line 64, in set_filename_and_line_from_caller self.filename, self.lineno = frame_records[negative_offset][:2]

However, in the same session I can run the equivalent code in Python without a problem:

library(reticulate)
r_constant <- py_run_string("
import tensorflow as tf
with tf.device('/cpu:0'):
    const = tf.constant(42)")
print(r_constant)

and it appears to be perfectly fine with output:

{'const': <tf.Tensor 'Const_5:0' shape=() dtype=int32>, '__builtins__': <module '__builtin__' (built-in)>, '__package__': None, 'sys': <module 'sys' (built-in)>, 'R': <class '__main__.R'>, 'tf': <module 'tensorflow' from '/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.pyc'>, '__name__': '__main__', 'r': <__main__.R>, '__doc__': None}

Note that all other aspects of the tensorflow and keras packages seem to be working well, and I haven't been able to resolve this specific issue even when using TensorFlow 1.9 or 1.10, or when using install_tensorflow() against my Anaconda or virtualenv environments with Python 3.5 and 3.6. The same issue occurs when I use '/gpu:0' as well. I haven't encountered any problems doing this operation natively in Python in Jupyter notebook instances on the same environments either.
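For reference, the kind of call I've been using to switch TensorFlow versions and environments looks roughly like this (a sketch rather than an exact transcript; the version strings and method argument are just examples of what install_tensorflow() accepts):

# Rough sketch: pin a specific TensorFlow version into a virtualenv or conda env.
# The exact version/method values here are illustrative, not from the thread.
library(tensorflow)
install_tensorflow(method = "virtualenv", version = "1.10")
# install_tensorflow(method = "conda", version = "1.10-gpu")  # GPU build via conda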

Any thoughts/suggestions would be much appreciated, thanks.

skeydan commented 5 years ago

1.) Are you saying the exact same thing happens when running Python 3.6?

2.) Trying to explore... (I cannot reproduce; your code runs fine for me)

# list the devices TensorFlow can see from R
tf$python$client$device_lib$list_local_devices()

# and check where ops actually get placed
library(tensorflow)
sess <- tf$Session(config = tf$ConfigProto(log_device_placement = TRUE))
const <- tf$constant(42)
sess$run(const)

antspengy commented 5 years ago

Hi,

  1. Yes, it seems to be the same message for Python 3.6, 3.5 or 2.7. Here's the output when I make TensorFlow use my Anaconda 3.6 Python (note that I took my username out of the directory listing):

Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: list index out of range

Detailed traceback:
  File "/home/#MYUSERNAME#/anaconda3/envs/machine/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/#MYUSERNAME#/anaconda3/envs/machine/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4243, in device
    self._add_device_to_stack(device_name_or_function, offset=2)
  File "/home/#MYUSERNAME#/anaconda3/envs/machine/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4182, in _add_device_to_stack
    self._device_function_stack.push_obj(spec, offset=total_offset)
  File "/home/#MYUSERNAME#/anaconda3/envs/machine/lib/python3.6/site-packages/tensorflow/python/framework/traceable_stack.py", line 106, in push_obj
    return traceable_obj.set_filename_and_line_from_caller(offset + 1)
  File "/home/#MYUSERNAME#/anaconda3/envs/machine/lib/python3.6/site-packages/tensorflow/python/framework/traceable_stack.py", line 64, in set_filenam

  2. The output of tf$python$client$device_lib$list_local_devices() is:

    [[1]] name: "/device:CPU:0"
          device_type: "CPU"
          memory_limit: 268435456
          locality { }
          incarnation: 12436588316063005634

    [[2]] name: "/device:GPU:0"
          device_type: "GPU"
          memory_limit: 15862523495
          locality { bus_id: 1 links { } }
          incarnation: 11549627479208016464
          physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: ab3a:00:00.0, compute capability: 6.0"

The output of the other commands is:

library(tensorflow)
sess <- tf$Session(config = tf$ConfigProto(log_device_placement=TRUE))
2018-10-15 12:49:21.505240: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-15 12:49:22.314706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: ab3a:00:00.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-10-15 12:49:22.314751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-15 12:49:22.603346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-15 12:49:22.603390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-15 12:49:22.603398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-15 12:49:22.603695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15127 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: ab3a:00:00.0, compute capability: 6.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: ab3a:00:00.0, compute capability: 6.0
2018-10-15 12:49:22.746300: I tensorflow/core/common_runtime/direct_session.cc:291] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: ab3a:00:00.0, compute capability: 6.0

const <- tf$constant(42)
sess$run(const)
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-15 12:49:22.752159: I tensorflow/core/common_runtime/placer.cc:922] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[1] 42

skeydan commented 5 years ago

Thanks! I can in fact reproduce this when I use an env that has a GPU.

To be followed up in

https://github.com/rstudio/reticulate/issues/365

javierluraschi commented 5 years ago

@skeydan it should be possible to test this fix with the nightly build tomorrow and then remove the fix that installs 1.10 by default.
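Something along these lines should work for trying it out (just a sketch; "nightly-gpu" assumes the version aliases documented for install_tensorflow()):

# Sketch: install a nightly TensorFlow build and re-run the failing snippet.
library(tensorflow)
install_tensorflow(version = "nightly-gpu")  # or "nightly" for the CPU build

# after restarting R:
library(tensorflow)
with(tf$device("/cpu:0"), {
  const <- tf$constant(42)
})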

jjallaire commented 5 years ago

Let's wait to remove the 1.10 default until they actually ship this in v1.12.

J.J.

skeydan commented 5 years ago

Hi Javier, thanks for taking care of this!

I just tested on a fresh (as of last night) build of 1.12 though and I still get the error:

> with(tf$device("/cpu:0"), {
+   const <- tf$constant(42)
+ })
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  IndexError: list index out of range

Detailed traceback: 
  File "/home/key/anaconda3/envs/tf-12-gpu/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/key/anaconda3/envs/tf-12-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4245, in device
    self._add_device_to_stack(device_name_or_function, offset=2)
  File "/home/key/anaconda3/envs/tf-12-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4184, in _add_device_to_stack
    self._device_function_stack.push_obj(spec, offset=total_offset)
  File "/home/key/anaconda3/envs/tf-12-gpu/lib/python3.6/site-packages/tensorflow/python/framework/traceable_stack.py", line 106, in push_obj
    return traceable_obj.set_filename_and_line_from_caller(offset + 1)
  File "/home/key/anaconda3/envs/tf-12-gpu/lib/python3.6/site-packages/tensorflow/python/framework/traceable_stack.py", line 64, in set_filename_and_line_from_caller
    self.filename, self.lin

Looks like they merged it into master but not branch 1.12...

I'll build from master to verify.

skeydan commented 5 years ago

Okay, so I can confirm it works on TF master... But this leaves open the question of which version to install by default. I wonder how large the share of new installers who would actually be affected by this is, and whether we wouldn't want to switch to 1.12 (instead of staying with 1.10) all the same (assuming we don't want to install the nightly by default).
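For context, whichever release ends up as the default, users can still request another one explicitly, roughly like this (a sketch; the version strings are examples of what install_tensorflow() accepts):

# Sketch of the options under discussion; pick ONE, independent of the default.
library(tensorflow)
install_tensorflow(version = "1.10")       # the release currently pinned as default
# install_tensorflow(version = "1.12")     # the upcoming stable release (fix not yet on its branch)
# install_tensorflow(version = "nightly")  # dev build that already includes the fix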

jjallaire commented 5 years ago

I think we should stay with 1.10 until 1.13 comes out.