tensorflow / ngraph-bridge

TensorFlow-nGraph bridge

PlaidML devices do not work #509

Open zap-wizard opened 4 years ago

zap-wizard commented 4 years ago

Not all devices work with the PlaidML backend. However, I have gotten the devices working on PlaidML with Keras (plaidml-keras), so I'm not sure whether the problem is in PlaidML or in nGraph.

Versions: ngraph-bridge 0.22.0-rc4, PlaidML 0.6.4, macOS High Sierra 10.13.6.

The following tests were run with mnist_deep_simplified.py from the examples.

With device metal_intel(r)_hd_graphics_unknown.0 I got this error:

Extracting /tmp/teotensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/teo/tensorboard-logs/mnist-convnet
2020-04-27 08:42:54.788778: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 08:42:54.909516: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
2020-04-27 08:42:55.247518: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
2020-04-27 08:42:55.436098: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_utils.cc:408] Serializing graph to: tf_function_error_ngraph_cluster_23.json

Traceback (most recent call last):
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Caught exception while executing nGraph computation: Second argument of index must be an integer
     [[{{node ngraph_cluster_23}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 288, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 245, in main
    train_mnist_cnn(FLAGS)
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 211, in train_mnist_cnn
    keep_prob: 1.0
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 731, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 5579, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Caught exception while executing nGraph computation: Second argument of index must be an integer
     [[{{node ngraph_cluster_23}}]]

With device metal_amd_radeon_pro_560.0 I got this error:

Extracting /tmp/teotensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/teo/tensorboard-logs/mnist-convnet
2020-04-27 08:41:32.003804: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 08:41:32.122316: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
2020-04-27 08:41:32.364542: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
[ERR] 2020-04-27T05:41:52z src/ngraph/runtime/plaidml/plaidml_logger.cpp 46 NewComputePipelineState(kernel_c6_sdk_16) failed: Compiler encountered an internal error
[ERR] 2020-04-27T05:41:52z src/ngraph/runtime/plaidml/plaidml_logger.cpp 46 Source code: 
// Original:
// X_T38[d0 : _T53] = <(X_T37[d0, d1])
// With Index Variables Made Integral:
// X_T38[d0 : _T53] = <(X_T37[d0, d1]), 500000000 + d0 < 1000000000, 500000000 + d1 < 1000000000
// Constraints:{ 0 <= d1 < 10, 0 <= d0 < 50, 0 <= d0 < 50, 0 <= 500000000 + d0 < 1000000000, 0 <= 500000000 + d1 < 1000000000 }
// Merged Parallel Constraints:{ 0 <= d1 < 10, 0 <= d0 < 50 }
// Defracted:
// X_T38[d0 : _T53] = <(X_T37[d0, d1]), 500000000 + d0 < 1000000000, 500000000 + d1 < 1000000000
// Flattened:
//              Range     X_T38     X_T37  
//       d0        50         1        10  
//       d1        10         0         1  
//      off                   0         0  
//      vec                   1         1  
// 
// Names: { d0, d1 }
// Ranges: { 50, 10 }
// Out stride: { 1, 0 }
// Input 1 offset: 0
// Input 1 stride: { 10, 1 }
// Elementwise input X_T33 shape: i64(50):(1):400 bytes
// Elementwise op: X_T39 = as_int(X_T38, X_I_3_0)
// Elementwise op: X_T40 = cmp_eq(X_T33, X_T39)
// Elementwise op: X_T41 = cond(X_T40, X_T4, X_T3)
// Elementwise op: X_T42 = as_int(X_T41, X_T2)
// Elementwise op: X_T43 = as_float(X_T42, X_T1)
// Tile size: { 50, 10 }
// Contraction output var shape: i32(50):(1):200 bytes
// Computed true ops: 3500
// Computed work groups: 1
// Computed inner loops: 1
// Computed shared mem: 3024
// Computed out regs: 1024
// Computed mem read: 1936
// Computed mem write: 256
// Computed operations: 256
// Computed rollups: 3
// Computed threads used: 256
// lwork = 256, 1, 1
// gwork = 256, 1, 1

kernel void kernel_c6_sdk_16(
    device void* X_T43_arg_ [[ buffer(0) ]],
    device const void* in1_arg_ [[ buffer(1) ]],
    device const void* X_T33_arg_ [[ buffer(2) ]],
    uint _tid [[ thread_index_in_threadgroup ]],
    uint3 _groupid [[ threadgroup_position_in_grid ]],
    uint3 _globalid [[ thread_position_in_grid ]]
)
{
  device float* X_T43 = static_cast<device float*>(X_T43_arg_);
  device const int* in1 = static_cast<device const int*>(in1_arg_);
  device const ptrdiff_t* X_T33 = static_cast<device const ptrdiff_t*>(X_T33_arg_);
  int tid = _tid;
  int agg[1] = {INT_MAX, };
  threadgroup int in1_shared[500];
  for (int d1_gid = 0; d1_gid < 10; d1_gid += 10)
  {
    {
      int d1_d0_tid = (tid % 256);
      for (int d1_d0_lid = 0; d1_d0_lid < 2; d1_d0_lid += 1)
      {
        int d1_d0_cond = (int)((d1_d0_lid < 1) || (d1_d0_tid < 244));
        if (d1_d0_cond)
        {
          int d1_d0 = ((256 * d1_d0_lid) + d1_d0_tid);
          int gidx = (d1_gid + d1_d0);
          in1_shared[d1_d0] = in1[clamp(gidx, 0, 499)];
        }
      }
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    int d1_tid = ((tid / 64) % 4);
    for (int d1_lid = 0; d1_lid < 3; d1_lid += 1)
    {
      int d1_cond = (int)((d1_lid < 2) || (d1_tid < 2));
      if (d1_cond)
      {
        int d1 = ((4 * d1_lid) + d1_tid);
        int d0_tid = (tid % 64);
        int d0_cond = (int)(d0_tid < 50);
        int d0 = select((int)0, (int)d0_tid, (int)d0_cond);
        int val1 = in1_shared[(d1 + (10 * d0))];
        int agg_rhs = select((int)agg[0], (int)val1, (bool)(val1 < agg[0]));
        agg[0] = select((int)agg[0], (int)agg_rhs, (bool)d0_cond);
      }
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
  }
  threadgroup int merge_shared[256];
  {
    merge_shared[tid] = agg[0];
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if ((tid < 128))
    {
      merge_shared[tid] = select((int)merge_shared[tid], (int)merge_shared[(tid + 128)], (bool)(merge_shared[(tid + 128)] < merge_shared[tid]));
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if ((tid < 64))
    {
      merge_shared[tid] = select((int)merge_shared[tid], (int)merge_shared[(tid + 64)], (bool)(merge_shared[(tid + 64)] < merge_shared[tid]));
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if ((tid < 64))
    {
      agg[0] = merge_shared[tid];
    }
  }
  int d0_tid = (tid % 64);
  int d0_cond = (int)(d0_tid < 50);
  if (d0_cond)
  {
    if ((tid < 64))
    {
      int LX_T38 = agg[0];
      ptrdiff_t LX_T33 = X_T33[d0_tid];
      ptrdiff_t LX_T39 = LX_T38;
      bool LX_T40 = (LX_T33 == LX_T39);
      int LX_T41 = select((char)0, (char)1, (bool)LX_T40);
      char LX_T42 = LX_T41;
      float LX_T43 = (float)LX_T42;
      X_T43[d0_tid] = LX_T43;
    }
  }
}

2020-04-27 08:41:52.578057: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_utils.cc:408] Serializing graph to: tf_function_error_ngraph_cluster_23.json

Traceback (most recent call last):
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Caught exception while executing nGraph computation: Compiler encountered an internal error
     [[{{node ngraph_cluster_23}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 288, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 245, in main
    train_mnist_cnn(FLAGS)
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 211, in train_mnist_cnn
    keep_prob: 1.0
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 731, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 5579, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Caught exception while executing nGraph computation: Compiler encountered an internal error
     [[{{node ngraph_cluster_23}}]]
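
For reference, the Tile operations listed in the kernel comments above can be read as a row-wise minimum followed by an equality test against a second vector. The following is a minimal pure-Python sketch of that reading, not the PlaidML implementation. The function name `reference_kernel` is hypothetical, and it assumes the constants `X_T4` and `X_T3` are 1 and 0, as the generated `select((char)0, (char)1, ...)` suggests.

```python
def reference_kernel(x_t37, x_t33):
    """Plain-Python reading of kernel_c6_sdk_16 (assumption: X_T4 = 1, X_T3 = 0).

    x_t37: list of rows of integers (50 x 10 in the log above).
    x_t33: list of integers, one per row.
    """
    out = []
    for row, expected in zip(x_t37, x_t33):
        reduced = min(row)  # X_T38[d0] = <(X_T37[d0, d1]), a min over d1
        # cmp_eq, cond and as_float collapse to 1.0 on a match and 0.0 otherwise
        out.append(1.0 if expected == reduced else 0.0)
    return out
```

Comparing a small input against this reference on the device that works may help narrow down whether the failing backends reject the reduction or the 64-bit comparison.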

With device opencl_intel_hd_graphics_630.0 I got this error:

Extracting /tmp/teotensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/teo/tensorboard-logs/mnist-convnet
2020-04-27 08:44:12.478173: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 08:44:12.600435: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
2020-04-27 08:44:13.708775: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
2020-04-27 08:44:14.494447: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_utils.cc:408] Serializing graph to: tf_function_error_ngraph_cluster_23.json

Traceback (most recent call last):
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Caught exception while executing nGraph computation: Second argument of index must be an integer
     [[{{node ngraph_cluster_23}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 288, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 245, in main
    train_mnist_cnn(FLAGS)
  File "/Users/teo/ngraph-bridge/build_cmake/examples/mnist/mnist_deep_simplified.py", line 211, in train_mnist_cnn
    keep_prob: 1.0
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 731, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 5579, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/Users/teo/ngraph-bridge/build_cmake/venv-tf-py3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Caught exception while executing nGraph computation: Second argument of index must be an integer
     [[{{node ngraph_cluster_23}}]]

However, device opencl_amd_radeon_pro_560_compute_engine.0 works fine for some reason:

Extracting /tmp/teotensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/teotensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/teo/tensorboard-logs/mnist-convnet
2020-04-27 08:45:42.243580: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-27 08:45:42.363910: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
2020-04-27 08:45:42.891926: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
step 0, training accuracy 0.1, 1.17623 sec to evaluate
2020-04-27 08:45:44.105810: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML
step 0, loss 2.35827, 3.73834 sec for training step
step 1, loss 2.36827, 0.172614 sec for training step
...
step 998, loss 0.0873338, 0.149775 sec for training step
step 999, loss 0.201209, 0.150949 sec for training step
Training finished. Running test
test accuracy 0.9601
2020-04-27 08:48:33.164103: I /Users/teo/ngraph-bridge/ngraph_bridge/ngraph_rewrite_pass.cc:205] NGraph using backend: PlaidML

zap-wizard commented 4 years ago

Also, for some reason TensorFlow <-> nGraph <-> PlaidML is a lot slower than Keras <-> PlaidML on the one device that works (opencl_amd_radeon_pro_560_compute_engine.0).

quidquid commented 4 years ago

I'm not sure about that error, but I'm successfully using TF + PlaidML + nGraph on a MacBook Pro (Catalina). Which version of TF are you using? Here's a recipe that works for me with both TF 1.14 and 1.15:

virtualenv ve-plaidml
source ve-plaidml/bin/activate
pip install plaidml-keras
pip install tensorflow==1.14.0
pip install ngraph-tensorflow-bridge

# Then, in Python:
import os
import tensorflow as tf
import ngraph_bridge

print('TensorFlow version: ', tf.__version__)
print(ngraph_bridge.__version__)

os.environ['PLAIDML_EXPERIMENTAL'] = '1'
os.environ['PLAIDML_DEVICE_IDS'] = 'metal_amd_radeon_pro_5500m.0'
ngraph_bridge.set_backend('PLAIDML')

# Now you can use tf/ng like usual:
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
config = ngraph_bridge.update_config(config)

Any luck if you add those PlaidML config statements to the start of the Python example?
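
To repeat the same run on each device without editing the script every time, something like the following might help. It is only a sketch using the Python standard library: the helper names (`plaidml_env`, `last_error_line`, `run_on_device`) are hypothetical, and it assumes the script being launched selects the PlaidML backend itself (for example with the `set_backend` call from the recipe) and picks up `PLAIDML_DEVICE_IDS` from the environment.

```python
import os
import subprocess


def plaidml_env(device_id, base_env=None):
    """Return a copy of the environment with the two PlaidML variables from the recipe set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PLAIDML_EXPERIMENTAL"] = "1"
    env["PLAIDML_DEVICE_IDS"] = device_id
    return env


def last_error_line(output):
    """Return the last output line that mentions InternalError, or None if there is none."""
    hits = [line.strip() for line in output.splitlines() if "InternalError" in line]
    return hits[-1] if hits else None


def run_on_device(command, device_id):
    """Run command (a list of argv strings) for one device; return (exit code, error line)."""
    proc = subprocess.run(
        command,
        env=plaidml_env(device_id),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # tracebacks go to stderr, so merge the streams
        universal_newlines=True,
    )
    return proc.returncode, last_error_line(proc.stdout)
```

Calling `run_on_device([sys.executable, "mnist_deep_simplified.py"], "metal_amd_radeon_pro_560.0")` once per device ID from the report would give one exit code and one error summary per device, which is easier to compare than the full tracebacks.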

zap-wizard commented 4 years ago

Thanks for the help, but it's still not working. However, I have figured out that only a few NN layers fail, because some of my models work correctly.
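
One way to narrow down which layers are involved might be to compare the serialized error dumps (such as tf_function_error_ngraph_cluster_23.json) from several failing models. The sketch below is schema-agnostic on purpose: it assumes only that each operation in the dump is a JSON object with a string-valued "op" key, which I have not verified against the nGraph serializer. The function names are hypothetical.

```python
import json
from collections import Counter


def count_ops(node, counts=None):
    """Recursively count every string found under an "op" key (assumed dump layout)."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        op = node.get("op")
        if isinstance(op, str):
            counts[op] += 1
        for value in node.values():
            count_ops(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_ops(item, counts)
    return counts


def ops_in_dump(path):
    """Load one serialized cluster dump and count the op types it contains."""
    with open(path) as handle:
        return count_ops(json.load(handle))


def common_ops(counters):
    """Return the op types present in every dump, sorted; these are the first suspects."""
    counters = list(counters)
    if not counters:
        return []
    shared = set(counters[0])
    for counter in counters[1:]:
        shared &= set(counter)
    return sorted(shared)
```

Op types that show up in every failing dump but are absent from the models that run correctly would be the ones worth reporting here.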