'tf.random_normal' broken on Ubuntu 16.04/NVIDIA
Closed: hughperkins closed this issue 7 years ago.
It would be important to learn whether NVIDIA or Ubuntu is to blame, because NVIDIA has been accused of deliberately leaving their OpenCL drivers in a buggy/bad/slow state.
I ought to have a new AMD GPU later today, will try to test this out.
It would be interesting to consider a "software" fallback, though; e.g. an OpenCL kernel that, seeded by system entropy from /dev/urandom, could generate ~CSPRNG output without relying on hardware entropy sources on the card. I wouldn't use it for crypto, but it would be fine for seeding random distributions. Some keystream ciphers are very minimal and might be easy to implement, although most generate random ints rather than floats; I'm not sure how irritating it is to type-cast from within OpenCL kernels.
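A rough host-side sketch of just the seeding part (hypothetical names; the kernel_key would be passed to the OpenCL generator kernel as an argument):

import os
import struct

# pull a 64-bit seed from the OS entropy pool (/dev/urandom-backed on Linux)
kernel_key = struct.unpack('<Q', os.urandom(8))[0]
print('kernel key: 0x%016x' % kernel_key)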
By default, if we choose not to register a GPU kernel, it will use a CPU kernel in its place.
I'd be very interested to know if this is Ubuntu-specific or NVIDIA-specific. This is key information for deciding the future of this issue.
Just booted into my AMDGPU-pro driver environment for the AMD R9 390, Ubuntu 16.04, Intel i5, and ran tests:
cathal@thinkum:~/tf-coriander$ py.test
================================================================= test session starts =================================================================
platform linux -- Python 3.5.2, pytest-3.1.0, py-1.4.33, pluggy-0.4.0
rootdir: /home/cathal/tf-coriander, inifile: pytest.ini
plugins: pep8-1.0.6
collected 99 items
tensorflow/stream_executor/cl/test/conftest.py .
tensorflow/stream_executor/cl/test/measure_binary_ops_perf.py .
tensorflow/stream_executor/cl/test/measure_reduction_ops_perf_bybatchsize.py .
tensorflow/stream_executor/cl/test/measure_reductions_perf.py .
tensorflow/stream_executor/cl/test/measure_unary_ops_perf.py .
tensorflow/stream_executor/cl/test/measure_unary_ops_perf_bybatchsize.py .
tensorflow/stream_executor/cl/test/run_unary_op.py .
tensorflow/stream_executor/cl/test/test_binary_ops.py .xx................
tensorflow/stream_executor/cl/test/test_blas.py ..
tensorflow/stream_executor/cl/test/test_common.py .
tensorflow/stream_executor/cl/test/test_gradients.py ..
tensorflow/stream_executor/cl/test/test_loss.py ..
tensorflow/stream_executor/cl/test/test_misc.py .........s
tensorflow/stream_executor/cl/test/test_nn.py ..
tensorflow/stream_executor/cl/test/test_random.py .FF..
tensorflow/stream_executor/cl/test/test_reductions.py ...........................
tensorflow/stream_executor/cl/test/test_simple.py ..
tensorflow/stream_executor/cl/test/test_softmax.py ....
tensorflow/stream_executor/cl/test/test_unary_ops.py ................
------------------------------------- generated xml file: /home/cathal/tf-coriander/test/junit-pytest-report.xml --------------------------------------
=============================================================== short test summary info ===============================================================
FAIL tensorflow/stream_executor/cl/test/test_random.py::test_random_normal[shape0]
FAIL tensorflow/stream_executor/cl/test/test_random.py::test_random_normal[shape1]
SKIP [1] tensorflow/stream_executor/cl/test/test_misc.py:158: Need to fix passing float** to kernel for this to work
XFAIL tensorflow/stream_executor/cl/test/test_binary_ops.py::test[uint8-div-a / b]
XFAIL tensorflow/stream_executor/cl/test/test_binary_ops.py::test[uint8-mul-a * b]
====================================================================== FAILURES =======================================================================
_____________________________________________________________ test_random_normal[shape0] ______________________________________________________________
shape = (3, 4)
    @pytest.mark.parametrize(
        'shape',
        shapes)
    def test_random_normal(shape):
        with tf.Graph().as_default():
            with tf.device('/gpu:0'):
                W_t = tf.Variable(tf.random_normal(shape))
                mu_t = tf.reduce_mean(W_t)
                var_t = tf.reduce_mean(W_t * W_t)
            with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
                sess.run(tf.initialize_all_variables())
                W, mu, var = sess.run((W_t, mu_t, var_t))
                if np.prod(W.shape) < 20:
                    print('W', W)
                else:
                    print('W.reshape(-1)[:20]', W.reshape(-1)[:20])
                print('mu', mu, 'var', var)
                assert abs(mu) < 1.0
>               assert var > 0.05
E               assert 0.0 > 0.05
tensorflow/stream_executor/cl/test/test_random.py:34: AssertionError
---------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------
W [[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
mu 0.0 var 0.0
---------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Hawaii, pci bus id: 0000.0000)
_____________________________________________________________ test_random_normal[shape1] ______________________________________________________________
shape = (50, 70, 12)
    @pytest.mark.parametrize(
        'shape',
        shapes)
    def test_random_normal(shape):
        with tf.Graph().as_default():
            with tf.device('/gpu:0'):
                W_t = tf.Variable(tf.random_normal(shape))
                mu_t = tf.reduce_mean(W_t)
                var_t = tf.reduce_mean(W_t * W_t)
            with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
                sess.run(tf.initialize_all_variables())
                W, mu, var = sess.run((W_t, mu_t, var_t))
                if np.prod(W.shape) < 20:
                    print('W', W)
                else:
                    print('W.reshape(-1)[:20]', W.reshape(-1)[:20])
                print('mu', mu, 'var', var)
                assert abs(mu) < 1.0
>               assert var > 0.05
E               assert 0.0 > 0.05
tensorflow/stream_executor/cl/test/test_random.py:34: AssertionError
---------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------
W.reshape(-1)[:20] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
mu 0.0 var 0.0
---------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Hawaii, pci bus id: 0000.0000)
============================================== 2 failed, 94 passed, 1 skipped, 2 xfailed in 5.64 seconds ==============================================
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Hawaii, pci bus id: 0000.0000
c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Hawaii, pci bus id: 0000.0000
c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
Consider this the same output for #13 and #34 also; no segfaults when I manually ran the MNIST example, and the tests didn't all segfault either. :)
So aside from the RNG, things are looking very good for the most recent release! How much do the other tests rely on the RNG, though? If the RNG is always returning 0, then how do the other tests pass? Is it simply not used for seeding the weights of networks, generally?
PS, I'll shortly be wiping and switching to the ROCm driver on a different card, so I don't know if I can continue providing useful test output for the AMDGPU-pro driver + R9 390, at least for a while (until I can afford another MoBo to plug it in! :) ).
If the RNG is always returning 0, then how do the other tests pass? Is it simply not used for seeding the weights of networks, generally?
Well, the nets will be poorly initialized, but they will still learn, slowly.
Thank you for the information that this is a general problem on Ubuntu, not specific to NVIDIA hardware. I will take a look.
Updates on this:
I'm going to run the algorithm on the cpu, and compare it to the gpu result, and see how that goes. The implementation in tensorflow is here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/random/philox_random.h
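As a reference for what the kernel should be computing, here is a rough Python sketch of a single Philox-4x32-10 block, following the structure of that header (constants copied from philox_random.h; note this ignores the graph/op seed mixing that the tf op layers on top, so on its own it won't reproduce the exact numbers below):

# Sketch of one Philox-4x32-10 block, per tensorflow's philox_random.h.
M0, M1 = 0xD2511F53, 0xCD9E8D57   # kPhiloxM4x32A / kPhiloxM4x32B
W0, W1 = 0x9E3779B9, 0xBB67AE85   # kPhiloxW32A / kPhiloxW32B
MASK = 0xFFFFFFFF

def philox4x32(counter, key, rounds=10):
    x = list(counter)             # four 32-bit counter words
    k0, k1 = key                  # two 32-bit key words
    for _ in range(rounds):
        p0 = M0 * x[0]            # 64-bit products; split into hi/lo below
        p1 = M1 * x[2]
        x = [((p1 >> 32) ^ x[1] ^ k0) & MASK, p1 & MASK,
             ((p0 >> 32) ^ x[3] ^ k1) & MASK, p0 & MASK]
        k0 = (k0 + W0) & MASK     # "raise" the key between rounds
        k1 = (k1 + W1) & MASK
    return x                      # four pseudo-random 32-bit words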
Created a simple cpu-side, easy-to-run, easy-to-introspect PhiloxRandom script at https://github.com/hughperkins/pub-prototyping/blob/master/cpp/testphilox.cpp , which calls into a copied/modified version of the tensorflow PhiloxRandom generator at https://github.com/hughperkins/pub-prototyping/blob/master/cpp/from_tf/philox_random.h
This then gives the same outputs as calling tf.random_uniform, with seed 123:
tensorflow output:
W_cpu [[ 0.04080963 0.20842123 0.09180295 0.70220065]
[ 0.7073133 0.39646494 0.06650937 0.29188633]
[ 0.02963269 0.95492315 0.00610673 0.3169049 ]]
test script output:
0: 1887779136 0.0408096
1: 1293593996 0.208421
2: 3473653811 0.091803
3: 257548726 0.702201
4: 727353662 0.707313
5: 2159198045 0.396465
6: 3498607457 0.0665094
7: 3324337288 0.291886
8: 1115933441 0.0296327
9: 1274690284 0.954923
10: 3053504539 0.00610673
11: 3861418071 0.316905
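As an aside, the int-to-float mapping visible in those pairs is tensorflow's Uint32ToFloat trick (in random_distributions.h): keep the low 23 bits as the mantissa, force the exponent to 127 to land in [1.0, 2.0), then subtract 1.0. A quick sketch that reproduces the pairs above:

import struct

def uint32_to_float(x):
    # exponent = 127, mantissa = low 23 bits -> a float in [1.0, 2.0)
    bits = (127 << 23) | (x & 0x7FFFFF)
    return struct.unpack('<f', struct.pack('<I', bits))[0] - 1.0

for u in (1887779136, 1293593996, 3473653811, 257548726):
    print(u, uint32_to_float(u))
# -> 0.0408096..., 0.208421..., 0.091803..., 0.702201... (matching above)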
I'm then going to work my way through the opencl kernel, comparing the values in the kernel with those on the cpu, and finding where they start to differ. I'm going to use COCL_DUMP_CL, COCL_LOAD_CL, and COCL_DUMP_CONFIG to inject values I'm interested in into an output buffer, and inspect them. https://github.com/hughperkins/coriander/blob/master/doc/advanced_usage.md#runtime-options
Sounds really promising! Can't wait to see this one fixed; I have a feeling a lot of the test cases are "passing" but aren't really behaving as intended, because of poor initialisation. In the meantime I was planning to just monkey-patch this with the numpy-derived random hack mentioned above, and see how that improves things. :P
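Roughly what I have in mind (a sketch with hypothetical names; it only handles plain-tuple shapes, not shape tensors, and ignores the dtype argument):

import numpy as np
import tensorflow as tf

def np_random_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32,
                     seed=None, name=None):
    # back tf.random_normal with numpy's RNG, so variables get real random
    # initial values despite the broken GPU kernel (float32 assumed)
    values = np.random.RandomState(seed).normal(mean, stddev, size=shape)
    return tf.constant(values.astype(np.float32), name=name)

tf.random_normal = np_random_normal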
The test cases don't need initialization: they test specific operations. Please check them, and let me know any test cases you feel are weak, and how we can improve them.
I mean, when you do py.test -v, those test specific operations. The end-to-end tests are the scripts we are using from Aymeric Damien's Tensorflow-Examples.
But yeah, monkey-patching will work, as long as you don't use AdamOptimizer, which initializes itself from tf.random. It'll initialize itself really poorly currently, and learning will suck. If you use the SGD optimizer, it should work ok-ish.
Sorry, I meant the Tensorflow-Examples repo, yes. :) When I was running those previously I noticed that some of the examples weren't learning at any appreciable rate, which was probably down to the randn initialisation being poor. Or worse, if things were being set to 0, then perhaps a lot of neurons were simply dead. I'm running it all again after monkey-patching Numpy's randn in, and they seem to be learning again.
Good to know about Adam! Possibly the monkey-patch would have to be applied further up the chain to work on Adam, then... This is all running in a virtualenv, so I don't mind doing radical surgery on the Random lib if it lets me use TF. :P
...I'm perusing the code for AdamOptimizer, but I don't actually see any calls to tf.random in it or its superclass. Is this instead happening somewhere else, like in tf.initialize_all_variables?
Hmmm, just noticed my earlier reply failed to send:
Sorry, I meant the Tensorflow-Examples repo, yes. :) When I was running those previously I noticed that some of the examples weren't learning at any appreciable rate, which was probably down to the randn initialisation being poor
Yes, that's true. You are right.
Possibly the monkey-patch would have to be applied further up the chain to work on Adam, then.
Possibly, but that sounds like a lot of work.
And thence onto new reply :-)
...I'm perusing the code for AdamOptimizer, but I don't actually see any calls to tf.random in it or its superclass. Is this instead happening somewhere else, like in tf.initialize_all_variables?
So, what I would probably do in the first instance is view the graph in tensorboard. tensorboard rocks :-) . https://www.tensorflow.org/get_started/graph_viz
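Something like this dumps the graph for tensorboard to render (tf.train.SummaryWriter on the older tf that tf-coriander builds on; newer tf renames it tf.summary.FileWriter; the log dir is arbitrary):

import tensorflow as tf

with tf.Graph().as_default() as g:
    # ... build the model, including the AdamOptimizer, here ...
    writer = tf.train.SummaryWriter('/tmp/tf_logs', graph=g)
    writer.close()
# then run: tensorboard --logdir=/tmp/tf_logs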
Great advice, thanks!
Update: the tf.random_uniform enhanced tests work on Mac now :-) . Means tf.random_uniform gives the same results as the cpu now, not like before. Fixed in https://github.com/hughperkins/tf-coriander/commit/4c1c54473926b3f8a9365bbbcd653041bb1108e2 , but specifically in https://github.com/hughperkins/coriander/commit/6452be7c5b7b426d88b3fb4c9e7f1d679b5ced6d
I'm hoping that the experience in PhiloxRandom for Mac will help me to fix the Ubuntu version.
(well... that's odd... the opencl generated on Mac is different from that on Ubuntu. That's quite unexpected... since the opencl generation is my own code, therefore invariant, and it's based on compilation using llvm-4.0, which is also invariant. Oh... different standard libraries. Maybe that's it.)
Mac sample:
struct class_tensorflow__random__Array {
    int f0[4];
};
struct class_tensorflow__random__Array_0 {
    int f0[2];
};
struct class_tensorflow__random__NormalDistribution {
    char f0;
};
struct class_tensorflow__random__PhiloxRandom {
    struct class_tensorflow__random__Array f0;
    struct class_tensorflow__random__Array_0 f1;
};
Ubuntu sample:
struct class_tensorflow__random__Array {
    int f0[4];
};
struct class_tensorflow__random__Array_0 {
    int f0[2];
};
struct class_tensorflow__random__NormalDistribution {
    char f0;
};
struct class_tensorflow__random__Array_1 {
    float f0[4];
};
struct class_tensorflow__random__PhiloxRandom {
    struct class_tensorflow__random__Array f0;
    struct class_tensorflow__random__Array_0 f1;
};
An entire extra struct. Exact same kernel being compiled...
Oh, I just found out (as I'm writing this) why random_normal is broken, I reckon. There is a shim for sincosf, but the mangled names are different, so on Mac it is correctly shimmed, but not on Ubuntu. That should be easy to fix...
End of OpenCL on Mac:
/* int v119 = phi v304 */
v119 = v304;
goto v3;
} else {
goto v10;
}
v10:;
goto v11;
v11:;
return;
}
End of OpenCL on Ubuntu:
goto v3;
} else {
goto v10;
}
v10:;
goto v11;
v11:;
return;
}
void _Z7sincosffPfS_(float v1, float* v2, float* v3, local int *scratch) {
}
That sincos bit has a different mangled name than on Mac, and wasn't detected, but it's easy to tell Coriander the Ubuntu name, and thus detect it. (_Z7sincosffPfS_ demangles to sincosf(float, float*, float*).)
Oh... I've found the difference... in Tensorflow, in the random distributions code, they have the following:
#if defined(__linux__)
    sincosf(v1, f0, f1);
#else
    *f0 = sinf(v1);
    *f1 = cosf(v1);
#endif
Different code, depending on linux or not :-O
Urgh, so this is an upstream issue: it assumes that Linux's available libs (CUDA?) have an extra function defined? Is that something fixable within the Coriander framework, just by delegating to sin/cos?
It's easy to fix in Coriander, by writing an appropriate shim. There are some shims already: https://github.com/hughperkins/coriander/blob/master/src/shims.cpp
Well... easy-ish, since it will be the first shim that needs to handle pointers, and pointers in opencl need their address spaces statically declared. That means I'll need a shim for every combination of address spaces used by the code. I.e., in opencl:
float angle = 0.1f;
float pa, pb;
float *a = &pa;                  // private pointers
float *b = &pb;
global float *globala;           // imagine these point into a global buffer
global float *globalb;
local float localb[30];
sincos(angle, a, b);             // both private
sincos(angle, globala, globalb); // both global
sincos(angle, a, globalb);       // one private, one global...
sincos(angle, a, localb);        // one private, one local (shared)...
Since OpenCL 1.2 is C99, you can only use the name sincos once, for one pair of address spaces, so there'll need to be a unique function name for every pair of address spaces used, e.g. sincos_p_p, sincos_g_g, ... So I'll need to write code to handle that. It's ok, I have various ideas on how to handle this; not a challenging issue, but it might take a few man-hours or so.
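One way to mechanize that, sketched in Python (the sincos_p_g-style names are hypothetical, just illustrating the scheme):

# emit one OpenCL sincos wrapper per pair of address spaces
SPACES = {'p': '', 'g': 'global ', 'l': 'local '}

def sincos_variant(s1, s2):
    return ('void sincos_%s_%s(float x, %sfloat *s, %sfloat *c) {\n'
            '    *s = sin(x);\n'
            '    *c = cos(x);\n'
            '}\n') % (s1, s2, SPACES[s1], SPACES[s2])

for a in SPACES:
    for b in SPACES:
        print(sincos_variant(a, b))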
Fixed in https://github.com/hughperkins/tf-coriander/commit/7d2deb9869194b046d82834d77d1d358dd674532 :-) . Turned out it didn't need any shims, or address-space stuff. Yay :-)
This is great! So, aside from Splitting, this is almost fully-functioning Tensorflow on CL now? :o Congratulations!
Split and conv, but yeah :-) Thanks! :-)
(well, "fully-functioning". more like "has enough functionality to run some basic conv nets". It'd still be nice to have eg batchnormalization, and some other things, ideally. )
Question: what model(s) do you need to run in priority? I can't do everything at once, so if I know which model(s) you are targeting in priority, I can look at those first.
Honestly, not sure at this stage. I'll probably be working mostly with text-based models, RNNs/LSTMs etcetera, for the moment. I'd love to start playing with convnets but I'm not at that yet. :)
So for me, even being able to correctly initialise nets is :balloon: :tada:
Cool :-)
Created a new v0.17.3 wheel. Don't suppose... do you mind double-checking that tf.random_normal is working ok for you now? Also, maybe run all the unit tests too, i.e. py.test -v?
Already on it! :)
:-) . By the way, I'm not sure if the RPATH stuff is correct on Ubuntu. If it still fails after installing the wheel, it might be because it's using the libcocl.so in /usr/local/lib. (Note: coriander != tf_coriander; coriander is the underlying compiler.)
Looking good for the py.test output. :)
I'm sorry to say I don't know what you're referring to re: "RPATH". Happy to help test, if you give me context and/or suggested shell commands to interrogate this.
Going to start running a few TensorFlow-Examples scripts now, also.
I'm sorry to say I don't know what you're referring to re: "RPATH". Happy to help test, if you give me context and/or suggested shell commands to interrogate this.
As long as py.test -v is passing, it's all good. Good news that that is passing now :-)
So far, all the examples I've tried from TensorFlow-Examples have worked perfectly, after I shim out the GPU-specifying context managers. I'm training convolutional_network.py now and it's learning. I don't know how well it's learning, but the loss is generally going down, and the accuracy is generally going up, so I'll mark that down as "success" for now. :)
Thanks again!
Cool! :-)