hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0

'tf.random_normal' broken on Ubuntu 16.04 #35

Closed hughperkins closed 7 years ago

hughperkins commented 7 years ago

'tf.random_normal' broken on Ubuntu 16.04/NVIDIA

ghost commented 7 years ago

Would be important to learn whether NVidia or Ubuntu are to blame, because NVidia have been accused of deliberately leaving their OpenCL drivers in a buggy/bad/slow state.

I ought to have a new AMD GPU later today, will try to test this out.

It would be interesting to consider a "software" fallback, though; e.g. an OpenCL kernel that, seeded by system entropy from /dev/urandom, could generate ~CSPRNG output without relying on hardware entropy sources on the card. I wouldn't use it for crypto, but it would be fine for seeding random distributions? Some key-streams are very minimal and might be easy to implement, although most generate random ints, rather than floats. I'm not sure how irritating it is to type-cast from within OpenCL kernels.
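A rough sketch of that idea, in Python rather than OpenCL. It assumes a plain xorshift32 generator, picked only because it is tiny (it is not a CSPRNG, and not what TensorFlow or Coriander use), seeded from system entropy. The function names are made up for illustration. It suggests the int-to-float step is mostly a shift and a divide:

```python
import os
import struct


def xorshift32(state):
    """Advance a 32-bit xorshift state (shift constants 13, 17, 5)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state & 0xFFFFFFFF


def seed_from_system_entropy():
    """Read 4 bytes of system entropy; xorshift must not start at zero."""
    seed = struct.unpack("<I", os.urandom(4))[0]
    return seed or 1


def uniform_floats(count, state):
    """Return `count` floats in [0, 1) by scaling the top 24 bits of each int."""
    out = []
    for _ in range(count):
        state = xorshift32(state)
        out.append((state >> 8) / float(1 << 24))
    return out, state


samples, _ = uniform_floats(5, seed_from_system_entropy())
print(samples)
```

Keeping only 24 bits means every result fits exactly in a float32 mantissa, so the same arithmetic should carry over to a kernel without rounding surprises.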

hughperkins commented 7 years ago

By default, if we choose not to register a GPU kernel, it will use a CPU kernel in its place.

I'd be very interested to know if this is Ubuntu specific or NVIDIA specific. This is key information for deciding the future of this issue.

ghost commented 7 years ago

Just booted into my AMDGPU-pro driver environment for the AMD R9 390, Ubuntu 16.04, Intel i5, and ran tests:

cathal@thinkum:~/tf-coriander$ py.test
================================================================= test session starts =================================================================
platform linux -- Python 3.5.2, pytest-3.1.0, py-1.4.33, pluggy-0.4.0
rootdir: /home/cathal/tf-coriander, inifile: pytest.ini
plugins: pep8-1.0.6
collected 99 items 

tensorflow/stream_executor/cl/test/conftest.py .
tensorflow/stream_executor/cl/test/measure_binary_ops_perf.py .
tensorflow/stream_executor/cl/test/measure_reduction_ops_perf_bybatchsize.py .
tensorflow/stream_executor/cl/test/measure_reductions_perf.py .
tensorflow/stream_executor/cl/test/measure_unary_ops_perf.py .
tensorflow/stream_executor/cl/test/measure_unary_ops_perf_bybatchsize.py .
tensorflow/stream_executor/cl/test/run_unary_op.py .
tensorflow/stream_executor/cl/test/test_binary_ops.py .xx................
tensorflow/stream_executor/cl/test/test_blas.py ..
tensorflow/stream_executor/cl/test/test_common.py .
tensorflow/stream_executor/cl/test/test_gradients.py ..
tensorflow/stream_executor/cl/test/test_loss.py ..
tensorflow/stream_executor/cl/test/test_misc.py .........s
tensorflow/stream_executor/cl/test/test_nn.py ..
tensorflow/stream_executor/cl/test/test_random.py .FF..
tensorflow/stream_executor/cl/test/test_reductions.py ...........................
tensorflow/stream_executor/cl/test/test_simple.py ..
tensorflow/stream_executor/cl/test/test_softmax.py ....
tensorflow/stream_executor/cl/test/test_unary_ops.py ................

------------------------------------- generated xml file: /home/cathal/tf-coriander/test/junit-pytest-report.xml --------------------------------------
=============================================================== short test summary info ===============================================================
FAIL tensorflow/stream_executor/cl/test/test_random.py::test_random_normal[shape0]
FAIL tensorflow/stream_executor/cl/test/test_random.py::test_random_normal[shape1]
SKIP [1] tensorflow/stream_executor/cl/test/test_misc.py:158: Need to fix passing float** to kernel for this to work
XFAIL tensorflow/stream_executor/cl/test/test_binary_ops.py::test[uint8-div-a / b]
XFAIL tensorflow/stream_executor/cl/test/test_binary_ops.py::test[uint8-mul-a * b]
====================================================================== FAILURES =======================================================================
_____________________________________________________________ test_random_normal[shape0] ______________________________________________________________

shape = (3, 4)

    @pytest.mark.parametrize(
        'shape',
        shapes)
    def test_random_normal(shape):
        with tf.Graph().as_default():
            with tf.device('/gpu:0'):
                W_t = tf.Variable(tf.random_normal(shape))
                mu_t = tf.reduce_mean(W_t)
                var_t = tf.reduce_mean(W_t * W_t)

                with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
                    sess.run(tf.initialize_all_variables())
                    W, mu, var = sess.run((W_t, mu_t, var_t))
                if np.prod(W.shape) < 20:
                    print('W', W)
                else:
                    print('W.reshape(-1)[:20]', W.reshape(-1)[:20])
                print('mu', mu, 'var', var)
                assert abs(mu) < 1.0
>               assert var > 0.05
E               assert 0.0 > 0.05

tensorflow/stream_executor/cl/test/test_random.py:34: AssertionError
---------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------
W [[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
mu 0.0 var 0.0
---------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Hawaii, pci bus id: 0000.0000)
_____________________________________________________________ test_random_normal[shape1] ______________________________________________________________

shape = (50, 70, 12)

    @pytest.mark.parametrize(
        'shape',
        shapes)
    def test_random_normal(shape):
        with tf.Graph().as_default():
            with tf.device('/gpu:0'):
                W_t = tf.Variable(tf.random_normal(shape))
                mu_t = tf.reduce_mean(W_t)
                var_t = tf.reduce_mean(W_t * W_t)

                with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
                    sess.run(tf.initialize_all_variables())
                    W, mu, var = sess.run((W_t, mu_t, var_t))
                if np.prod(W.shape) < 20:
                    print('W', W)
                else:
                    print('W.reshape(-1)[:20]', W.reshape(-1)[:20])
                print('mu', mu, 'var', var)
                assert abs(mu) < 1.0
>               assert var > 0.05
E               assert 0.0 > 0.05

tensorflow/stream_executor/cl/test/test_random.py:34: AssertionError
---------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------
W.reshape(-1)[:20] [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]
mu 0.0 var 0.0
---------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Hawaii, pci bus id: 0000.0000)
============================================== 2 failed, 94 passed, 1 skipped, 2 xfailed in 5.64 seconds ==============================================
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Hawaii, pci bus id: 0000.0000
c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Hawaii, pci bus id: 0000.0000
c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0

Consider this the same output for #13 and #34 also; no segfaults when I manually ran the MNIST example, and the tests didn't all segfault either. :)

So aside from the RNG, things are looking very good for the most recent release! How much is the RNG relied-upon for the other tests, though? If the RNG is always returning 0, then how do the other tests pass? Is it simply not used for seeding the weights of networks, generally?
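The failing check itself can be reproduced without a GPU. A minimal sketch, using Python's random.gauss as a stand-in for a working generator; the helper name passes_random_normal_check is made up here, and the thresholds are copied from the test listing above:

```python
import random


def passes_random_normal_check(values):
    """Apply the same two thresholds as test_random_normal to a flat list."""
    n = len(values)
    mu = sum(values) / n
    second_moment = sum(v * v for v in values) / n  # the test calls this `var`
    return abs(mu) < 1.0 and second_moment > 0.05


all_zeros = [0.0] * 12  # what the broken kernel returned for shape (3, 4)
rng = random.Random(123)
healthy = [rng.gauss(0.0, 1.0) for _ in range(50 * 70 * 12)]

print(passes_random_normal_check(all_zeros))  # False: the second moment is 0.0
print(passes_random_normal_check(healthy))
```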

ghost commented 7 years ago

PS, I'll shortly be wiping and switching to the ROCm driver on a different card, so I don't know if I can continue providing useful test output for the AMDGPU-pro driver + R9 390, at least for a while (until I can afford another MoBo to plug it in! :) ).

hughperkins commented 7 years ago

> If the RNG is always returning 0, then how do the other tests pass? Is it simply not used for seeding the weights of networks, generally?

Well, the nets will be poorly initialized, but they will still learn, slowly.

Thank you for the information that this is a general problem on Ubuntu, not specific to NVIDIA hardware. I will take a look.

hughperkins commented 7 years ago

Updates on this:

I'm going to run the algorithm on the cpu, and compare it to the gpu result, and see how that goes. The implementation in tensorflow is here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/random/philox_random.h

hughperkins commented 7 years ago

Created simple cpu-side, easy-to-run, easy-to-introspect philoxrandom script at https://github.com/hughperkins/pub-prototyping/blob/master/cpp/testphilox.cpp , which calls into a copied/modified version of the tensorflow PhiloxRandom generator at https://github.com/hughperkins/pub-prototyping/blob/master/cpp/from_tf/philox_random.h

This then gives the same outputs as calling tf.random_uniform, with seed 123:

tensorflow output:

W_cpu [[ 0.04080963  0.20842123  0.09180295  0.70220065]
 [ 0.7073133   0.39646494  0.06650937  0.29188633]
 [ 0.02963269  0.95492315  0.00610673  0.3169049 ]]

test script output:

0: 1887779136 0.0408096
1: 1293593996 0.208421
2: 3473653811 0.091803
3: 257548726 0.702201
4: 727353662 0.707313
5: 2159198045 0.396465
6: 3498607457 0.0665094
7: 3324337288 0.291886
8: 1115933441 0.0296327
9: 1274690284 0.954923
10: 3053504539 0.00610673
11: 3861418071 0.316905
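The float column appears to follow from the integer column by keeping the low 23 bits as the mantissa of a float in [1, 2) and then subtracting 1. The sketch below is a reading of the numbers above, assuming IEEE 754 single precision; the function name uint32_to_float is my own:

```python
import struct


def uint32_to_float(x):
    """Map a 32-bit unsigned int to a float in [0, 1)."""
    mantissa = x & 0x7FFFFF            # keep the low 23 bits
    bits = (127 << 23) | mantissa      # exponent 127 gives a value in [1, 2)
    value = struct.unpack("<f", struct.pack("<I", bits))[0]
    return value - 1.0


for raw in (1887779136, 1293593996, 3473653811, 257548726):
    print(raw, "%.6g" % uint32_to_float(raw))
```

For these four integers the printed floats match rows 0 to 3 of the test script output.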

I'm then going to work my way through the OpenCL kernel, comparing the values in the kernel with those on the CPU, and finding where they start to differ. I'm going to use COCL_DUMP_CL, COCL_LOAD_CL, and COCL_DUMP_CONFIG to inject values I'm interested in into an output buffer, and inspect them. https://github.com/hughperkins/coriander/blob/master/doc/advanced_usage.md#runtime-options
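The comparison step can be sketched as a small helper that reports the first index at which two dumps disagree. This is only an illustration: first_divergence is a hypothetical name, the GPU values below are invented, and nothing here uses the COCL_* options themselves.

```python
def first_divergence(cpu_values, gpu_values, tolerance=1e-6):
    """Return the index of the first mismatching pair, or None if all agree."""
    for index, (cpu, gpu) in enumerate(zip(cpu_values, gpu_values)):
        if abs(cpu - gpu) > tolerance:
            return index
    if len(cpu_values) != len(gpu_values):
        # One dump is a prefix of the other: they diverge where the shorter ends.
        return min(len(cpu_values), len(gpu_values))
    return None


cpu_trace = [0.0408096, 0.208421, 0.091803, 0.702201]  # from the CPU script
gpu_trace = [0.0408096, 0.208421, 0.0, 0.0]            # made-up GPU dump
print(first_divergence(cpu_trace, gpu_trace))           # 2
```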

ghost commented 7 years ago

Sounds really promising! Can't wait to see this one fixed, I have a feeling a lot of the test cases are "passing", but aren't really behaving as intended, because of poor initialisation. In the meantime I was planning to just monkey-patch this with the numpy-derived random hack, above, and see how that improves things. :P

hughperkins commented 7 years ago

The test cases don't need initialization: they test specific operations. Please check them, and let me know any test cases you feel are weak, and how we can improve them.

hughperkins commented 7 years ago

I mean, when you do py.test -v, those test specific operations. The end-to-end tests are the scripts we are using from Aymeric Damien's Tensorflow-Examples.

hughperkins commented 7 years ago

But yeah, monkey-patching will work, as long as you don't use AdamOptimizer, which initializes itself from tf.random. It'll initialize itself really poorly currently, and learning will suck. If you use SGD optimizer, it should work ok-ish.
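For reference, a monkey-patch of this kind might look roughly like the sketch below. It is not the hack referred to earlier in the thread, which isn't shown here. The helper names are made up, and a stand-in object replaces the real tensorflow module so that the sketch runs on its own. The only thing it assumes of the module is a constant(value, dtype=...) function and a float32 attribute.

```python
import random
from types import SimpleNamespace


def cpu_random_normal(shape, mean=0.0, stddev=1.0, seed=None):
    """Build a nested list of Gaussian samples with the given shape on the CPU."""
    rng = random.Random(seed)

    def build(dims):
        if not dims:
            return rng.gauss(mean, stddev)
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(tuple(shape))


def patch_random_normal(tf_module):
    """Replace tf_module.random_normal with a CPU-generated constant."""
    def patched(shape, mean=0.0, stddev=1.0, dtype=None, seed=None, name=None):
        values = cpu_random_normal(shape, mean, stddev, seed)
        return tf_module.constant(values, dtype=dtype or tf_module.float32)

    tf_module.random_normal = patched


# Stand-in for the real module so the sketch runs without TensorFlow installed.
fake_tf = SimpleNamespace(float32="float32",
                          constant=lambda values, dtype=None: values)
patch_random_normal(fake_tf)
print(fake_tf.random_normal((2, 3), seed=123))
```

With the real library the call would presumably be patch_random_normal(tf) right after the import. Note that a constant yields the same values on every run of the graph, which is fine for initializing variables but is not a true random op.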

ghost commented 7 years ago

Sorry, I meant the Tensorflow-Examples repo, yes. :) When I was running those previously I noticed that some of the examples weren't learning at any appreciable rate, which was probably down to the randn initialisation being poor. Or worse, if things were being set to 0, then perhaps a lot of neurons were simply dead. I'm running it all again after monkey-patching Numpy's randn in, and they seem to be learning again.

Good to know about Adam! Possibly the monkey-patch would have to be applied further up the chain to work on Adam, then.. This is all running in a virtualenv, so I don't mind doing radical surgery on the Random lib if it lets me use TF. :P

ghost commented 7 years ago

...I'm perusing the code for AdamOptimizer, but I don't actually see any calls to tf.random in it or its superclass. Is this instead happening somewhere else, like in tf.initialize_all_variables?

hughperkins commented 7 years ago

Hmmm, just noticed my earlier reply failed to send:

> Sorry, I meant the Tensorflow-Examples repo, yes. :) When I was running those previously I noticed that some of the examples weren't learning at any appreciable rate, which was probably down to the randn initialisation being poor

Yes, that's true. You are right.

> Possibly the monkey-patch would have to be applied further up the chain to work on Adam, then.

Possibly, but that sounds like a lot of work.

And thence onto new reply :-)

> ...I'm perusing the code for AdamOptimizer, but I don't actually see any calls to tf.random in it or its superclass. Is this instead happening somewhere else, like in tf.initialize_all_variables?

So, what I would probably do in the first instance is view the graph in TensorBoard. TensorBoard rocks :-) . https://www.tensorflow.org/get_started/graph_viz

ghost commented 7 years ago

Great advice, thanks!

hughperkins commented 7 years ago

Update: the enhanced tf.random_uniform tests work on Mac now :-) . That means tf.random_uniform now gives the same results as the CPU, unlike before. Fixed in https://github.com/hughperkins/tf-coriander/commit/4c1c54473926b3f8a9365bbbcd653041bb1108e2 , but specifically in https://github.com/hughperkins/coriander/commit/6452be7c5b7b426d88b3fb4c9e7f1d679b5ced6d

I'm hoping that the experience in PhiloxRandom for Mac will help me to fix the Ubuntu version.

hughperkins commented 7 years ago

(well... that's odd... the OpenCL generated on Mac is different from that on Ubuntu. That's quite unexpected ... since the OpenCL generation is my own code, therefore invariant, and it's based on the compilation using llvm-4.0, which is also invariant. Oh .... different standard libraries. Maybe that's why)

Mac sample:

struct class_tensorflow__random__Array {
    int f0[4];
};
struct class_tensorflow__random__Array_0 {
    int f0[2];
};
struct class_tensorflow__random__NormalDistribution {
    char f0;
};
struct class_tensorflow__random__PhiloxRandom {
    struct class_tensorflow__random__Array f0;
    struct class_tensorflow__random__Array_0 f1;
};

Ubuntu sample:

struct class_tensorflow__random__Array {
    int f0[4];
};
struct class_tensorflow__random__Array_0 {
    int f0[2];
};
struct class_tensorflow__random__NormalDistribution {
    char f0;
};
struct class_tensorflow__random__Array_1 {
    float f0[4];
};
struct class_tensorflow__random__PhiloxRandom {
    struct class_tensorflow__random__Array f0;
    struct class_tensorflow__random__Array_0 f1;
};

An entire extra struct. Exact same kernel being compiled...

Oh, I just found out (as I'm writing this) why random_normal is broken, I reckon. There is a shim for sincosf, but the mangled names are different, so on Mac it is correctly shimmed, but not on Ubuntu. That should be easy to fix...

End of OpenCL on Mac:

        /* int v119 = phi v304 */
        v119 = v304;
        goto v3;
    } else {
        goto v10;
    }
v10:;
    goto v11;
v11:;
    return;
}

End of OpenCL on Ubuntu:

        goto v3;
    } else {
        goto v10;
    }
v10:;
    goto v11;
v11:;
    return;
}
void _Z7sincosffPfS_(float v1, float* v2, float* v3, local int *scratch) {

}

That sincos bit has a different mangled name than on Mac, and wasn't detected, but it's easy to tell Coriander the Ubuntu name, and thus detect it.

hughperkins commented 7 years ago

Oh... I've found the difference... in Tensorflow, in the random distributions code, they have the following:

#if defined(__linux__)
  sincosf(v1, f0, f1);
#else
  *f0 = sinf(v1);
  *f1 = cosf(v1);
#endif

Different code, depending on linux or not :-O

https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/core/lib/random/random_distributions.h#L487-L492
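To see why an empty sincosf body would give all zeros, here is a sketch. It assumes that random_normal is built on a Box-Muller style transform (my reading of the linked header, not something confirmed in this thread) and that outputs the stub never writes read back as 0. Both samples are a product with the sine or cosine, so they collapse to zero:

```python
import math


def working_sincos(angle):
    """Return (sin, cos), as a real sincosf would write them."""
    return math.sin(angle), math.cos(angle)


def empty_sincos(angle):
    """Mimic the empty stub: outputs are never written (assumed to read as 0)."""
    return 0.0, 0.0


def box_muller(u1, u2, sincos):
    """Turn two uniforms in [0, 1) into two normally distributed samples."""
    radius = math.sqrt(-2.0 * math.log(max(u1, 1.0e-7)))  # clamp avoids log(0)
    s, c = sincos(2.0 * math.pi * u2)
    return s * radius, c * radius


print(box_muller(0.0408096, 0.208421, working_sincos))
print(box_muller(0.0408096, 0.208421, empty_sincos))  # (0.0, 0.0)
```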

ghost commented 7 years ago

Urgh, so this is an upstream issue, it assumes that Linux's version of available libs (CUDA?) have an extra function defined? Is that something fixable within the coriander framework, just by delegating to sin/cos?

hughperkins commented 7 years ago

It's easy to fix in Coriander, by writing an appropriate shim. There are some shims already: https://github.com/hughperkins/coriander/blob/master/src/shims.cpp

Well... easy - ish, since it will be the first shim that needs to handle pointers, and pointers in opencl need address spaces statically declared. That means, I'll need a shim for every combination of address spaces used by the code. ie, in opencl:

float angle = 0.1f;
global float *globala;
global float *globalb;
float *a;
float *b;

sincos(angle, a, b); // both are private
sincos(angle, globala, globalb); // both are global
sincos(angle, a, globalb); // one private, one global...

local float localb[30];
sincos(angle, a, localb); // one private, one local (shared)...

Since OpenCL 1.2 is based on C99, which has no function overloading, you can only use the name sincos once, for one pair of address spaces, so there'll need to be a unique function name for every pair of address spaces used, e.g. sincos_p_p, sincos_g_g, ... So I'll need to write code to handle that. It's OK, I have various ideas on how to handle this; it's not a challenging issue, but it might take a few man-hours or so.
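One way such per-pair names could be generated is sketched below. Only the names sincos_p_p and sincos_g_g come from the text above; the "l" tag for local, the helper name, and the generated OpenCL body (which calls sin and cos separately) are assumptions for illustration.

```python
from itertools import product

# Short tag -> OpenCL address space qualifier ("" means private, the default).
ADDRESS_SPACES = {"p": "", "g": "global", "l": "local"}


def make_sincos_shim(sin_tag, cos_tag):
    """Return (name, OpenCL source) for one pair of pointer address spaces."""
    name = "sincos_%s_%s" % (sin_tag, cos_tag)
    sin_ptr = (ADDRESS_SPACES[sin_tag] + " float *").strip()
    cos_ptr = (ADDRESS_SPACES[cos_tag] + " float *").strip()
    source = (
        "void %s(float x, %ssin_out, %scos_out) {\n"
        "    *sin_out = sin(x);\n"
        "    *cos_out = cos(x);\n"
        "}\n" % (name, sin_ptr, cos_ptr)
    )
    return name, source


# One shim per ordered pair of address spaces: 3 x 3 = 9 functions.
shims = dict(make_sincos_shim(a, b) for a, b in product(ADDRESS_SPACES, repeat=2))
print(sorted(shims))
print(shims["sincos_p_g"])
```

A call site would then pick the shim whose suffix matches the address spaces of its two pointer arguments.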

hughperkins commented 7 years ago

Fixed in https://github.com/hughperkins/tf-coriander/commit/7d2deb9869194b046d82834d77d1d358dd674532 :-) . Turned out it didn't need any shims, or address space stuff. Yay :-)

ghost commented 7 years ago

This is great! So, aside from Splitting, this is almost fully-functioning Tensorflow on CL now? :o Congratulations!

hughperkins commented 7 years ago

split and conv. but yeah :-) Thanks! :-)

hughperkins commented 7 years ago

(well, "fully-functioning". more like "has enough functionality to run some basic conv nets". It'd still be nice to have eg batchnormalization, and some other things, ideally. )

hughperkins commented 7 years ago

Question: what model(s) do you need to run in priority? I can't do everything at once, so if I know which model(s) you are targeting in priority, I can look at those first.

ghost commented 7 years ago

Honestly, not sure at this stage. I'll probably be working mostly with text-based models, RNNs/LSTMs etcetera, for the moment. I'd love to start playing with convnets but I'm not at that yet. :)

So for me, even being able to correctly initialise nets is :balloon: :tada:

hughperkins commented 7 years ago

Cool :-)

hughperkins commented 7 years ago

Created a new v0.17.3 wheel. Don't suppose... do you mind double-checking that tf.random_normal is working ok for you now? Also, maybe run all the unit tests too, i.e. py.test -v?

ghost commented 7 years ago

Already on it! :)

hughperkins commented 7 years ago

:-) . By the way, I'm not sure if the RPATH stuff is correct on Ubuntu. If it still fails, it might be because it's using the libcocl.so in /usr/local/lib. So, if it doesn't work after installing the wheel:

ghost commented 7 years ago

Looking good for the py.test output. :)

I'm sorry to say I don't know what you're referring to re: "RPATH". Happy to help test, if you give me context and/or suggested shell commands to interrogate this.

Going to start running a few TensorFlow-Examples scripts now, also.


hughperkins commented 7 years ago

> I'm sorry to say I don't know what you're referring to re: "RPATH". Happy to help test, if you give me context and/or suggested shell commands to interrogate this.

As long as the py.test -v is passing, it's all good. Good news that that is passing now :-)

ghost commented 7 years ago

So far, all examples I've tried from TensorFlow-Examples have worked perfectly, after I shim out the GPU-specifying context managers. I'm training convolutional_network.py now and it's learning. I don't know how well it's learning, but the loss is generally going down and the accuracy is generally going up, so I'll mark that down as "success" for now. :)

Thanks again!

hughperkins commented 7 years ago

Cool! :-)