Upgrade to latest Tensorflow version

no140 commented 7 years ago

Hello, on Ubuntu 16.04, I installed tensorflow-cl with pip3 as per the instructions. Keras is version 2.0.5, and it outputs this error:

/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py in _initialize_variables()
    298     """Utility to initialize uninitialized variables on the fly.
    299     """
--> 300     variables = tf.global_variables()
    301     uninitialized_variables = []
    302     for v in variables:

AttributeError: module 'tensorflow' has no attribute 'global_variables'

I've seen somewhat similar issues online, where the fix was to revert to tf version 0.10 or upgrade to 0.12.

Has anyone seen this, or successfully used tf-cl with Keras (and if so, which version)? A simple test importing tensorflow in Python (no Keras) seems to function okay.
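
For context: `tf.global_variables` was introduced in TensorFlow 0.12 as the new name for `tf.all_variables`, which is why the missing attribute points at a version mismatch. A minimal sketch of a compatibility lookup is below. `resolve_global_variables` is a hypothetical helper (not part of Keras or TensorFlow), and the stand-in module exists only so the snippet runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_global_variables(tf_module):
    """Return a callable listing all variables, whichever name this build exposes."""
    for name in ("global_variables", "all_variables"):
        fn = getattr(tf_module, name, None)
        if fn is not None:
            return fn
    raise AttributeError("no known variable-listing function found")


# Stand-in for an older TensorFlow module that only has all_variables
# (an assumption for illustration, not the real tensorflow package).
old_tf = SimpleNamespace(all_variables=lambda: ["v0", "v1"])
variables = resolve_global_variables(old_tf)()
print(variables)
```

With a real installation one would pass the imported `tensorflow` module instead of `old_tf`. This only papers over one renamed function; other API differences between versions would still surface.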

hughperkins commented 7 years ago

I guess this could go down as 'enhancement', ie 'upgrade tensorflow-cl to 1.0/1.1/latest version'

hughperkins commented 7 years ago

I'll look into this once stuff is working on the current version :-)

hughperkins commented 7 years ago

If anyone wants to try this in the meantime, that could be extremely useful :-)

Mostly, changes to upstream tensorflow comprise:

- (very few) bug fixes. There might not be any actually. There is a tiny change I made in the eventmgr, for the Mac build, but I'm not sure if the bug is certainly there, and not in Coriander somehow :-)
- commenting out #if GOOGLE_CUDA, to enable operations. This is fairly trivial to port :-)
- commenting out all usage of anything other than int32, bool and float on the gpu. In some cases int8 is also allowed. half and double are generally removed. int64 is a bit of an open question... minimize ideally, for now
- some code to load opencl at the start, and run builds using Coriander. The code is almost entirely in tensorflow/stream_executor/cl. There are a few tooling changes

You can view the entire set of differences between the current tf-coriander version, and the upstream master it was based on, by going to https://github.com/hughperkins/tf-coriander/compare/orig-master...master and clicking on 'Files changed'

A more controllable way might be to download both branches, into two different directories, side-by-side, and then use meld or similar to compare them

Some additional background on the #if GOOGLE_CUDA is given in https://github.com/hughperkins/tf-coriander/blob/master/doc/enabling-operations.md
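
The side-by-side comparison can also be scripted. Below is a minimal sketch using Python's standard `filecmp.dircmp`; `differing_files` is a hypothetical helper name, and the two directories are assumed to be local checkouts of the `orig-master` and `master` branches.

```python
import filecmp
import os


def differing_files(left, right):
    """Recursively list relative paths that differ or exist on one side only."""
    result = []

    def walk(cmp, prefix):
        # Note: dircmp compares files shallowly (stat signature first).
        for name in cmp.diff_files:
            result.append(("changed", os.path.join(prefix, name)))
        for name in cmp.left_only:
            result.append(("left_only", os.path.join(prefix, name)))
        for name in cmp.right_only:
            result.append(("right_only", os.path.join(prefix, name)))
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return sorted(result)
```

Calling `differing_files("tf-orig-master", "tf-master")` (directory names are placeholders) gives a list of paths to then open in meld or a similar tool.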

ghost commented 7 years ago

@no140 - I added my experiences with a ~compatible version of Keras to the tf-coriander Wiki: https://github.com/hughperkins/tf-coriander/wiki/Third-Party-Compatibility

So far, most tests pass, though 2 have to be disabled entirely because they segfault in tf-coriander and break the rest. Also, there is a seemingly random tendency for recurrent nets (maybe? Might be unrelated to layer/network type) to "freeze" after behaving normally at first.

hughperkins commented 7 years ago

Awesome, thanks! :-) Excellent info on things that are/aren't working.

ghost commented 7 years ago

Yea, I reckon using Keras' test-cases might be a useful way to tune tf-coriander at this stage, because Keras is the prototypical consumer of Tensorflow. It works remarkably well! Probably because Keras' abstract backend interface avoids a lot of the murky corners of Tensorflow's API which might work more poorly in Coriander.

One issue I'm having is I don't know how to profile the current used/available memory usage on my GPU (ROCm Driver), so I can't investigate whether there's a memory leak causing the freeze on LSTM training.

hughperkins commented 7 years ago

> Yea, I reckon using Keras' test-cases might be a useful way to tune tf-coriander at this stage, because Keras is the prototypical consumer of Tensorflow. It works remarkably well! Probably because Keras' abstract backend interface avoids a lot of the murky corners of Tensorflow's API which might work more poorly in Coriander.

Cool, sounds good :-)

> One issue I'm having is I don't know how to profile the current used/available memory usage on my GPU (ROCm Driver), so I can't investigate whether there's a memory leak causing the freeze on LSTM training.

One option which can give more information, at the expense of sppppaaammmmm is:

clone the latest coriander (note: coriander != tf-coriander. It's the compiler, here: https://github.com/hughperkins/coriander)
start the ccmake build configuration:
```
mkdir build
cd build
ccmake ..
```
press c to load build configuration
turn on option COCL_SPAM, and some or all of the COCL_SPAM_xxx options
press c another couple of times, until g appears, then press g to write out the build configuration
build:
```
make -j 8
```

copy the libcocl.so file into your current python virtualenv. To find out which folder to copy it into, you can run:

python -c 'import tensorflow; import sys; print("\nPlease copy libcocl.so into:\n\n " + sys.modules["tensorflow"].__path__[0] + "/third_party/coriander\n")'

This will write out a ton of spam... but might give some insight into memory allocations etc.
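
The copy step could be scripted along these lines. This is a sketch only: `coriander_dir` and `install_libcocl` are hypothetical helpers, the `third_party/coriander` subfolder is taken from the command above, and keeping a `.orig` backup is an added assumption so the non-spam build can be restored afterwards.

```python
import os
import shutil


def coriander_dir(package_dir):
    """Folder inside the installed tensorflow package that holds libcocl.so."""
    return os.path.join(package_dir, "third_party", "coriander")


def install_libcocl(built_lib, package_dir):
    """Copy a freshly built libcocl.so over the packaged one, keeping a backup."""
    target = os.path.join(coriander_dir(package_dir), os.path.basename(built_lib))
    if os.path.exists(target):
        # Assumption: keep the existing (non-spam) build so it can be restored.
        shutil.copy2(target, target + ".orig")
    shutil.copy2(built_lib, target)
    return target
```

Usage would look like `import tensorflow; install_libcocl("build/libcocl.so", tensorflow.__path__[0])`, where `build/libcocl.so` is a guess at where the build leaves the library.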

ghost commented 7 years ago

I'll try this and will run the offending LSTM test cases to get logs, and will share them here. If Coriander requires CUDA to compile though, I won't be able to do so; I'll get back in a few hours with this hopefully. :)

hughperkins commented 7 years ago

Coriander does not require NVIDIA® CUDA™ Toolkit to compile.

ghost commented 7 years ago

Having great difficulty getting coriander to compile. Initially, just getting it to find the llvm libs took a bit of figuring out, because I don't normally work with C(++). But this error seems a bit trickier:

/home/cathal/localapps/coriander/src/new_instruction_dumper.cpp: In member function ‘void cocl::NewInstructionDumper::dumpCall(cocl::LocalValueInfo*, const std::map<llvm::Function*, llvm::Type*>&)’:
/home/cathal/localapps/coriander/src/new_instruction_dumper.cpp:1231:44: error: too few arguments to function ‘llvm::Function* llvm::CloneFunction(const llvm::Function*, llvm::ValueToValueMapTy&, bool, llvm::ClonedCodeInfo*)’
                                    valueMap);
                                            ^
In file included from /home/cathal/localapps/coriander/src/new_instruction_dumper.cpp:16:0:
/usr/lib/llvm-3.8/include/llvm/Transforms/Utils/Cloning.h:129:11: note: declared here
 Function *CloneFunction(const Function *F, ValueToValueMapTy &VMap,

Any pointers to help solve this? I'm using LLVM 3.8 as installed by Ubuntu.

ghost commented 7 years ago

Ah, nm: I just saw here that it's LLVM4 or nothing: https://github.com/hughperkins/coriander/issues/25#issuecomment-304672757

ghost commented 7 years ago

But, but, but... https://github.com/hughperkins/coriander/blob/master/docker/Dockerfile

? So should this compile in 3.8, or is that Dockerfile outdated?

hughperkins commented 7 years ago

Dockerfile is outdated

hughperkins commented 7 years ago

If you get a moment to update the Dockerfile, that could be neat :-D

hughperkins commented 7 years ago

(or you mean, you're not using the Dockerfile, just citing it as hopeful evidence that Coriander can compile in 3.8? :-D )

hughperkins commented 7 years ago

(updated the README to point out they're outdated, and need updating to llvm-4.0 https://github.com/hughperkins/coriander/commit/4fb1f57f16723f939504c69dca0951763c1e5759 )

ghost commented 7 years ago

The latter! But I'm happy to offer PRs when this builds for me locally. I already have suggestions. :)

ghost commented 7 years ago

OK, finally have this built. I'll temporarily install the spam version over my current one, will run the offending Keras tests, and will collect them for inspection. :)

hughperkins commented 7 years ago

Cool :-) . Sounds good :-)

ghost commented 7 years ago

Hrm hrm hrm.. py.test appears to eat all of the spam output, will need to dig into the PyTest docs a bit.

hughperkins commented 7 years ago

py.test -svx

hughperkins commented 7 years ago

the -s prints stdout. It's horrible though. So, normally I make it runnable directly, with python, eg see https://github.com/hughperkins/tf-coriander/blob/master/tensorflow/stream_executor/cl/test/test_random.py#L64-L68

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print('Please run using py.test')
    else:
        eval('%s((3, 4))' % sys.argv[1])
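
A variant of the same idea that avoids `eval` on a command-line string is sketched below. `run_truncated`, `REGISTRY` and `run_named` are hypothetical names for illustration and are not in the tf-coriander test file.

```python
def run_truncated(shape):
    # Placeholder body (hypothetical); a real test would exercise tensorflow here.
    return "ran truncated with shape %s" % (shape,)


# Explicit registry of runnable tests, keyed by the name given on the command line.
REGISTRY = {"truncated": run_truncated}


def run_named(name, *args):
    """Call a registered test by name instead of eval-ing a string from argv."""
    try:
        fn = REGISTRY[name]
    except KeyError:
        raise SystemExit("unknown test %r; choose from %s" % (name, sorted(REGISTRY)))
    return fn(*args)


result = run_named("truncated", (3, 4))
print(result)
```

In a real script, `sys.argv[1]` would be passed as `name`, so a mistyped test name gives a readable message rather than a `NameError`.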

hughperkins commented 7 years ago

In py.test, you can also use -k to choose one specific test, and you can run a specific test file by path, eg:

py.test -sv tensorflow/stream_executor/cl/test/test_random.py -k truncated
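
The same invocation can be assembled from Python, which is handy when looping over several test files. A minimal sketch follows; `build_test_args` is a hypothetical helper, and it only builds the argument list.

```python
def build_test_args(test_file, keyword=None, show_output=True):
    """Build an argument list equivalent to `py.test -v -s FILE -k KEYWORD`."""
    args = ["-v"]
    if show_output:
        args.append("-s")  # disable output capture so the spam reaches the terminal
    args.append(test_file)
    if keyword:
        args += ["-k", keyword]
    return args


args = build_test_args(
    "tensorflow/stream_executor/cl/test/test_random.py", keyword="truncated"
)
print(args)
```

The resulting list can be handed to `pytest.main(args)`, which runs the selection in-process.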

ghost commented 7 years ago

Keras uses a certain amount of custom test magic that I haven't dug into yet, so the simplest thing is to continue using py.test?

Thanks for the pointers; I've collected the test output for keras-1.1.1/tests/keras/layers/test_recurrent.py, attached (~15Mb uncompressed.. O_o).

recurrent_test_output.txt.tar.gz

hughperkins commented 7 years ago

Ok. Seems these are all passing? I searched for error, case-insensitive, but found nothing?

ghost commented 7 years ago

Those are the tests that pass, but sometimes freeze. So I'm suspicious of memory leaks as a potential cause? But, why it would run OK some times and not others, is mysterious.

I'll open a separate issue in coriander for the segfaulting tests, shall I? I can provide spammy output from py.test and the failing kernel in both cases.

ghost commented 7 years ago

The freeze-midway thing also happened when I was doing an embedding + LSTM classification task; trained for plenty of batches/epochs, and then suddenly stopped and simply hung on a certain batch. I was patient but it never recovered after 10m+, so I killed it..
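
To catch such freezes without watching the terminal, the run can be wrapped in a timeout. The sketch below uses only the standard library. `run_with_timeout` is a hypothetical helper, and the sleeping child process merely stands in for a hung training script. Killing the child may not be enough if the process is stuck inside the GPU driver.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process; return its exit code, or None if it hung."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # the child is stuck; kill it
        proc.wait()  # reap it so no zombie process is left behind
        return None


# Stand-in for a training script that hangs (hypothetical): it just sleeps.
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1)
finished = run_with_timeout([sys.executable, "-c", "print('done')"], timeout_s=30)
print(hung, finished)
```

Logging the time of the last completed batch alongside this would show how far training got before the freeze.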

hughperkins commented 7 years ago

freezing sounds like a loop that never terminates. Probably needs a ton of debugging on my side.

hughperkins commented 7 years ago

(ie, inside a kernel, there's probably something that loops forever, for some reason...)

hughperkins commented 7 years ago

right, I'm going to turn off my email, and try to focus on the split star star issue, instead of procrastinating :-P

ghost commented 7 years ago

> freezing sounds like a loop that never terminates. Probably needs a ton of debugging on my side.

:/ I'm sorry this isn't something I have the know-how to help with directly!

hughperkins commented 7 years ago

That's fair.

hughperkins commented 7 years ago

So, good/bad news is that autoencoder.py hangs for me on Mac

- bad because, well, it doesn't work for me
- good because, maybe if I fix the issue on my Mac, it might fix the issue you are experiencing?

(by the way, when I say 'hangs' I mean, my computer keeps running, but I have to do a full Mac restart, before I can run any tensorflow code at all on that gpu :-P , no way of ctrl-C'ing the process )

vktt commented 7 years ago

Hey guys,

What's the status on this? Also, why do you guys have two separate branches for v1.2.1:

v1.2.1-branch & v1.2.1-cl?

Is one of those branches in somewhat working condition?

hughperkins commented 7 years ago

v1.2.1-branch is the original from upstream. v1.2.1-cl is an attempt to start merging in the opencl stuff.

Neither is in somewhat working condition.

I'm working on other things at the moment. Open opportunity for someone to start looking into this.

QtRoS commented 7 years ago

What is the strategy for merging with upstream? Does every upgrade require a lot of manual steps? Do you have any ideas for automating most of the routine work? I want to help, but in my opinion it could be very time-consuming to redo this version after version, especially now that tensorflow releases appear much more often.

hughperkins commented 7 years ago

well... ideally there should be very little to do, since almost all the original work is in coriander. And if you look at the commits, almost all the changes come down to:

commenting out #if GOOGLE_CUDA, eg https://github.com/hughperkins/tf-coriander/commit/a1813a12b2f30b47b7f0c7b38bc77d72357fe6d2 and
updating to a new version of coriander, eg https://github.com/hughperkins/tf-coriander/commit/1986d6aad93e2b3ad432118291cde4ee7ce99504
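
As a starting point for automating the first kind of change, the guards can at least be located mechanically. The sketch below is an assumption about how one might do that, not part of tf-coriander's tooling. `find_cuda_guards` is a hypothetical helper, and it only reports guards rather than editing them.

```python
import os
import re

# Matches preprocessor lines such as "#if GOOGLE_CUDA" or "#ifdef GOOGLE_CUDA".
GUARD = re.compile(r"^\s*#\s*if(?:def)?\s+.*\bGOOGLE_CUDA\b")


def find_cuda_guards(root, extensions=(".cc", ".h")):
    """Yield (relative_path, line_number, line) for each GOOGLE_CUDA guard under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if GUARD.match(line):
                        yield os.path.relpath(path, root), number, line.strip()
```

Running `list(find_cuda_guards("tensorflow/core/kernels"))` on a fresh upstream checkout (the path is a guess at where most guards live) would give a checklist of places to review when porting to a new version.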

However, there are a few exceptions to this:

for better or for worse, I copied and pasted the cuda setup files, from tensorflow/stream_executor/cuda into tensorflow/stream_executor/cl, https://github.com/hughperkins/tf-coriander/tree/master/tensorflow/stream_executor/cl . I don't know if this is a good idea or not, but it seemed easier at the time...
there were a couple of bugs I came across, that I couldn't figure out how to fix without hacking on the original tensorflow code, eg https://github.com/hughperkins/tf-coriander/commit/e97b99408178d8ae5da4fae4231d5a702a3a7af1 (just like 2-3 things like this, almost all involving the same bit of code actually...)
a bunch of code for creating the python package, linking with coriander etc, needed a bit of hacking around
- these files are not part of the c++ code, but there are a few of them, eg:

QtRoS commented 7 years ago

Sorry, I am out of action because my old Linux kernel has no OpenCL support :(

hughperkins commented 7 years ago

Fair enough

hughperkins / tf-coriander

Upgrade to latest Tensorflow version #28