hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0

Upgrade to latest Tensorflow version #28

Closed no140 closed 7 years ago

no140 commented 7 years ago

Hello, on Ubuntu 16.04, I installed tensorflow-cl via pip3 as per the instructions. Keras is version 2.0.5. The output error is:

/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py in _initialize_variables()
    298     """Utility to initialize uninitialized variables on the fly.
    299     """
--> 300     variables = tf.global_variables()
    301     uninitialized_variables = []
    302     for v in variables:

AttributeError: module 'tensorflow' has no attribute 'global_variables'

I've seen somewhat similar issues online, where the fix was to revert to tf version 0.10 or upgrade to 0.12.

Has anyone seen this, or successfully used tf-cl in keras (which version?)? A simple test importing tensorflow in python (no keras) seems to function okay.
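For context, `tf.global_variables` was introduced in TensorFlow 0.12; earlier releases exposed the same thing as `tf.all_variables`. Below is a minimal sketch of a compatibility lookup, assuming that rename is the only mismatch. The helper name `get_global_variables` is hypothetical and is not part of Keras or tensorflow-cl, and the fake module exists only so the sketch runs without TensorFlow installed.

```python
import types


def get_global_variables(tf_module):
    """Return the graph's variables using whichever API this TensorFlow build has."""
    # TensorFlow >= 0.12 exposes tf.global_variables().
    getter = getattr(tf_module, "global_variables", None)
    if getter is None:
        # Older builds (0.11 and earlier) only have tf.all_variables().
        getter = getattr(tf_module, "all_variables", None)
    if getter is None:
        raise AttributeError("no global_variables or all_variables in this module")
    return getter()


# Stand-in for an old TensorFlow module (hypothetical, for illustration only).
fake_old_tf = types.SimpleNamespace(all_variables=lambda: ["w", "b"])
print(get_global_variables(fake_old_tf))
```

A shim like this only papers over one renamed function; a Keras release that expects the 1.x API is likely to hit other missing names too, so pinning an older Keras may be the more practical route.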

hughperkins commented 7 years ago

I guess this could go down as 'enhancement', ie 'upgrade tensorflow-cl to 1.0/1.1/latest version'

hughperkins commented 7 years ago

I'll look into this once stuff is working on the current version :-)

hughperkins commented 7 years ago

If anyone wants to try this in the meantime, that could be extremely useful :-)

Mostly, changes to upstream tensorflow comprise:

You can view the entire set of differences between the current tf-coriander version and the upstream master it was based on by going to https://github.com/hughperkins/tf-coriander/compare/orig-master...master and clicking on 'Files changed'

Some additional background on the #if GOOGLE_CUDA is given in https://github.com/hughperkins/tf-coriander/blob/master/doc/enabling-operations.md

ghost commented 7 years ago

@no140 - I added my experiences with a ~compatible version of Keras to the tf-coriander Wiki: https://github.com/hughperkins/tf-coriander/wiki/Third-Party-Compatibility

So far, most tests pass, though 2 have to be disabled entirely because they segfault in tf-coriander and break the rest. Also, there is a seemingly random tendency for recurrent nets (maybe? Might be unrelated to layer/network type) to "freeze" after behaving normally at first.

hughperkins commented 7 years ago

Awesome, thanks! :-) . Excellent info on things that are/aren't working.

ghost commented 7 years ago

Yea, I reckon using Keras' test-cases might be a useful way to tune tf-coriander at this stage, because Keras is the prototypical consumer of Tensorflow. It works remarkably well! Probably because Keras' abstract backend interface avoids a lot of the murky corners of Tensorflow's API which might work more poorly in Coriander.

One issue I'm having is I don't know how to profile the current used/available memory usage on my GPU (ROCm Driver), so I can't investigate whether there's a memory leak causing the freeze on LSTM training.

hughperkins commented 7 years ago

> Yea, I reckon using Keras' test-cases might be a useful way to tune tf-coriander at this stage, because Keras is the prototypical consumer of Tensorflow. It works remarkably well! Probably because Keras' abstract backend interface avoids a lot of the murky corners of Tensorflow's API which might work more poorly in Coriander.

Cool, sounds good :-)

> One issue I'm having is I don't know how to profile the current used/available memory usage on my GPU (ROCm Driver), so I can't investigate whether there's a memory leak causing the freeze on LSTM training.

One option which can give more information, at the expense of sppppaaammmmm is:

This will write out a ton of spam... but might give some insight into memory allocations etc.

ghost commented 7 years ago

I'll try this and will run the offending LSTM test cases to get logs, and will share them here. If Coriander requires CUDA to compile though, I won't be able to do so; I'll get back in a few hours with this hopefully. :)

hughperkins commented 7 years ago

Coriander does not require NVIDIA® CUDA™ Toolkit to compile.

ghost commented 7 years ago

Having great difficulty getting coriander to compile. Initially, just getting it to find the llvm libs took a bit of figuring out, because I don't normally work with C(++). But this error seems a bit trickier:

/home/cathal/localapps/coriander/src/new_instruction_dumper.cpp: In member function ‘void cocl::NewInstructionDumper::dumpCall(cocl::LocalValueInfo*, const std::map<llvm::Function*, llvm::Type*>&)’:
/home/cathal/localapps/coriander/src/new_instruction_dumper.cpp:1231:44: error: too few arguments to function ‘llvm::Function* llvm::CloneFunction(const llvm::Function*, llvm::ValueToValueMapTy&, bool, llvm::ClonedCodeInfo*)’
                                    valueMap);
                                            ^
In file included from /home/cathal/localapps/coriander/src/new_instruction_dumper.cpp:16:0:
/usr/lib/llvm-3.8/include/llvm/Transforms/Utils/Cloning.h:129:11: note: declared here
 Function *CloneFunction(const Function *F, ValueToValueMapTy &VMap,

Any pointers to help solve this? I'm using LLVM 3.8 as installed by Ubuntu.

ghost commented 7 years ago

Ah, nm: I just saw here that it's LLVM4 or nothing: https://github.com/hughperkins/coriander/issues/25#issuecomment-304672757
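A version mismatch like this can be caught before starting a build. The following is a rough sketch, assuming (per the linked comment) that LLVM 4.0 is the only supported version and that `llvm-config` is on the PATH. The function names are hypothetical and not part of coriander's build scripts.

```python
import re
import subprocess


def parse_llvm_version(text):
    """Extract (major, minor) from output such as '3.8.0' or '4.0.1'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("could not find a version number in %r" % text)
    return int(match.group(1)), int(match.group(2))


def is_supported_llvm(text, required=(4, 0)):
    """True when the reported major.minor equals the assumed required version."""
    return parse_llvm_version(text) == required


def installed_llvm_version(llvm_config="llvm-config"):
    """Ask llvm-config for its version string (llvm-config must be on PATH)."""
    return subprocess.check_output([llvm_config, "--version"]).decode().strip()


print(is_supported_llvm("3.8.0"))
print(is_supported_llvm("4.0.0"))
```

On a machine with LLVM installed, `is_supported_llvm(installed_llvm_version())` would give the same check against the real toolchain.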

ghost commented 7 years ago

But, but, but... https://github.com/hughperkins/coriander/blob/master/docker/Dockerfile

? So should this compile in 3.8, or is that Dockerfile outdated?

hughperkins commented 7 years ago

Dockerfile is outdated

On 5 June 2017 10:46:22 BST, Cathal Garvey notifications@github.com wrote:

> But, but, but... https://github.com/hughperkins/coriander/blob/master/docker/Dockerfile
>
> ? So should this compile in 3.8, or is that Dockerfile outdated?


hughperkins commented 7 years ago

If you get a moment to update the Dockerfile, that could be neat :-D

hughperkins commented 7 years ago

(or you mean, you're not using the Dockerfile, just citing it as hopeful evidence that Coriander can compile in 3.8? :-D )

hughperkins commented 7 years ago

(updated the README to point out they're outdated, and need updating to llvm-4.0 https://github.com/hughperkins/coriander/commit/4fb1f57f16723f939504c69dca0951763c1e5759 )

ghost commented 7 years ago

The latter! But I'm happy to offer PRs when this builds for me locally. I already have suggestions. :)

ghost commented 7 years ago

OK, finally have this built. I'll temporarily install the spam version over my current one, will run the offending Keras tests, and will collect them for inspection. :)

hughperkins commented 7 years ago

Cool :-) . Sounds good :-)

ghost commented 7 years ago

Hrm hrm hrm.. py.test appears to eat all of the spam output, will need to dig into the PyTest docs a bit.

hughperkins commented 7 years ago

py.test -svx

hughperkins commented 7 years ago

the -s prints stdout. It's horrible though. So, normally I make it runnable directly, with python, eg see https://github.com/hughperkins/tf-coriander/blob/master/tensorflow/stream_executor/cl/test/test_random.py#L64-L68

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print('Please run using py.test')
    else:
        eval('%s((3, 4))' % sys.argv[1])

hughperkins commented 7 years ago

In py.test, you can also use -k to choose one specific test, and you can run a specific test file by path, e.g.:

py.test -sv tensorflow/stream_executor/cl/test/test_random.py -k truncated
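The same invocation can also be driven from Python through `pytest.main`, which takes the command-line arguments as a list. Below is a small sketch; the helper names `build_pytest_args` and `run` are hypothetical, and `pytest` is imported lazily so the argument builder can be tried without it installed.

```python
def build_pytest_args(test_path, keyword=None, show_output=True, stop_on_first_failure=False):
    """Assemble a py.test argument list equivalent to the command-line flags."""
    args = ["-v"]
    if show_output:
        args.append("-s")  # same as --capture=no: do not swallow stdout/stderr
    if stop_on_first_failure:
        args.append("-x")
    args.append(test_path)
    if keyword:
        args.extend(["-k", keyword])
    return args


def run(test_path, **options):
    """Run py.test in-process and return its exit code."""
    import pytest  # imported lazily so building the arguments works without pytest
    return pytest.main(build_pytest_args(test_path, **options))


print(build_pytest_args("tensorflow/stream_executor/cl/test/test_random.py", keyword="truncated"))
```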

ghost commented 7 years ago

Keras uses a certain amount of custom test magic that I haven't dug into yet, so the simplest thing is to continue using py.test?

Thanks for the pointers; I've collected the test output for keras-1.1.1/tests/keras/layers/test_recurrent.py, attached (~15Mb uncompressed.. O_o).

recurrent_test_output.txt.tar.gz

hughperkins commented 7 years ago

Ok. Seems these are all passing? I searched for error, case-insensitive, but found nothing?
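For a log of this size, the case-insensitive search can be scripted. Here is a minimal sketch; the function name `find_matches` and the sample lines are made up for illustration and are not taken from the attached log.

```python
import re


def find_matches(lines, pattern="error"):
    """Return (line_number, text) pairs whose text matches pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(n, line.rstrip("\n")) for n, line in enumerate(lines, start=1) if regex.search(line)]


# Hypothetical sample standing in for real py.test output.
sample = ["running test_lstm\n", "OpenCL ERROR: hypothetical failure\n", "PASSED\n"]
print(find_matches(sample))
```

Passing an open file object as `lines` works the same way, since files iterate line by line. Note that a run which simply hangs will not necessarily leave any matching line behind.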

ghost commented 7 years ago

Those are the tests that pass, but sometimes freeze. So I'm suspicious of memory leaks as a potential cause? But, why it would run OK some times and not others, is mysterious.

I'll open a separate issue in coriander for the segfaulting tests, shall I? I can provide spammy output from py.test and the failing kernel in both cases.

ghost commented 7 years ago

The freeze-midway thing also happened when I was doing an embedding + LSTM classification task; trained for plenty of batches/epochs, and then suddenly stopped and simply hung on a certain batch. I was patient but it never recovered after 10m+, so I killed it..
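Instead of waiting by hand, a hang like this can be detected by running the job in a child process with a time limit. This is a sketch under stated assumptions: `run_with_timeout` is a hypothetical helper, and the sleeping child below merely stands in for a training script. A process stuck inside a GPU driver call may not actually die when killed, so this detects the hang rather than guaranteeing recovery.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run command; return its exit code, or None if it exceeded the time limit."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return None
    return completed.returncode


# Demonstration with a child process that sleeps longer than the allowed time.
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_seconds=0.5)
print("timed out" if hung is None else "exit code %d" % hung)
```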

hughperkins commented 7 years ago

freezing sounds like a loop that never terminates. Probably needs a ton of debugging on my side.

hughperkins commented 7 years ago

(ie, inside a kernel, there's probably something that loops forever, for some reason...)

hughperkins commented 7 years ago

right, I'm going to turn off my email, and try to focus on the split star star issue, instead of procrastinating :-P

ghost commented 7 years ago

> freezing sounds like a loop that never terminates. Probably needs a ton of debugging on my side.

:/ I'm sorry this isn't something I have the know-how to help with directly!

hughperkins commented 7 years ago

That's fair.

hughperkins commented 7 years ago

So, good/bad news is that autoencoder.py hangs for me on Mac

(by the way, when I say 'hangs' I mean, my computer keeps running, but I have to do a full Mac restart, before I can run any tensorflow code at all on that gpu :-P , no way of ctrl-C'ing the process )

vktt commented 7 years ago

Hey guys,

What's the status on this? Also, why do you guys have two separate branches for v1.2.1:

v1.2.1-branch & v1.2.1-cl?

Is one of those branches in somewhat working condition?

hughperkins commented 7 years ago

v1.2.1-branch is the original from upstream. v1.2.1-cl is an attempt to start merging in the opencl stuff.

Neither is in somewhat working condition.

I'm working on other things at the moment. Open opportunity for someone to start looking into this.

QtRoS commented 7 years ago

What is the strategy for merging with upstream? Do you have to do a lot of manual work every time? Do you have any ideas on how to automate most of the routine work? I want to help, but in my opinion it can be very time-consuming if you do it from time to time, version by version, especially now that tensorflow releases appear much more often.

hughperkins commented 7 years ago

well... ideally there should be very little to do, since almost all the original work is in coriander. And if you look at the commits, almost all the changes come down to:

However, there are a few exceptions to this:

QtRoS commented 7 years ago

Sorry, I am out of business because of an old Linux kernel without OpenCL support :(

hughperkins commented 7 years ago

Fair enough