tensorflow / models

Models and examples built with TensorFlow

SyntaxNet fails with CUDA out of memory #173

Closed orionr closed 8 years ago

orionr commented 8 years ago

SyntaxNet

I'm running on Ubuntu 16.04 with TensorFlow and models both built from the git master branches. Most of the models are working for me, but SyntaxNet fails with a CUDA out-of-memory error even though the card has 8GB total and nothing else is using those resources. Note that I'm on CUDA 8.0 RC here, but I doubt it makes a difference.

Output is as follows

~/git/models/syntaxnet$ echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
...
I syntaxnet/term_frequency_map.cc:101] Loaded 49 terms from syntaxnet/models/parsey_mcparseface/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:783] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:01:00.0)
INFO:tensorflow:Building training network with parameters: feature_sizes: [12 20 20] domain_sizes: [   49    51 64038]
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 6.80G (7304685312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 6.12G (6574216704 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 5.51G (5916794880 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.96G (5325115392 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.46G (4792603648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.02G (4313342976 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.62G (3882008576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.25G (3493807616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
...
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.digit input.hyphen; input.prefix(length="2") input(1).prefix(length="2") input(2).prefix(length="2") input(3).prefix(length="2") input(-1).prefix(length="2") input(-2).prefix(length="2") input(-3).prefix(length="2") input(-4).prefix(length="2"); input.prefix(length="3") input(1).prefix(length="3") input(2).prefix(length="3") input(3).prefix(length="3") input(-1).prefix(length="3") input(-2).prefix(length="3") input(-3).prefix(length="3") input(-4).prefix(length="3"); input.suffix(length="2") input(1).suffix(length="2") input(2).suffix(length="2") input(3).suffix(length="2") input(-1).suffix(length="2") input(-2).suffix(length="2") input(-3).suffix(length="2") input(-4).suffix(length="2"); input.suffix(length="3") input(1).suffix(length="3") input(2).suffix(length="3") input(3).suffix(length="3") input(-1).suffix(length="3") input(-2).suffix(length="3") input(-3).suffix(length="3") input(-4).suffix(length="3"); input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: other;prefix2;prefix3;suffix2;suffix3;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 8;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 64036 terms from syntaxnet/models/parsey_mcparseface/word-map.
INFO:tensorflow:Total processed documents: 0
INFO:tensorflow:Total processed documents: 0
INFO:tensorflow:Read 0 documents

It also seems weird that SyntaxNet requires the tensorflow submodule, since I've actually checked out all of that (including dependencies) and built it in a different location. It would be nice if that weren't needed, but it's not a big deal.

Any thoughts out there? Much appreciated.

s0okiym commented 8 years ago

I hit the same error (cuda_driver.cc:965 CUDA_ERROR_OUT_OF_MEMORY) when running the distributed MNIST code.

orionr commented 8 years ago

Removed "GTX 1080" from the title, since this might be experienced with other cards as well.

calberti commented 8 years ago

@orionr were you able to make any progress on this? I don't have much experience running SyntaxNet on different GPUs, but if you figured out a solution that might be useful to others.

borisstock commented 8 years ago

This issue can be fixed by configuring the tf.Session with the following:

config.gpu_options.allow_growth = True

This seems to fix the problem for me!
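For context, a minimal sketch of where that option goes, using the TF 1.x-era API this thread is about (the session setup around the flag is an assumption, not code from SyntaxNet itself):

```python
import tensorflow as tf

# Build a session config and enable on-demand GPU memory growth,
# so TensorFlow allocates memory incrementally instead of trying
# to grab nearly all of the GPU's memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # run the graph as usual, e.g. sess.run(...)
```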

zheng-xq commented 8 years ago

Does the program continue in spite of the errors? I think the errors shown here are harmless.

TensorFlow has its own BFC allocator. It asks for a large chunk of memory from the CUDA driver and suballocates from it. If it runs out, it doubles the size it asks from CUDA each time. When that fails, it backpedals and asks for smaller amounts, eventually settling on the largest allocation it can successfully get.

This would only be fatal if the allocator ends up with less memory than the model actually needs; in that case the program would normally terminate itself at that point.

If you are really running out of memory, you can try reducing the batch_size. Note that many of these models were developed on GPUs with 12GB of memory; if a GPU with less memory runs out, reducing the batch size could be the way to go.
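The backoff described above is visible in the log at the top of this issue: each failed request is roughly 90% of the previous one (6.80G, 6.12G, 5.51G, ...). A small plain-Python sketch of that retry loop (an illustration of the behavior, not the actual BFC allocator code):

```python
def largest_allocatable(request_bytes, free_bytes, shrink=0.9):
    """Shrink the request by ~10% per retry until it fits,
    returning the size that succeeds (0 if nothing fits)."""
    size = request_bytes
    while size > 0:
        if size <= free_bytes:
            return size
        size = int(size * shrink)  # backpedal, as in the log above
    return 0

# Mirrors the log: a 6.80G request backs off until it fits
# within the memory actually available.
print(largest_allocatable(7_304_685_312, 6_000_000_000))
```

Each "failed to allocate" line is one iteration of this loop; the run only becomes fatal if the size it settles on is too small for the model.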

borisstock commented 8 years ago

In my case the program did not continue. It crashed when it tried to allocate more than the 12 GB of my Titan X. I think there is a bug somewhere: it thinks it ran out of memory and keeps trying to allocate more and more. Somehow the "allow_growth" option fixed it for me (Cuda 7.5, CuDNN 5 on OS X). And I'm pretty sure 12 GB is more than enough for simply running the "demo.sh" script of PMCPF.

orionr commented 8 years ago

Thanks Boris. I don't have access to the machine until next week but I'll try it then.

todtom commented 8 years ago

@orionr Hi, I have built SyntaxNet successfully, but it seems to run on the CPU rather than the GPU. Could you tell me how to make it work on the GPU?

orionr commented 8 years ago

As a note, after updating tensorflow and models git repos and downgrading bazel to 0.2.2b everything works perfectly!

~/git/models/syntaxnet$ echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh
Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct

@todtom - You'll want to run ./configure inside the models/syntaxnet/tensorflow/ directory. Also make sure you have an NVIDIA card with a modern CUDA compute capability. Good luck.
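Roughly, the steps above look like the following (paths per this repo layout; the exact configure prompts and build targets vary by TensorFlow/SyntaxNet version, so treat this as a sketch):

```shell
# From the models checkout: configure the bundled TensorFlow
# submodule for GPU, then rebuild SyntaxNet with bazel.
cd models/syntaxnet/tensorflow
./configure          # answer yes when asked about CUDA support
cd ..
bazel clean          # avoid stale CPU-only build artifacts
bazel test syntaxnet/... util/utf8/...
```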

Shnurre commented 8 years ago

I am having the same error as is described by @orionr in the thread post.

I have Ubuntu 15.10, CUDA 7.5, cuDNN 4.0.7, and I was trying to build SyntaxNet from up-to-date models git repos with bazel 0.2.2b as described in #248 by @David-Ba . I also tried various other versions of bazel and cuDNN 5, but got the same error. It should also be noted that SyntaxNet without GPU support builds correctly on my machine and works as it is supposed to.

It appears that I did not manage to successfully implement the solution proposed here by @borisstock . I added config.gpu_options.allow_growth = True to all the files containing other modifications of config.gpu_options: tensorflow/tensorflow/python/framework/test_util.py, tensorflow/tensorflow/python/kernel_tests/sparse_xent_op_test.py, and tensorflow/tensorflow/python/kernel_tests/sparse_tensor_dense_matmul_op_test.py. It seems, though, that I missed something essential.

Could @orionr , @borisstock , or anyone else who managed to solve this problem please specify where exactly config.gpu_options.allow_growth = True should be added?

orionr commented 8 years ago

I actually didn't need to use allow_growth = True after updating all of the git repos and downgrading bazel. @Shnurre - what GPU are you using? Also make sure you do a bazel clean before the rebuild. I even removed my _python_build directory inside tensorflow and recreated it each time just to be safe.

Shnurre commented 8 years ago

@orionr , thank you for your quick response. I have a GTX 970, though I don't think this error is card-specific.

Yes, I always perform bazel clean before rebuilding. I also tried removing and downloading a fresh models repo, manually removing .cache/bazel, and completely reinstalling several versions of bazel, but nothing has worked for me so far.

Shnurre commented 8 years ago

@borisstock , @calberti , @orionr I am not sure if you are the right people to ask (if you are not, I am sorry for disturbing you), but should I reopen this issue or maybe open a new one? I am having exactly the same problems as described here by @orionr , but changing the bazel version and updating the repos didn't help me. I am still hoping that @borisstock or anyone else who successfully implemented his solution will be able to clarify it.

utkrist commented 7 years ago

In models/syntaxnet/syntaxnet/parser_eval.py, I made this change and it worked:

gpu_opt = tf.GPUOptions(allow_growth=True)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_opt)) as sess:
    Eval(sess)

irfan-zoefit commented 7 years ago

I'm having the same issue and don't know where to set config.gpu_options.allow_growth = True. Would you specify the file?

zerodarkzone commented 6 years ago

Hi, I keep getting the CUDA_OUT_OF_MEMORY error. I have already tried the fix proposed here, but it doesn't work. I compiled with bazel 0.5.4.