keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0
61.11k stars 19.36k forks

Memory leak during model.fit() #5935

Closed HareeshBahuleyan closed 7 years ago

HareeshBahuleyan commented 7 years ago

Hi,

I am trying to train a simple CNN in keras. During training via model.fit(), the system free memory keeps reducing and eventually it runs out of memory with a "Killed" error. When I train it one epoch at a time, I can clearly see the reduction in free memory after each epoch. Is this normal?

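from keras.models import Sequential
from keras.layers import (Conv2D, Reshape, MaxPooling1D, Flatten,
                          BatchNormalization, Dense, Dropout)

model = Sequential()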
model.add(Conv2D(input_shape=(1,30,300), filters=10, kernel_size=(3, 300), padding='valid', 
                        data_format="channels_first", activation='relu'))
model.add(Reshape((10, 28)))
model.add(MaxPooling1D(pool_size=10))

model.add(Flatten())
model.add(BatchNormalization())
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(5, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid')) 

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, Y_train, batch_size=32, epochs=1, validation_data=(X_test, Y_test), verbose=0)

However, when I set batch_size = 1, I see that it works fine with no memory leaks.

I am running Keras version 2.0.2 with the Theano backend on CPU.

Thanks, Hareesh

Edit: I was not facing this issue with Keras version 1.2.

Kajiyu commented 7 years ago

+1

joelthchao commented 7 years ago

Hi @HareeshBahuleyan, I used your model as the target to monitor memory usage. Script:

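# ENV: Macbook Pro 2012, keras: '2.0.1', theano: '0.9.0.dev-a4126bcced010b4bf0022ebef3e3080878adc480'
import resource
import numpy as np
from keras.callbacks import Callback

class MemoryCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Peak resident set size of this process (bytes on macOS, kilobytes on Linux)
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
# ...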
X_train = np.random.rand(20000, 1, 30, 300)
Y_train = np.random.randint(0, 2, size=20000)
X_test = np.random.rand(20000, 1, 30, 300)
Y_test = np.random.randint(0, 2, size=20000)

model.fit(X_train, Y_train, batch_size=32, epochs=10,
          validation_data=(X_test, Y_test), verbose=0, callbacks=[MemoryCallback()])

The result shows no obvious memory leak. Can you add the callback to monitor memory usage on your own data?

3015524352
3019411456
3023294464
3024400384
3024400384
...
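
A note on comparing these numbers across machines: `ru_maxrss` is reported in bytes on macOS and in kilobytes on Linux, so raw values from the two systems are not directly comparable. The following is a minimal sketch (not part of the original script) that normalizes the reading to bytes; the helper names `peak_rss_bytes` and `peak_rss_mib` are made up for illustration, and the `resource` module is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_bytes():
    """Return this process's peak resident set size in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports it in kilobytes.
    if sys.platform == "darwin":
        return peak
    return peak * 1024


def peak_rss_mib():
    """Return the peak resident set size in MiB."""
    return peak_rss_bytes() / (1024.0 * 1024.0)


if __name__ == "__main__":
    print("peak RSS: %.1f MiB" % peak_rss_mib())
```

Because this is a peak value, it can only stay flat or grow; it will not show memory being released.
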
HareeshBahuleyan commented 7 years ago

Hi @joelthchao I get the following output with batch_size=32

14199908
14199908
15540832
18307928
21075688

With batch_size=1,

14199908
14199908
14199908
14199908
14199908

Moreover, when I monitor memory using free -m on the command line, I see a clear decline in the free memory as the training progresses, for batch sizes larger than 1.

I am running it on a server with AMD Opteron(tm) Processor 4284.

A similar issue has been raised by #5924

joelthchao commented 7 years ago

@HareeshBahuleyan Your memory usage is too low which is quite weird.

  1. What's your environment: keras/theano/Python version? OS?
  2. What's your actual shape of X_train, Y_train, X_test, Y_test?
HareeshBahuleyan commented 7 years ago

@joelthchao Yes, I noticed that too, in spite of having a larger train and test set.

  1. Python 2.7.3 Keras 2.0.2 Theano 0.9.0
uname -a
Linux boxname 3.6.7-4.fc17.x86_64 #1 SMP Tue Nov 20 19:40:01 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/*-release
Fedora release 17 (Beefy Miracle)
...
  2. The shapes are:
    X_train: (79983, 1, 30, 300)
    Y_train: (79983,)
    X_test: (19996, 1, 30, 300)
    Y_test: (19996,)
joelthchao commented 7 years ago

@HareeshBahuleyan I switched to another environment which is almost the same as yours. However, I cannot reproduce your result even when using inputs with the same shape.

ENV: 
python 2.7, theano 0.9.0, linux (ubuntu), keras 2.0.2

keras.json:
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "theano",
    "image_data_format": "channels_first"
}

Result (Run on cpu):
7190724, 7190724, 7190724, 7190724, ...

Does your testing script involve any other operations which are not shown here?

HareeshBahuleyan commented 7 years ago

@joelthchao No I don't have any other operations other than this (just loading the data before this). I tried with your input shapes:

X_train = np.random.rand(20000, 1, 30, 300)
Y_train = np.random.randint(0, 2, size=20000)
X_test = np.random.rand(20000, 1, 30, 300)
Y_test = np.random.randint(0, 2, size=20000)

I get the callback output as 14201068, 14201068, 14201068, 14201068, 14201068

However, I also monitor the memory usage with the command free -m from another screen as the model.fit() progresses. The output is:

[Screenshot: output of free -m captured repeatedly while model.fit() runs]

As you see, the free memory keeps decreasing and the process gets killed eventually (if it is run for too many epochs). The final free -m is after the script is completed and the program is exited. Note that I am not running any other processes.

Also, like I mentioned this free memory remains constant with batch_size=1.

joelthchao commented 7 years ago

@HareeshBahuleyan

  1. free monitors the whole system's free memory, which is a bit dangerous and inaccurate even if you don't run any other processes. In my case, I use top -p PID as an alternative way to monitor memory usage.
  2. To help you, I need to reproduce your problem. What does "just loading the data before this" involve: pickle or numpy load? If possible, please provide a runnable script that reproduces the memory leak.
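
To illustrate the per-process approach from point 1 from inside the training script itself, here is a minimal sketch, assuming a Linux system where /proc is available. It reads the current (not peak) resident set size of one process, which is roughly what top -p PID shows in its RES column. The function name `current_rss_kib` is hypothetical.

```python
import os


def current_rss_kib(pid=None):
    """Return the resident set size (in KiB) of a process, or None if unavailable.

    Reads the VmRSS field from /proc/<pid>/status, which exists on Linux only.
    """
    pid = os.getpid() if pid is None else pid
    try:
        with open("/proc/%d/status" % pid) as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    # The line looks like "VmRSS:    123456 kB"
                    return int(line.split()[1])
    except (IOError, OSError):
        # /proc is missing (e.g. macOS, Windows) or the process is gone
        return None
    return None


if __name__ == "__main__":
    print(current_rss_kib())
```

Calling this at the end of every epoch, for example from a Keras callback, tracks only the training process, so other activity on the machine does not distort the reading the way it can with free -m.
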
HareeshBahuleyan commented 7 years ago

@joelthchao I am loading hickle and pickle objects. The script can be found here.

Monitoring memory with top -p PID also shows the increase in %MEM as training progresses.

I don't think this is an issue with this specific network. I have tried other network architectures on the same machine and I still face the same issue.

Update: I ran the code on another Ubuntu machine and did not face any memory leak issues. This could mean that the issue is present only on certain CPUs as mentioned in #5924

Thanks for the help!

ayuyu18 commented 7 years ago

I got the same issue. Keras 2.0.2 on Windows with Theano backend. The memory consumption keeps increasing and finally the program crashed.

duykienvp commented 7 years ago

I got the same issue on Linux with Keras 2.0.2, but it works fine on macOS Sierra

joelthchao commented 7 years ago

@duykienvp Please help by running the test script to verify the memory leak problem.

ebalp commented 7 years ago

Python 2.7.13 tensorflow (1.0.0) Keras (2.0.2)

MacOs El Capitan 10.11.6

No problem

3083554816 3085885440 3085885440 3085885440 3085885440 3085885440 3085885440 3085885440 3085885440 3085885440

duykienvp commented 7 years ago

---- System 1 (Google Cloud Engine): Ubuntu 16.04 LTS Linux 4.4.0-70-generic Python 2.7.13 :: Anaconda 4.3.1 (64-bit) Theano: 0.9.0 Keras: 2.0.2

3679756 4372232 5065440 5759912 6452212 7145628 ....

---- System 2 (Google Cloud Engine) (same machine with System 1): Ubuntu 16.04 LTS Linux 4.4.0-70-generic Python 2.7.13 :: Anaconda 4.3.1 (64-bit) TensorFlow: 1.0.1 Keras: 2.0.2

3015396 3017412 3018072 3019272 3019452 3019528 3019528 3019528 3019528 3019528

---- System 3 : macOS Sierra 10.12.4 Python 2.7.13 :: Continuum Analytics, Inc. Theano: 0.9.0 Keras: 2.0.2

3025858560 3025862656 3025862656 3025866752 3025891328 3025891328 3025891328 3025891328 3025891328 3025891328
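
To make the difference between these runs easier to see, the per-epoch growth can be computed from the readings above. This is a small sketch using only the numbers posted in this comment; it assumes the two Linux readings are in kilobytes, which is how `ru_maxrss` is reported on Linux. The helper name `per_epoch_growth` is made up for illustration.

```python
def per_epoch_growth(readings):
    """Return the difference between each consecutive pair of readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# System 1: Theano 0.9.0 on Linux (kilobytes)
theano_linux_kb = [3679756, 4372232, 5065440, 5759912, 6452212, 7145628]
# System 2: TensorFlow 1.0.1 on the same machine (kilobytes), first 7 readings
tensorflow_linux_kb = [3015396, 3017412, 3018072, 3019272, 3019452, 3019528, 3019528]

theano_growth = per_epoch_growth(theano_linux_kb)
tensorflow_growth = per_epoch_growth(tensorflow_linux_kb)

print(theano_growth)      # [692476, 693208, 694472, 692300, 693416]
print(tensorflow_growth)  # [2016, 660, 1200, 180, 76, 0]

# Average growth per epoch for the Theano run, converted from KiB to MiB
mean_mib = sum(theano_growth) / float(len(theano_growth)) / 1024.0
print("%.1f MiB per epoch" % mean_mib)  # 676.9 MiB per epoch
```

The Theano run grows by a nearly constant amount every epoch, while the TensorFlow run on the same machine levels off.
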

joelthchao commented 7 years ago

@duykienvp Cool! It seems like the problem comes from Theano 0.9.0 on Linux with some CPUs.

hft7h11 commented 7 years ago

I am getting the same issue, but on Windows. Memory leaks

I am using this example script

https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py

I have reproduced the issue on Keras 2.0. and 2.0.3 (Python = 2.7.11, Theano = 0.9.0, numpy =1.12.1 ) on Windows 10

This issue was not happening before I upgraded from Keras 1.2 to 2.0, so I suspect the cause is one of the dependent C-based libraries, which were also upgraded.

fchollet commented 7 years ago

Are the Theano devs aware of this? If not, please open an issue there.

hft7h11 commented 7 years ago

@fchollet

I have commented on the Theano bug ticket. In the meantime this is a bit of a blocker for using Keras 2.0 with Theano. Reverting to Theano 0.8.2 fixes the memory leak; however, certain layers such as MaxPooling2D seem to depend on Theano 0.9.0 as per https://github.com/fchollet/keras/issues/5785

MrtnStnwk commented 7 years ago

Same problem here; Centos6, Python 2.7.13, Keras 2.0.2, Theano 0.9.0. Running on CPU. Would appreciate any suggestions for a solution.

nouiz commented 7 years ago

There is a PR to fix this in Theano.

The problem only happens if Theano can't link directly to BLAS. One workaround that should also speed up computation is to install a good BLAS library that Theano can reuse.

MrtnStnwk commented 7 years ago

I can confirm the fix mentioned by @nouiz . Properly linking to MKL solved the problem. http://deeplearning.net/software/theano/troubleshooting.html#test-blas

nouiz commented 7 years ago

The leak was fixed in the master branch of Theano. I would recommend linking to a good BLAS; this will give you a speed-up at the same time.

Otherwise, update Theano to the dev version.

nouiz commented 7 years ago

I think this issue can be closed.

hft7h11 commented 7 years ago

@nouiz I think it is worth noting that bleeding-edge Theano is not working with Keras 2.0.3 for LSTM. I just installed bleeding-edge Theano and am getting the following error:

Traceback (most recent call last):
  File "C:\Anaconda2\workspace\machineLearning\textClassifier.py", line 101, in <module>
    model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
  File "C:\Anaconda2\lib\site-packages\keras\models.py", line 463, in add
    output_tensor = layer(self.outputs[0])
  File "C:\Anaconda2\lib\site-packages\keras\layers\recurrent.py", line 257, in __call__
    return super(Recurrent, self).__call__(inputs, **kwargs)
  File "C:\Anaconda2\lib\site-packages\keras\engine\topology.py", line 578, in __call__
    output = self.call(inputs, **kwargs)
  File "C:\Anaconda2\lib\site-packages\keras\layers\recurrent.py", line 294, in call
    constants = self.get_constants(inputs, training=None)
  File "C:\Anaconda2\lib\site-packages\keras\layers\recurrent.py", line 1068, in get_constants
    training=training) for _ in range(4)]
  File "C:\Anaconda2\lib\site-packages\keras\backend\theano_backend.py", line 1361, in in_train_phase
    x = theano.ifelse.ifelse(training, x, alt)
AttributeError: 'module' object has no attribute 'ifelse'

If I just install Theano using pip install Theano==0.9, the code won't break but I still have the memory issue.

itachi4869 commented 7 years ago

@hft7h11 I got the same error that 'module' object has no attribute 'ifelse'. Is there a good method to solve the problem?

andcut commented 7 years ago

Python 3.6.1 Keras 2.0.2 Tensorflow 1.0.1 Ubuntu 16.04

I load data using pickle and had a similar memory leak when using model.fit

MrtnStnwk commented 7 years ago

Link properly against MKL as described in my previous post and it will be solved.

srph25 commented 7 years ago

@hft7h11 @itachi4869 In keras/backend/theano_backend.py, replace x = theano.ifelse.ifelse(training, x, alt) with x = ifelse.ifelse(training, x, alt)

mave5 commented 7 years ago

I encountered the same issue when I upgraded from Keras 1.1.1 to Keras 1.2.x or Keras 2.x. As such, I stick to Keras 1.1.1, and it works fine.

KaiHuangMO commented 7 years ago

To solve the ifelse problem, just add
import theano.ifelse
from theano.ifelse import IfElse, ifelse
at the beginning of theano_backend.py

BTW, I use theano 0.9
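
Some background on why an explicit import helps: in Python, importing a package does not automatically import its submodules, so `package.submodule` raises AttributeError unless something has imported that submodule first. The sketch below demonstrates this with a throwaway package written to a temporary directory. The names `demo_pkg` and `sub` are invented for the demonstration and have nothing to do with Theano's actual layout.

```python
import importlib
import os
import sys
import tempfile


def demo_submodule_import():
    """Return (visible_before, visible_after) for a submodule attribute."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = os.path.join(tmp, "demo_pkg")
        os.makedirs(pkg_dir)
        # An empty __init__.py: the package itself imports nothing
        open(os.path.join(pkg_dir, "__init__.py"), "w").close()
        with open(os.path.join(pkg_dir, "sub.py"), "w") as handle:
            handle.write("def helper():\n    return 'ok'\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module("demo_pkg")
            # Importing the package alone does not load demo_pkg.sub
            visible_before = hasattr(pkg, "sub")
            # An explicit import binds the submodule as an attribute of the package
            importlib.import_module("demo_pkg.sub")
            visible_after = hasattr(pkg, "sub")
            return visible_before, visible_after
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg.sub", None)
            sys.modules.pop("demo_pkg", None)


if __name__ == "__main__":
    print(demo_submodule_import())  # (False, True)
```

This matches the fix suggested above: once `import theano.ifelse` has run, `theano.ifelse` resolves as an attribute.
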

gzapatas commented 6 years ago

Hello everybody,

I'm new to the forum and I also face the same memory leak problem in Ubuntu.

Features: OS Ubuntu 16.04.2 64 bit Keras 2.0.6 Theano 0.9.0

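Solve:

* I used the command line 'sudo apt-get install libblas-dev'; it also installs the BLAS library that Theano depends on, and I didn't have any memory leak again.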

xiaoleihuang commented 6 years ago

I have the same problem. Features: OS Ubuntu 16.04.2 64 bit, Keras 2.0.6, Theano 0.9.0, Python 3

I tested the same code under Python 2 and it does not have this issue; it only occurs with Python 3.

@gzapatas His method solves the leaking problem! Thanks...

Run sudo apt-get install libblas-dev python-dev. Take a look at the official Theano website, which requires a "BLAS" installation: http://deeplearning.net/software/theano/install_ubuntu.html

If you run the program on GPU, other packages are also highly recommended:

  1. libgpuarray
  2. pycuda and skcuda
  3. etc.
RoozbehBandpey commented 6 years ago

I had the same problem, Solved by switching to tensorflow backend.

BIGBALLON commented 6 years ago

I had the same problem (tensorflow backend, python 3.5, keras 2.1.2)

maryam2013 commented 5 years ago

Hi there, I had the same problem (tensorflow backend, python 3.5, keras 2.1.3, Ubuntu 17.10, GPU: Nvidia GTX 1060, RAM: 16 GB, all installed on a 128 GB SSD). It gave me the following error when I tried to load a pre-trained word2vec model:

  File "/home/mary/anaconda3/envs/virenv/lib/python3.5/site-packages/gensim/models/utils_any2vec.py", line 180, in _load_word2vec_format
    result.vectors = zeros((vocab_size, vector_size), dtype=datatype)
MemoryError

How can I solve it? vocab_size = 59655, EMBEDDING_DIM = 300
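
For a sense of scale, the size of a dense matrix with the shape quoted above can be estimated directly. This is a back-of-the-envelope sketch that assumes 4-byte float32 values; the actual dtype and the vocabulary size declared in the pre-trained file's header are not stated in the comment and may differ. The function name `dense_matrix_bytes` is made up for illustration.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=4):
    """Return the memory needed for a dense rows x cols matrix."""
    return rows * cols * bytes_per_value


size = dense_matrix_bytes(59655, 300)  # assumes float32 (4 bytes per value)
print(size)                            # 71586000
print("%.1f MiB" % (size / (1024.0 ** 2)))  # 68.3 MiB
```

If the pre-trained file contains more vectors than the 59655 words mentioned, the allocation in the traceback would be correspondingly larger than this estimate.
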

Tixierae commented 5 years ago

@joelthchao I'm having the same problem with the tensorflow backend (like @BIGBALLON), on Ubuntu 16.04 with keras 2.2.0 and tensorflow-gpu 1.5.0. The RAM gets eaten up after each epoch, so if the model is trained for too many epochs I run out of RAM. I use a custom generator with use_multiprocessing=False and workers=1.

mittermario commented 5 years ago

similar problem on ubuntu 18.04 with tensorflow backend, cpu only - memory is successively eaten by the keras fit function with a simple 2 layer DNN

prakhar21 commented 5 years ago

Hello everybody,

I'm new to the forum and I also face the same memory leak problem in Ubuntu.

Features: OS Ubuntu 16.04.2 64 bit Keras 2.0.6 Theano 0.9.0

Solve:

* I used the command line 'sudo apt-get install libblas-dev'; it also installs the BLAS library that Theano depends on, and I didn't have any memory leak again.

This works.

nicolamarinello commented 5 years ago

@joelthchao I'm having the same problem with the tensorflow backend (like @BIGBALLON), on Ubuntu 16.04 with keras 2.2.0 and tensorflow-gpu 1.5.0. The RAM gets eaten up after each epoch, so if the model is trained for too many epochs I run out of RAM. I use a custom generator with use_multiprocessing=False and workers=1.

I have exactly your problem with my custom generator. TensorFlow 1.12, Keras 2.2.4 & Ubuntu 18.04.

Did everyone solve just typing the following?

sudo apt-get install libblas-dev
maryam2013 commented 5 years ago

Hi everyone, you can use Google Colaboratory to solve the memory leak.

https://colab.research.google.com/notebooks/gpu.ipynb

good luck

hougrammer commented 4 years ago

Was this thread ever closed? I have the same issue. Simple feed forward network with no looping. Ubuntu 18.04, Keras 2.2, Tensorflow 1.13. I've tried most of the solutions on this thread (except going to colab instead of jupyter). Nothing seems to fix the memory leak.

maryam2013 commented 4 years ago

Hi David, why don't you try this command:

'sudo apt-get install libblas-dev'. It also installs the BLAS library that Theano depends on, and I didn't have any memory leak again.

I also recommend using Google Colab, which is free and very similar to Jupyter.

hougrammer commented 4 years ago

Hi David, why don't you try this command: 'sudo apt-get install libblas-dev'. It also installs the BLAS library that Theano depends on, and I didn't have any memory leak again. I also recommend using Google Colab, which is free and very similar to Jupyter.

Thanks Mary. Unfortunately I'm using Tensorflow and not Theano. Also libblas-dev was already installed anyways. Do you have any other suggestions?

maryam2013 commented 4 years ago

Dear David, I usually use Google Colaboratory to solve the problem. Check it out, and do not hesitate to ask me any questions. https://colab.research.google.com/drive/1pGErg2HaWfFVa3jCCFSDRGNjlkDaqzE4#updateTitle=true&folderId=1xFFASkjeHGhgH2EWfUBHY2QzErXu6Dpn

Best Maryam

hougrammer commented 4 years ago

I finally found the issue for me. Tensorflow 1.14 has this memory leak but 1.13 does not.

fPkX6F1nGTX commented 4 years ago

For Ubuntu and TF 2.0 using the Keras backend, I was able to work around (but not solve) this problem. I recommend re-opening the issue; my "solution" potentially doubles the computation time: https://stackoverflow.com/a/61252435/12763497

raqueldias commented 3 years ago

I found the same problem with TF2.2 and tf.keras model.fit()

Justus-M commented 3 years ago

I am also having this issue with TF2.2 and model.fit in 5-fold cross-validation. I explicitly delete all the objects, clear the session, and call the garbage collector at the end of the fit function for every iteration, but I still get a memory buildup, so I am limited despite using 4 P100s in parallel and 120 GB of RAM.

yymarcin commented 3 years ago

I'm also having this issue with TF2.3.1. Using tf.compat.v1.disable_v2_behavior() fixed it.

fPkX6F1nGTX commented 3 years ago

@justinmulli @yymarcin I suggest calling clear_session before defining any model. Also, call gc.collect() before del model once you are done using the model (this works for me, but I do not know why calling it after deleting the model would not work as well).
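
The ordering suggested above could be sketched roughly as follows for a cross-validation loop. This is an illustration under stated assumptions, not a confirmed fix: `run_folds`, `build_model` and `fit_and_score` are hypothetical names, the model-building and fitting steps are passed in as callables so the loop itself stays backend-agnostic, and `tf.keras.backend.clear_session()` is only called if TensorFlow can be imported.

```python
import gc


def default_clear_session():
    """Reset Keras backend state if TensorFlow is available."""
    try:
        from tensorflow.keras import backend as K
    except ImportError:
        return  # TensorFlow not installed; nothing to clear
    K.clear_session()


def run_folds(folds, build_model, fit_and_score, clear_session=default_clear_session):
    """Train one fresh model per fold, cleaning up between folds."""
    scores = []
    for fold in folds:
        clear_session()        # clear backend state before defining the model
        model = build_model()  # hypothetical: returns a compiled model
        scores.append(fit_and_score(model, fold))
        gc.collect()           # order follows the comment above: collect, then delete
        del model
    return scores
```

A call might look like `run_folds(folds, build_model=make_cnn, fit_and_score=train_one_fold)`, where `make_cnn` and `train_one_fold` stand in for user code. Whether this actually prevents the buildup seems to depend on the TensorFlow version, as other comments in this thread report.
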