tensorflow / models

Models and examples built with TensorFlow

Running out of memory because of large input data (individual file) size, stuck at epoch 1/X #10534

Closed Alimarashli closed 2 years ago

Alimarashli commented 2 years ago

Hi,

I am training a Conv1D autoencoder, but when I call model.fit() it gets stuck at epoch 1/2 regardless of how small the batch size is. When running on Colab with random data of the same size, it runs out of memory and disconnects (I am not sure whether it is GPU memory or system RAM). The code runs for smaller data, and I also tried it on my personal workstation with an RTX 3090 and 128 GB of memory. I am not sure what I can do to fix the issue: the data size is only 4 MB, while the GPU has 24 GB of memory and the PC has 128 GB, but even with a dataset of 2 samples and a batch size of 2 it still gets stuck.

Code and Colab link: https://colab.research.google.com/drive/1KL3tYnJc8rNn-5eqIPtdQrheogfwic0h?usp=sharing

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, UpSampling1D

def cnn1D(loss='mse', optimizer='adam', activation0='relu', activation='linear',
          x_shape=(531441, 1), pooling1=3, pooling2=3,
          filter1=64, filter2=64, kernel=3):
    input_img = Input(shape=x_shape)  # input layer, (531441, 1)

    # encoder
    x1 = Conv1D(filter1, kernel, activation=activation0, padding='same')(input_img)  # (531441, 64)
    x2 = MaxPooling1D(pooling1, padding='same')(x1)                                  # (177147, 64)
    x2 = Conv1D(filter2, kernel, activation=activation0, padding='same')(x2)         # (177147, 64)
    x3 = MaxPooling1D(pooling2, padding='same')(x2)                                  # (59049, 64)

    # bottleneck
    encoded = Conv1D(filter2, kernel, activation=activation0, padding='same')(x3)    # (59049, 64)

    # decoder
    y = UpSampling1D(pooling2)(encoded)                                              # (177147, 64)
    y = Conv1D(filter1, kernel, activation=activation0, padding='same')(y)           # (177147, 64)
    y = UpSampling1D(pooling1)(y)                                                    # (531441, 64)
    decoded = Conv1D(x_shape[-1], 11, activation=activation, padding='same')(y)      # (531441, 1)

    cnn = Model(input_img, decoded)
    cnn.compile(loss=loss, optimizer=optimizer)
    cnn.summary()
    return cnn

nn2 = cnn1D()

training_dataset = np.random.normal(size=(10, 531441, 1))

nn2.fit(training_dataset, training_dataset, epochs=2, batch_size=2,
        validation_data=(training_dataset, training_dataset))

```

Note: I come from a physics background, so sorry if the answer is obvious and I did something wrong.
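A common first check when a GPU job hangs or dies with out-of-memory symptoms, offered as a generic sketch rather than a confirmed fix for this particular issue: by default TensorFlow reserves nearly all GPU memory at process start, and enabling memory growth makes it allocate on demand instead, which also makes the real footprint visible in nvidia-smi.

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (nearly) all of it
# at startup. This must run before any op touches the GPU, otherwise it
# raises a RuntimeError.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```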

tensorflowbutler commented 2 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

- What is the top-level directory of the model you are using
- Have I written custom code
- OS Platform and Distribution
- TensorFlow installed from
- TensorFlow version
- Bazel version
- CUDA/cuDNN version
- GPU model and memory
- Exact command to reproduce

saberkun commented 2 years ago

Hi, this is a Keras question. Could you ask it in the Keras repo? https://github.com/keras-team/keras

jvishnuvardhan commented 2 years ago

@Alimarashli I was able to run your code without any issues with TF 2.8.2. Here is the gist for our reference. Please try with recent TF versions and let us know if this was resolved for you.

If you want to use more data, then try to follow Better performance with the tf.data API; a short sketch follows below. Thanks!
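A minimal sketch of the pipeline that guide describes, reusing the arrays and the nn2 model from the snippet above (the batch size and the reuse of the training array for validation simply mirror the original call):

```python
import numpy as np
import tensorflow as tf

# Same synthetic data as in the report; casting to float32 halves host
# memory versus the float64 that np.random.normal returns by default.
training_dataset = np.random.normal(size=(10, 531441, 1)).astype("float32")

# Stream (input, target) pairs rather than handing model.fit() one big
# array; prefetch overlaps input preparation with training steps.
ds = (
    tf.data.Dataset.from_tensor_slices((training_dataset, training_dataset))
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)
)

# When fit() receives a tf.data.Dataset, batch_size is not passed.
nn2.fit(ds, epochs=2, validation_data=ds)
```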

Please close the issue if this was resolved for you.

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 2 years ago

Closing as stale. Please reopen if you'd like to work on this further.
