RNN fails on 1.0.8 but runs fine on 1.0.7

fmfn commented 8 years ago

I get the following error when training a model on a CPU after upgrading to Keras 1.0.8. Interestingly, it works if I downgrade to 1.0.7. Even more surprising, it also works with 1.0.8 on an Ubuntu GPU setup.

Traceback (most recent call last):
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
MemoryError: alloc failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./pipeline/train.py", line 162, in <module>
    validation_data=[Xva, Yva],
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/models.py", line 620, in fit
    sample_weight=sample_weight)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/training.py", line 1104, in fit
    callback_metrics=callback_metrics)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/training.py", line 822, in _fit_loop
    outs = f(ins_batch)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 672, in __call__
    return self.function(*inputs)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
MemoryError: alloc failed
Apply node that caused the error: AllocEmpty{dtype='float32'}(TensorConstant{11}, Elemwise{Composite{Switch(EQ(i0, i1), ((i2 * i0) // (i3 * i0)), i0)}}.0, TensorConstant{25})
Toposort index: 201
Inputs types: [TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), ()]
Inputs strides: [(), (), ()]
Inputs values: [array(11), array(-1334), array(25)]
Outputs clients: [[IncSubtensor{InplaceSet;:int64:}(AllocEmpty{dtype='float32'}.0, Rebroadcast{0}.0, Constant{1})]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "./pipeline/train.py", line 90, in model_loader
    model, encoder = _get_model()
  File "./pipeline/train.py", line 65, in _get_model
    name='decoder_rnn_0')
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/models.py", line 308, in add
    output_tensor = layer(self.outputs[0])
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/topology.py", line 515, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/topology.py", line 573, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/topology.py", line 150, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/layers/recurrent.py", line 213, in call
    input_length=input_shape[1])
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 842, in rnn
    go_backwards=go_backwards)

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
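
A note on reading the trace above: the failing node is AllocEmpty, and its "Inputs values" are 11, -1334 and 25. Assuming those three scalars are the requested output dimensions (an inference from the node signature, not something confirmed in this thread), the middle dimension is negative, which would make the allocation fail regardless of how much memory is free. The helper below is a hypothetical, stdlib-only sketch (the name negative_alloc_dims is not part of Theano or Keras) that scans a pasted Theano error for "Inputs values" lines containing a negative entry.

```python
import re


def negative_alloc_dims(trace_text):
    """Return the integer lists from 'Inputs values' lines that contain a negative entry.

    Hypothetical helper for reading pasted Theano error text; it only parses
    scalar integer entries printed as array(N).
    """
    suspicious = []
    for line in trace_text.splitlines():
        line = line.strip()
        if not line.startswith("Inputs values:"):
            continue
        # Collect every scalar printed as array(<int>), keeping the sign.
        values = [int(v) for v in re.findall(r"array\((-?\d+)\)", line)]
        if any(v < 0 for v in values):
            suspicious.append(values)
    return suspicious


# The relevant line copied from the traceback above.
trace = "Inputs values: [array(11), array(-1334), array(25)]"
print(negative_alloc_dims(trace))
```

If the list comes back non-empty, the shape computation feeding the node is probably the place to look, rather than available memory.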

The model is:

encoder = Sequential(name="encoder")
encoder.add(
    Masking(
        input_shape=(config.Model.maxlen, config.Model.max_features),
        mask_value=0,
    )
)
encoder.add(
    LSTM(
        output_dim=config.Model.lstm_size,
        return_sequences=False,
        go_backwards=False,
        name='encode_rnn_0',
    )
)

model = Sequential(name='char-auto-encoder')
model.add(encoder)

# Context
model.add(
    RepeatVector(n=config.Model.maxlen,
                 name='context_vector_repeat')
)

model.add(
    LSTM(
        output_dim=config.Model.lstm_size,
        return_sequences=True,
        go_backwards=False,
        name='decoder_rnn_0',
    )
)

model.add(
    TimeDistributed(
        Dense(
            output_dim=config.Model.max_features,
            activation='softmax',
            name='distribution_over_tokens'
        ),
    )
)

tanbur commented 8 years ago

I'm getting similar errors running a smaller, simpler model with version 1.0.7 on a CPU (Linux). LSTMs with fewer than 15 output nodes seem to train fine. 16+ nodes give an alloc error similar to yours, but running with theano.config.mode='NanGuardMode' reports that some Infs are popping up.

EDIT: Keras 1.0.7

fmfn commented 8 years ago

I noticed something similar. Training the same model as above with 1.0.8 on a GPU failed with maxlen = 140 but worked with maxlen = 120.

fchollet commented 8 years ago

Do you have a reproducible code snippet? I haven't noticed anything weird with large RNNs on CPU.

fmfn commented 8 years ago

The following runs on 1.0.7 and fails (with the above traceback) on 1.0.8

import numpy as np

from keras.layers import LSTM, Dense, RepeatVector
from keras.layers import TimeDistributed, Masking#, Bidirectional
from keras.models import Sequential
from keras.optimizers import Nadam
from keras.objectives import categorical_crossentropy
from keras import backend as K

def sparse_char_softmax(y_true, y_pred):
    steps_loss = [
        categorical_crossentropy(y_true[:, i, :], y_pred[:, i, :])
        for i in range(10)
    ]
    return K.sum(steps_loss) / \
        (10 + 256)

def model_loader(maxlen, max_features, lstm_size):
    encoder = Sequential(name="encoder")
    encoder.add(
        Masking(
            input_shape=(maxlen, max_features),
            mask_value=0,
        )
    )
    encoder.add(
             LSTM(
                 output_dim=lstm_size,
                 return_sequences=False,
                 go_backwards=False,
                 name='encoder_rnn_0'
             )
    )

    model = Sequential(name='char-auto-encoder')
    model.add(encoder)

    # Context
    model.add(
        RepeatVector(n=maxlen,
                     name='context_vector_repeat')
    )

    model.add(
            LSTM(
                output_dim=lstm_size,
                return_sequences=True,
                go_backwards=False,
                name='decoder_rnn_0'
            )
    )

    model.add(
        TimeDistributed(
            Dense(
                output_dim=max_features,
                activation='softmax',
                name='distribution_over_tokens'
            ),
        )
    )

    return model, encoder

if __name__ == "__main__":

    X = np.zeros((256, 10, 80), dtype=bool)
    for row in X:
        for col in row:
            col[np.random.randint(0, 80)] += 1

    model, encoder = model_loader(10, 80, 25)
    model.summary()
    encoder.compile('sgd', 'mse')
    model.compile(
        loss=sparse_char_softmax,
        optimizer=Nadam(lr=0.001, clipnorm=2.0),
    )

    h = model.fit(
        X, X,
        nb_epoch=1,
        verbose=1,
        batch_size=256,
        validation_data=[X, X],
    )

fchollet commented 8 years ago

Your loss function should not work (K.sum must be called on a tensor, not a list). If I replace it with MSE your script runs fine with both Theano and TF.

fmfn commented 8 years ago

Thanks! I figured it had to be the loss function, despite it working ok in prior releases.

fchollet commented 8 years ago

Use sum instead (i.e. Python's built-in sum function).
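
To illustrate the suggestion: in the snippet above this would mean replacing K.sum(steps_loss) with sum(steps_loss). Python's built-in sum starts from 0 and applies + to each element in turn, so it can reduce a list of per-step losses as long as each element supports addition. The sketch below uses a hypothetical stand-in class, Sym, in place of a real backend tensor (it is not a Keras or Theano type) so that it runs without either library and shows the expression that sum builds.

```python
class Sym:
    """Hypothetical stand-in for a symbolic tensor; it only records its expression."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # Called for Sym + something.
        return Sym("({} + {})".format(self.expr, _as_text(other)))

    def __radd__(self, other):
        # Called for something + Sym, e.g. the initial 0 used by sum().
        return Sym("({} + {})".format(_as_text(other), self.expr))


def _as_text(value):
    """Render a Sym by its expression and anything else with str()."""
    return value.expr if isinstance(value, Sym) else str(value)


# One placeholder loss per timestep, mirroring the steps_loss list in the snippet.
steps_loss = [Sym("loss_{}".format(i)) for i in range(3)]

# The built-in sum folds the list with +, yielding a single symbolic value.
total = sum(steps_loss)
print(total.expr)
```

The result is a single object of the same kind as the list elements, which is what a loss function is expected to return, rather than a Python list.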

fmfn commented 8 years ago

Below is what I get when compiling with MSE loss. By the way, the snippet above works (with the sparse_char_softmax loss and Keras 1.0.8) on an Ubuntu 14.04, CUDA 8 RC, Python 2.7, GTX 1080 setup.

Epoch 1/1
Traceback (most recent call last):
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
MemoryError: alloc failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "keras_bug.py", line 87, in <module>
    validation_data=[X, X],
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/models.py", line 620, in fit
    sample_weight=sample_weight)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/training.py", line 1104, in fit
    callback_metrics=callback_metrics)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/training.py", line 822, in _fit_loop
    outs = f(ins_batch)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 672, in __call__
    return self.function(*inputs)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/compile/function_module.py", line 871, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/gof/link.py", line 314, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/theano/compile/function_module.py", line 859, in __call__
    outputs = self.fn()
MemoryError: alloc failed
Apply node that caused the error: AllocEmpty{dtype='float32'}(TensorConstant{11}, Elemwise{Composite{Switch(EQ(i0, i1), ((i2 * i0) // (i3 * i0)), i0)}}.0, TensorConstant{25})
Toposort index: 190
Inputs types: [TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), ()]
Inputs strides: [(), (), ()]
Inputs values: [array(11), array(-10667), array(25)]
Outputs clients: [[IncSubtensor{InplaceSet;:int64:}(AllocEmpty{dtype='float32'}.0, Rebroadcast{0}.0, Constant{1})]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "keras_bug.py", line 74, in <module>
    model, encoder = model_loader(10, 80, 25)
  File "keras_bug.py", line 50, in model_loader
    name='decoder_rnn_0'
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/models.py", line 308, in add
    output_tensor = layer(self.outputs[0])
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/topology.py", line 515, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/topology.py", line 573, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/engine/topology.py", line 150, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/layers/recurrent.py", line 213, in call
    input_length=input_shape[1])
  File "/Users/<me>/venvs3/general/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 842, in rnn
    go_backwards=go_backwards)

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

fchollet commented 8 years ago

Did you try updating Keras to the master version? In theory, Theano RNNs should be strictly identical between 1.0.7 and master.

fmfn commented 8 years ago

Updated with the master version and it worked. Thanks!

RNN fails on 1.0.8 but runs fine on 1.0.7 #3637