keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Efficient way for one hot representation in Keras #1002

Closed seven7e closed 7 years ago

seven7e commented 8 years ago

I am a new user of RNNs and Keras for language modeling. I found that Keras accepts a 3D tensor as the input to an RNN, which means word sequences have to be encoded into sequences of word vectors. The simplest encoding is one-hot, but that wastes a lot of memory because most elements in the 3D tensor are zero.

I only found an Embedding layer, which accepts an index-represented word sequence (no need for one-hot encoding and thus memory efficient), but such a layer generates a DENSE word vector and then feeds it to the recurrent layer, which forces me to use a dense representation instead of one-hot encoding.

Is there any efficient way to do one-hot encoding? Or did I miss something?

Besides, I got a "g++ not detected" error when the data set grows large, but the same code works for a small data set. I asked a question about it on SO: http://stackoverflow.com/questions/33671453/g-not-detected-while-data-set-goes-larger-is-there-any-limit-to-matrix-size I thought a larger data set might be supported if there were a memory-saving way to do the one-hot representation.
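To make the memory argument concrete, here is a small numpy comparison (the sizes are just an example, not from my data):

import numpy as np

# Illustrative sizes: batch of 32 sequences, 50 timesteps, 10,000-word vocabulary.
batch, timesteps, vocab = 32, 50, 10000

# One-hot 3D tensor of shape (batch, timesteps, vocab) in float32.
one_hot = np.zeros((batch, timesteps, vocab), dtype='float32')
print(one_hot.nbytes / 1e6)   # ~64 MB, almost entirely zeros

# Integer index matrix of shape (batch, timesteps), as an Embedding layer consumes.
indices = np.zeros((batch, timesteps), dtype='int32')
print(indices.nbytes / 1e3)   # ~6.4 KB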

rex-yue-wu commented 8 years ago

I encountered the same issue as nanoix9. Although it is still possible for me to train a model using techniques like memory mapping, I would prefer to load my entire dataset into memory. Is it possible to use Theano's sparse matrix representation?

EderSantana commented 8 years ago

Store your data as HDF5 and try this class: https://github.com/fchollet/keras/blob/master/keras/utils/io_utils.py#L7-L52
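Something like this (the file name, dataset names, and the X, y, n_train, and model variables are placeholders; the exact HDF5Matrix signature may differ between versions):

import h5py
from keras.utils.io_utils import HDF5Matrix

# Write the arrays once with h5py...
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('X_train', data=X)   # e.g. int32 index sequences
    f.create_dataset('y_train', data=y)

# ...then let Keras read slices lazily instead of holding everything in RAM.
n_train = 100000  # placeholder
X_train = HDF5Matrix('data.h5', 'X_train', 0, n_train)
y_train = HDF5Matrix('data.h5', 'y_train', 0, n_train)
model.fit(X_train, y_train, batch_size=128)  # behaves like an array, read batch by batch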

RaffEdwardBAH commented 8 years ago

@EderSantana Will using that class (could some documentation be added?) use sparse matrices on the GPU? For some of the problems I'm working on I have more than enough RAM on my desktop; the GPU is the limitation.

placebokkk commented 8 years ago

@nanoix9 That's true. You need to write your own class (a RecurrentEmbedding) if you want to use indices as the input to the recurrent layer directly.

RaffEdwardBAH commented 8 years ago

I was able to reduce memory usage significantly by using an Embedding layer as the first step, which allows for a more efficient input form, though it isn't quite identical to the model I was hoping to replicate.
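Roughly the shape of what I did, in case it helps (the layer sizes are illustrative and the import paths are from the 0.x API, so this is a sketch rather than my exact model):

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

vocab_size, embed_dim = 10000, 128  # illustrative sizes

model = Sequential()
# Input is a (samples, timesteps) matrix of word indices,
# not a (samples, timesteps, vocab_size) one-hot tensor.
model.add(Embedding(vocab_size, embed_dim))
model.add(LSTM(256))
model.add(Dense(vocab_size))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')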

To push this question a bit further, the categorical_crossentropy loss implementation in Keras uses T.nnet.categorical_crossentropy. The documentation for that method says the target may either have the same dimensions as the prediction, or be a 1-dimensional list of integers that is treated as the one-hot encoding of the target vector. However, if I give Keras a 1-dimensional integer target as the output, I get an error like Input dimension mis-match. (input[0].shape[1] = 1, input[1].shape[1] = 256) (in this case I had a 256-dimensional softmax as the final layer/target). Is there a way to make this input form work in Keras? The memory-use difference is huge for even a modest number of target values.

Extending this further, is there a way to define a layer for mismatched output sizes and Y vectors? That way we could implement more efficient representations for k-hot vectors or other specialized types. This is similar to the question in issue #1043, but on the opposite end of the network. It's a bit trickier, since I want the output to be, say, a 256-dimensional softmax, but I want to represent the target in a more space-efficient format.

tttwwy commented 8 years ago

Is there any way to fix it?

RaffEdwardBAH commented 8 years ago

I actually just got this partially working. When Keras munges the input it converts everything to floatX, so I define my own loss that converts the targets back to ints:

import theano.tensor as T

def my_one_hot_categorical_crossentropy(y_true, y_pred):
    '''Expects a vector of integer class indices instead of a binary class matrix
    '''
    epsilon = 1.0e-7

    y_pred = T.clip(y_pred, epsilon, 1.0 - epsilon)
    # scale preds so that the class probas of each sample sum to 1
    y_pred /= y_pred.sum(axis=-1, keepdims=True)
    # original: cce = T.nnet.categorical_crossentropy(y_pred, y_true)
    # cast the float targets back to int so Theano treats them as class indices
    cce = T.nnet.categorical_crossentropy(y_pred, T.cast(y_true.flatten(), 'int32'))
    return cce

This now accepts (only) an integer class-index target vector. I'm new to writing code with Theano; if there is a way to check for a dimension mismatch, and the mismatch is because the target has a size of 1 along the target dimension, it should call that instead. It would also be good to know how we can insert Theano print statements given the whole Keras pipeline.
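For context, roughly how I'm calling it (the model and data names are placeholders):

import numpy as np

model.compile(loss=my_one_hot_categorical_crossentropy, optimizer='adam')

# One integer class label per sample, shape (n_samples, 1); Keras casts it to
# floatX on the way in, and the loss casts it back to int32.
y_train = np.array([[3], [0], [255]], dtype='int32')
model.fit(X_train, y_train, batch_size=32)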

rex-yue-wu commented 8 years ago

Theano's categorical_crossentropy accepts one-hot targets given as integer class indices, and it expects them to be of type int instead of float. However, Keras converts everything to float, and this is the reason why we have to convert it back to int, as RaffEdwardBAH said. I think we should add this cost function to Keras as a standard one.

RaffEdwardBAH commented 8 years ago

After updating to the new HEAD version of Keras, my custom function no longer seems to work. When I compile I get an error on the cce = T.nnet.categorical_crossentropy(y_pred, T.cast(y_true.flatten(), 'int32')) line that reads:

TypeError                                 Traceback (most recent call last)
<ipython-input-9-9af9fd60d08b> in <module>()
     11 model.add(Activation('softmax'))
     12 
---> 13 model.compile(loss=one_hot_categorical_crossentropy, optimizer=Adam(clipnorm=20))

/usr/local/lib/python2.7/dist-packages/Keras-0.3.0-py2.7.egg/keras/models.pyc in compile(self, optimizer, loss, class_mode, theano_mode)
    382         else:
    383             mask = None
--> 384         train_loss = weighted_loss(self.y, self.y_train, self.weights, mask)
    385         test_loss = weighted_loss(self.y, self.y_test, self.weights, mask)
    386 

/usr/local/lib/python2.7/dist-packages/Keras-0.3.0-py2.7.egg/keras/models.pyc in weighted(y_true, y_pred, weights, mask)
     76         mask: binary
     77         '''
---> 78         score_array = fn(y_true, y_pred)
     79         if mask is not None:
     80             score_array *= mask

<ipython-input-8-08ee58c4e220> in one_hot_categorical_crossentropy(y_true, y_pred)
     12     #orig
     13     #cce = T.nnet.categorical_crossentropy(y_pred, y_true)
---> 14     cce = T.nnet.categorical_crossentropy(y_pred, T.cast(y_true.flatten(), 'int32'))
     15     return cce

/usr/local/lib/python2.7/dist-packages/theano/tensor/nnet/nnet.pyc in categorical_crossentropy(coding_dist, true_dist)
   1875         return crossentropy_categorical_1hot(coding_dist, true_dist)
   1876     else:
-> 1877         raise TypeError('rank mismatch between coding and true distributions')
   1878 
   1879 

TypeError: rank mismatch between coding and true distributions

Does anyone have an idea of what needs to change to fix this, or is this a regression somewhere else?

jfsantos commented 8 years ago

Just wanted to add that I'm having TypeError issues as well when using categorical_crossentropy. Mine are a bit weirder:

TypeError: ('An update must have the same type as the original shared variable (shared_var=
<TensorType(float32, matrix)>, shared_var.type=TensorType(float32, matrix), 
update_val=Elemwise{add,no_inplace}.0, update_val.type=TensorType(float64, matrix)).', 'If the 
difference is related to the broadcast pattern, you can call the tensor.unbroadcast(var, 
axis_to_unbroadcast[, ...]) function to remove broadcastable dimensions.')

I don't know why some things were converted to float32 and others are float64. The inputs in my case are sequences of integers passed through an Embedding layer, and the outputs are sequences of one-hot vectors (int32).

EderSantana commented 8 years ago

@jfsantos do you have more details? I only get that if I try to set something by hand with set_value, since numpy defaults to float64.

jfsantos commented 8 years ago

I solved that error by setting floatX to float64 in $HOME/keras/keras.json (I'm testing on a CPU, so it's not a big deal). My model is a stack of LSTMs (with return_sequences=True) and a TimeDistributedDense layer with a softplus activation on top.

The only "fancy" thing I'm doing is that my labels are sequences too, but I did that before the move to multiple backends with MSE as criterion and it worked well (I mean, the results were horrible, but the code worked).

EderSantana commented 8 years ago

Interesting, was your theano floatX=float64 already?

Btw, Keras does the following to calculate cost functions: (samples, time, dim) -> (samples*time, dim). If you have more dimensions than that, it will do (samples, time, row, col) -> (samples*time*row, col), which messes up the cost average. I had sequence-to-sequence learning with input and output videos and my results were horrible too xD. But I don't think that is your problem, right?
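In numpy terms, the collapse looks like this (the shapes are just examples):

import numpy as np

samples, time, dim = 4, 10, 256
y = np.random.rand(samples, time, dim).astype('float32')
y_flat = y.reshape(samples * time, dim)          # (40, 256): time folded into the batch axis

# With an extra spatial axis the same collapse folds rows in as well,
# which is what skews the cost average for video-like targets:
samples, time, row, col = 4, 10, 32, 32
v = np.random.rand(samples, time, row, col).astype('float32')
v_flat = v.reshape(samples * time * row, col)    # (1280, 32)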

jfsantos commented 8 years ago

Yes, I did not have a config for floatX so it defaults to float64 on the CPU.

And nope: my system has a softmax output which is a one-hot encoding of a single label per timestep, so I don't have that kind of issue and evaluating the error over all timesteps at once should not be a big deal.

tttwwy commented 8 years ago

Will setting floatX=float32 be okay?

hamidpalangi commented 8 years ago

@RaffEdwardBAH: Did you find a workaround for your "TypeError: rank mismatch between coding and true distributions"? I am getting the same error message.

My number of class labels is about 60,000 (= vocabulary size), which is typical when using one-hot vectors in language model training. It is not feasible for me to use np_utils.to_categorical ...

funkindy commented 7 years ago

Hello! I am trying to train an RNN LSTM with an embedding layer on top. My problem is similar to @Palang2014's. In a word-level language model I have about 65k class labels, so I can't one-hot encode them because of memory usage. Can anyone give a tip on how to overcome this?

Thanks.

Edit: it seems like sparse_categorical_crossentropy helped.
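Roughly (the model and data names are placeholders; with sequence outputs it's one index per timestep instead of one per sample):

import numpy as np

# Integer targets, no 65k-wide one-hot matrix in memory;
# the model still ends in a vocab_size-way softmax.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

y_train = np.array([[17], [64999], [3]], dtype='int32')  # one class index per sample
model.fit(X_train, y_train, batch_size=128)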

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.