soumith / cudnn.torch

Torch-7 FFI bindings for NVIDIA CuDNN
BSD 2-Clause "Simplified" License

Zero-Masking with RNN/LSTM #210

Open quanpn90 opened 8 years ago

quanpn90 commented 8 years ago

Thanks everyone for the wonderful cudnn bindings,

I would like to ask whether NVIDIA provides any interface for masking the hiddenOutputs at each step when the input sequences are padded (as in neural machine translation, for example).

Concretely:

The input sequences are padded with 0s, for example:

```lua
seq = torch.Tensor({{0,0,0,0,1,2,3},
                    {0,0,4,5,6,7,8},
                    {0,0,0,1,2,3,4}}):t():cuda()
```

Thanks to the LookupTableMaskZero module from the rnn package, the LSTM receives zero embeddings at the "0" indexes. I wonder if the cudnn LSTM can mask the hiddenOutput based on the input? My current workaround is to zero out the LSTM biases, so that the hidden states at padded positions stay zero, but I am not sure whether this affects the learning process.
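For reference, a minimal sketch of the setup described above, assuming the rnn and cudnn.torch packages (module names and sizes here are illustrative):

```lua
require 'cudnn'
require 'rnn'

local vocabSize, embSize, hiddenSize = 10, 4, 8

-- seqLen x batchSize index tensor, left-padded with zeros
local seq = torch.Tensor({{0,0,0,0,1,2,3},
                          {0,0,4,5,6,7,8},
                          {0,0,0,1,2,3,4}}):t():cuda()

-- LookupTableMaskZero maps index 0 to an all-zero embedding
local lookup = nn.LookupTableMaskZero(vocabSize, embSize):cuda()
local lstm = cudnn.LSTM(embSize, hiddenSize, 1):cuda()

local emb = lookup:forward(seq)   -- seqLen x batchSize x embSize
local out = lstm:forward(emb)     -- seqLen x batchSize x hiddenSize

-- out at the padded positions is not zero in general: even with zero
-- inputs, nonzero biases produce nonzero gate activations, which is
-- what the zero-bias workaround above tries to suppress
```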

Thank you,

ngimel commented 8 years ago

cudnn RNN/LSTM accepts inputs with different sequence lengths and thus does not require padding. The requirement is that the inputs be sorted in descending order of sequence length. This capability is not yet supported by the Torch bindings, though.
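For illustration, a hedged sketch of the preparation cudnn expects (the sortByLength helper is hypothetical; the Torch bindings did not expose variable-length batches at the time):

```lua
require 'torch'

-- Hypothetical helper: order sequences so the per-step batch size
-- never increases, as cudnn requires.
local function sortByLength(sequences)
  table.sort(sequences, function(a, b) return a:size(1) > b:size(1) end)
  return sequences
end

local batch = sortByLength({
  torch.randn(5, 4),   -- length-5 sequence, 4 features per step
  torch.randn(7, 4),
  torch.randn(3, 4),
})
-- lengths are now {7, 5, 3}: at step t only sequences with length >= t
-- remain in the batch, so no padding is needed
```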

ceberly commented 8 years ago

@ngimel just out of curiosity, where did you get that information about the sequence-length sorting? I'm trying desperately to get the LSTM layer working.

ngimel commented 8 years ago

From the manual :-) The cudnnRNNForwardTraining entry says: "The first dimension of the tensors may decrease from element n to element n+1 but may not increase."
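Concretely, the rule means the per-step batch size may shrink but never grow. A small worked illustration for descending sequence lengths {7, 5, 3}:

```lua
-- For descending lengths {7, 5, 3}, the first (batch) dimension of the
-- per-step input may only decrease from one step to the next.
local lengths = {7, 5, 3}
for t = 1, lengths[1] do
  local batchSize = 0
  for _, len in ipairs(lengths) do
    if len >= t then batchSize = batchSize + 1 end
  end
  print(('step %d: batch size %d'):format(t, batchSize))
end
-- prints batch sizes 3, 3, 3, 2, 2, 1, 1 -- never increasing
```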

ceberly commented 8 years ago

I guess my question is: what manual? :) I can't find anything except the cudnn.h header file, and that information is not in it. I also can't find it in the CUDA manuals. Maybe I'm just being dumb here.

ceberly commented 8 years ago

Never mind, I found it. I think when I downloaded v5 they didn't have a link to the user guide yet. Thank you, and sorry!

quanpn90 commented 8 years ago

@ngimel Hi,

I want to group sequences of different lengths into a batch (sentences, for example), so padding is necessary. By the way, I will try disabling the biases while learning and see if any problems arise. Thank you.

nicholas-leonard commented 8 years ago

@soumith It would be a nice feature request for the next NVIDIA cudnn release. The lack of zero-masking is the only reason I am still not using cudnn LSTMs.

ngimel commented 8 years ago

Sequences with different lengths can already be grouped into a batch without padding; cudnn supports that. The Torch bindings don't, at the moment.

nicholas-leonard commented 8 years ago

@ngimel I am guessing that each row of the batch holds exactly one sequence? If so, this is not the same as zero-masking.

nhynes commented 8 years ago

I, too, was wondering whether feeding in zero-padded variable-length sequences would significantly affect learning a good final hidden state. I wrote a quick script to convince myself that it doesn't, as long as the RNN dimension is high enough.
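A sketch in the spirit of that experiment (not the actual script; the sizes and padding amount here are made up): run one sequence with and without leading zero-padding and compare the final hidden states.

```lua
require 'cudnn'

local inputSize, hiddenSize = 4, 128
local lstm = cudnn.LSTM(inputSize, hiddenSize, 1):cuda()

local seq = torch.randn(5, 1, inputSize):cuda()  -- one length-5 sequence
local padded = torch.zeros(8, 1, inputSize):cuda()
padded:narrow(1, 4, 5):copy(seq)                 -- 3 leading zero steps

local outPlain = lstm:forward(seq):clone()       -- clone: buffer is reused
local outPadded = lstm:forward(padded)

-- distance between the final hidden states with and without padding;
-- small relative to the norms if the padding barely matters
print((outPlain[5] - outPadded[8]):norm())
```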

I've found that it's (at least conceptually) simpler to just group sequences of the same length.
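A hedged sketch of that same-length grouping (bucketing), so each batch needs no padding at all:

```lua
-- Group sequences by length; every bucket can then be stacked into a
-- seqLen x batchSize x inputSize tensor and fed to cudnn.LSTM directly.
local function bucketByLength(sequences)
  local buckets = {}
  for _, s in ipairs(sequences) do
    local len = s:size(1)
    buckets[len] = buckets[len] or {}
    table.insert(buckets[len], s)
  end
  return buckets
end
```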

uprightws commented 7 years ago

@ngimel hi, could you please show me a demo of how sequences with different lengths can be grouped into a batch without padding? Thank you!

ngimel commented 7 years ago

Look at the variable-length sequences test for an example of how it can be done: https://github.com/soumith/cudnn.torch/blob/master/test/test_rnn.lua#L324

leezu commented 7 years ago

@ngimel getting back to your comment from last year on variable-length sequence support in cudnn: I think it refers not to sequence length but to batch size. It seems some clarifications were added in the current manual (cudnnRNNForwardTraining):

The first dimension (batch size) of the tensors may decrease from element n to element n+1 but may not increase. Each tensor descriptor must have the same second dimension (vector length).

Or are you referring to something different? Please let me know if there is a misunderstanding.

Ok, never mind. I had an unrolled version of the operator in mind. When calling cudnnRNNForwardTraining iteratively for each time step, reducing the batch size does of course work, as you mentioned.
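A sketch of that shrinking-batch idea. The Lua bindings do not expose per-step cudnnRNNForwardTraining, so a generic `step` function (x, hPrev) -> h stands in for one RNN step here, and hiddenSize is an assumed parameter:

```lua
local function forwardShrinking(step, input, lengths, hiddenSize)
  -- input: seqLen x batchSize x inputSize; lengths sorted descending
  local h = input.new(input:size(2), hiddenSize):zero()
  for t = 1, input:size(1) do
    local active = 0
    for _, len in ipairs(lengths) do
      if len >= t then active = active + 1 end
    end
    if active == 0 then break end
    -- only the still-active rows take part in this step, so the batch
    -- dimension shrinks over time and finished sequences keep their
    -- final hidden state untouched; no padded step is ever computed
    local x = input[t]:narrow(1, 1, active)
    h:narrow(1, 1, active):copy(step(x, h:narrow(1, 1, active)))
  end
  return h  -- final hidden state per sequence
end
```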