taoshen58 / DiSAN

Code of Directional Self-Attention Network (DiSAN)
Apache License 2.0

What is the var rep_mask in disan.py for? #18

Open Jackerry-H opened 5 years ago

Jackerry-H commented 5 years ago

Hi, thanks for your contributions. I'm confused about the rep_mask variable in the DiSA block. The positional mask is an element-wise add op in Figure 2, but in your code the function directional_attention_with_dense() contains the lines rep_mask_tile = ... and attn_mask = ... What is the effect of these two lines in this function? Another confusion: what do the functions mask_for_high_rank() and exp_mask_for_high_rank() do? Thanks for your attention.

taoshen58 commented 5 years ago

The rep_mask is similar to a 2D version of sequence_length in dynamic_rnn: it is a tf.bool tensor with shape [batch_size, max_sequence_length], in contrast to the 1D sequence_length with shape [batch_size].
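
For example (a minimal sketch, not code from this repo, assuming TF 1.x), such a 2D mask can be derived from the usual 1D lengths with tf.sequence_mask:

```python
# Minimal sketch (not repo code), assuming TF 1.x: build a [batch_size,
# max_sequence_length] tf.bool mask from the 1D per-sequence lengths.
import tensorflow as tf

seq_len = tf.constant([5, 3, 1])                  # 1D lengths, as used by dynamic_rnn
rep_mask = tf.sequence_mask(seq_len, maxlen=5)    # tf.bool, shape [3, 5]

with tf.Session() as sess:
    print(sess.run(rep_mask))
    # [[ True  True  True  True  True]
    #  [ True  True  True False False]
    #  [ True False False False False]]
```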

Jackerry-H commented 5 years ago

Why use rep_mask in disan.py? The DiSA block only needs the embedded word vectors as input, so why add the rep_mask? I'm not clear on this.

taoshen58 commented 5 years ago

The module needs to know which words are valid and which words are padding ones.

zhengyima commented 5 years ago

Hello. I want to know which type of words in rep_mask is True: valid ones or padding ones? Thank you!

Vichoko commented 5 years ago

So what's the criterion for setting a value to true or false in the rep_mask input tensor?

taoshen58 commented 5 years ago

@Vichoko True for a valid token in a sentence and False for a padding token. Alternatively, you can use tf.cast(attention_mask, tf.bool) as the input, where attention_mask is the mask format used by the BERT model.
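
For example (illustrative only, assuming the usual BERT convention of 1 = real token, 0 = padding):

```python
# Illustrative only: a BERT-style 0/1 attention mask cast to the boolean
# rep_mask format (True = valid token, False = padding).
import tensorflow as tf

attention_mask = tf.constant([[1, 1, 1, 0, 0],
                              [1, 1, 0, 0, 0]])   # [batch_size, max_seq_len]
rep_mask = tf.cast(attention_mask, tf.bool)
```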

Vichoko commented 5 years ago

Thank you @taoshen58 for the answer. So this mask exists to hide some sequence elements from the self-attention mechanism, which is especially useful in ML tasks that require masking, like BERT's masked LM where some words are masked?

If masking is only for padding, why would it be needed at all, given that DiSAN works for arbitrary-length sequences? Why would someone include padding if the input sequence could be clipped to just the tokens needed for the given task?

taoshen58 commented 5 years ago

@Vichoko Because input data are batched for efficiency, and in general the sequences in a batch do not all have the same length.
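
For instance (an illustrative sketch, not repo code), batching three id sequences of different lengths forces padding, and rep_mask records which positions are real:

```python
# Illustrative sketch (not repo code): pad variable-length id sequences to the
# batch maximum and record the real positions in a boolean mask.
import numpy as np

batch = [[4, 9, 7, 2], [5, 3], [8]]               # token-id sequences of different lengths
max_len = max(len(s) for s in batch)

token_seq = np.zeros((len(batch), max_len), dtype=np.int32)  # 0 used as pad id here
rep_mask = np.zeros((len(batch), max_len), dtype=bool)
for i, seq in enumerate(batch):
    token_seq[i, :len(seq)] = seq
    rep_mask[i, :len(seq)] = True

# token_seq -> [[4 9 7 2]         rep_mask -> [[ True  True  True  True]
#               [5 3 0 0]                      [ True  True False False]
#               [8 0 0 0]]                     [ True False False False]]
```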

Vichoko commented 5 years ago

Thank you for your time, @taoshen58. I think I didn't express my doubt well enough in the previous comment, because I didn't quite follow your answer; I'd appreciate more help with this.

Why is padding needed in a model whose strength is accepting inputs of arbitrary length? Why is this related to efficiency?

I'm trying to understand the workflow for applying this architecture to a supervised learning task that I hope to publish in the upcoming year, so I've been reading the SST_disan code for further understanding.

Here the rep_mask is self.token_mask, which is the result of tf.cast(self.token_seq, tf.bool), and self.token_seq is an int32 tensor. If I get it right, in this case the mask is always true for all elements, because self.token_seq is a list of token ids which are always different from 0.

I can't find more info about self.token_seq, because its initialization is just an empty placeholder: self.token_seq = tf.placeholder(tf.int32, [None, None], name='token_seq')

So, in practice, can this mask be all true for a supervised approach?

Vichoko commented 5 years ago

I think I'm getting closer to the answer. Please tell me if I'm right about the following.

Input data is stacked into batches of shape (num_sequences_in_batch, max_sequence_len_in_batch), so the masking is used to mark the padded positions of sequences whose length is less than the max-length sequence in that batch?
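
If so, and assuming the pad id is 0 (which would also explain the tf.cast in SST_disan), something like this sketch is what I imagine happens; values here are just illustrative, not repo code:

```python
# Illustrative sketch only (not repo code), assuming TF 1.x and pad id 0:
# the placeholder is fed already-padded id matrices, so casting to bool
# yields False exactly at the padded positions.
import numpy as np
import tensorflow as tf

token_seq = tf.placeholder(tf.int32, [None, None], name='token_seq')
token_mask = tf.cast(token_seq, tf.bool)   # this becomes the rep_mask

padded_batch = np.array([[12, 7, 3, 0, 0],     # length-3 sequence padded to 5
                         [ 5, 9, 8, 2, 6]],    # length-5 sequence, no padding
                        dtype=np.int32)

with tf.Session() as sess:
    print(sess.run(token_mask, feed_dict={token_seq: padded_batch}))
    # [[ True  True  True False False]
    #  [ True  True  True  True  True]]
```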