songhaoyu / BoB

The released code for the ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'
https://aclanthology.org/2021.acl-long.14/
Apache License 2.0

`per_input_ids=persona_input_ids` seems not to be used when `ul_training` #12

Closed. cingtiye closed this issue 2 years ago.

cingtiye commented 2 years ago

For the following code:

if ul_training:
    decoder_input_ids = inference_dict['neg_hyp_input_ids']
    hyp_attention_mask = inference_dict['neg_hyp_attention_mask']
    mask_flag = torch.Tensor.bool(1 - hyp_attention_mask)
    labels = decoder_input_ids.masked_fill(mask_flag, -100)
    persona_input_ids = inference_dict['neg_pre_input_ids']

    ul_outputs = self.decoder2(
        input_ids=decoder_input_ids,
        attention_mask=hyp_attention_mask,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        inputs_embeds=None,
        labels=labels,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        per_input_ids=persona_input_ids,
        ul_training=ul_training,
        **kwargs_decoder2,
    )

Because `encoder_hidden_states=None`, only the plain `self.attention` call can be executed; the other attention branches cannot be reached. So `per_input_ids=persona_input_ids` in the call to `self.decoder2` above seems to be unused.

So what is the hyp generated from, then? Please advise.

haoyusoong commented 2 years ago

I didn't quite understand what you mean, so let me address it based on my best guess: `self.decoder2` is itself a full BERT model, and without `encoder_hidden_states` it works like a normal BERT, taking `input_ids` (and, in this project, `per_input_ids`) as input. In fact, when `encoder_hidden_states` is None, the huggingface library treats `self.decoder2` as an encoder rather than a decoder. Either way, it does not matter for `ul_training` whether the model acts as an encoder or a decoder.
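
To illustrate that fallback with the stock huggingface classes (a minimal sketch against the standard `transformers` API, not the modified classes in this repository):

import torch
from transformers import BertConfig, BertModel

# A small BERT stack configured as a decoder with cross-attention layers.
config = BertConfig(hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
                    intermediate_size=128, is_decoder=True, add_cross_attention=True)
model = BertModel(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))

# With encoder_hidden_states=None the cross-attention sub-layers are simply
# skipped, so the forward pass still goes through like a plain BERT.
outputs = model(input_ids=input_ids, encoder_hidden_states=None)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 8, 64])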

cingtiye commented 2 years ago

Thanks. Regarding the following code (starting at line 382):

class BertLayer(nn.Module):
    def __init__(self, config):
       """..."""

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
        per_hidden_states=None,
    ):

        if self.is_decoder2 and encoder_hidden_states is not None:
            per_attention_outputs = self.attention(
                per_hidden_states,
                None,
                head_mask,
                output_attentions=output_attentions,
            )
            per_attention_output = per_attention_outputs[0]

        self_attention_outputs = self.attention(
            hidden_states,
            attention_mask,
            head_mask,
            output_attentions=output_attentions,
        )
        attention_output = self_attention_outputs[0]

        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

        if self.is_decoder and encoder_hidden_states is not None and not self.is_decoder2:
            assert hasattr(
                self, "crossattention"
            ), f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`"
            cross_attention_outputs = self.crossattention(
                attention_output,
                attention_mask,
                head_mask,
                encoder_hidden_states,
                encoder_attention_mask,
                output_attentions,
            )
            attention_output = cross_attention_outputs[0]
            outputs = outputs + cross_attention_outputs[1:]

        elif self.is_decoder2 and encoder_hidden_states is not None:
            assert hasattr(
                self, "crossattention"
            ), f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`"
            query_hidden_states = self.crossattention(
                attention_output,
                None,
                head_mask,
                per_attention_output,
                None,
                output_attentions,
            )[0]
            cross_attention_outputs = self.crossattention(
                query_hidden_states,
                attention_mask,
                head_mask,
                encoder_hidden_states,
                encoder_attention_mask,
                output_attentions,
            )
            attention_output = cross_attention_outputs[0]
            outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights

        layer_output = apply_chunking_to_forward(
            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
        )
        outputs = (layer_output,) + outputs
        return outputs

    def feed_forward_chunk(self, attention_output):
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output, attention_output)
        return layer_output

When `ul_training=True` and `encoder_hidden_states=None`, only line 417 of the code in the link above will be executed. That `self.attention` call at line 417 does not take the parameter `per_hidden_states` (which comes from `per_input_ids=persona_input_ids=inference_dict['neg_pre_input_ids']`), and the other attention calls that do use `per_hidden_states` can never be executed. So I still can't understand how the hyp is generated.
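
To spell out the guards in `BertLayer.forward` during `ul_training` (a standalone walk-through; the flag values below are assumed for illustration, not taken from the repository):

# Assumed values during ul_training, per the discussion above:
encoder_hidden_states = None
is_decoder, is_decoder2 = True, True   # illustrative; the outcome is the same either way

persona_branch = is_decoder2 and encoder_hidden_states is not None                      # False
cross_branch_1 = is_decoder and encoder_hidden_states is not None and not is_decoder2   # False
cross_branch_2 = is_decoder2 and encoder_hidden_states is not None                      # False

# All three guards are False because encoder_hidden_states is None, so only the
# plain self.attention(hidden_states, ...) call runs and per_hidden_states is
# never used in this pass.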

haoyusoong commented 2 years ago

I see your point. In `ul_training` the hyp is not generated but discouraged by the unlikelihood objective, which is implemented by reversing the cross-entropy loss.
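
Roughly, the idea looks like this (a minimal sketch of a token-level unlikelihood loss on the negative hypotheses; the function and variable names are illustrative, not this repository's exact implementation):

import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, neg_labels, ignore_index=-100):
    # Instead of maximizing log p(token) for the negative (persona-inconsistent)
    # hypothesis, penalize those tokens by minimizing -log(1 - p(token)).
    probs = F.softmax(logits, dim=-1)                  # (batch, seq, vocab)
    mask = neg_labels.ne(ignore_index).float()         # skip the -100 positions
    safe_labels = neg_labels.clamp(min=0)              # avoid gathering at index -100
    p_neg = probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    loss = -torch.log((1.0 - p_neg).clamp(min=1e-8))   # discourage the negative tokens
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)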