salesforce / simpletod

Official repository for "SimpleTOD: A Simple Language Model for Task-Oriented Dialogue"
https://arxiv.org/abs/2005.00796
BSD 3-Clause "New" or "Revised" License

Shouldn't context be masked during training? #24

Open yuanzhaoz opened 3 years ago

yuanzhaoz commented 3 years ago

If I understand correctly, the idea is that the model generates belief states, DB search results, actions, and the response conditioned on some dialogue context. Shouldn't we then mask the context between <|context|> and <|endofcontext|> during training? We are not trying to generate the past dialogue history; we are trying to generate belief states, etc., given the past dialogue history.

Basically, what I mean is that we would modify the labels on this line https://github.com/salesforce/simpletod/blob/8d694bc1b09c12497488be46879dfe1dede83df3/main.py#L152 to mask the context before passing them to the model.
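
For reference, a minimal sketch of what such label masking could look like (this is not the repository's code; it assumes a HuggingFace-style tokenizer with the SimpleTOD special tokens registered, and the helper name is hypothetical): every label position up to and including <|endofcontext|> is set to -100, the index the LM loss ignores, while the input itself stays untouched.

import torch

def mask_context_labels(input_ids, tokenizer):
    # Hypothetical helper: copy the inputs and blank out the context span in the labels.
    end_of_context_id = tokenizer.convert_tokens_to_ids('<|endofcontext|>')
    labels = input_ids.clone()
    for i, row in enumerate(input_ids):
        positions = (row == end_of_context_id).nonzero(as_tuple=True)[0]
        if len(positions) > 0:                       # assumed to appear once per example
            labels[i, : positions[0] + 1] = -100     # context does not contribute to the loss
    return labels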

pleomax0730 commented 3 years ago

Incorrect: I think that in GPT-2, masking the dialogue history somehow complicates the training process.

Correct: We should train our model on the full input rather than masking the dialogue history.

yuanzhaoz commented 3 years ago

Hi, can you clarify what you mean by complicating the training process? Also, whether or not it complicates the process, I think we should do it the right way (which, to my understanding, means masking out the context). Anyway, this is how I'm doing it right now:

userbelief_token = tokenizer.encode('<|belief|>')[0]   # id of the <|belief|> special token
# Everything before the first <|belief|> token becomes -100 (ignored by the loss);
# everything from <|belief|> onward keeps its original label (assuming those ids are nonzero).
labels = ((torch.cumsum((labels == userbelief_token), dim=1) * labels) == 0) * -100 \
         + torch.cumsum((labels == userbelief_token), dim=1) * labels
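
As a quick sanity check, the same masking can be verified on a toy example (purely hypothetical token ids; <|belief|> is assumed to map to id 5, and masked_fill is used only as a more readable equivalent of the one-liner above):

import torch

labels = torch.tensor([[11, 12, 13, 5, 21, 22]])              # three context ids, then <|belief|> = 5
belief_id = 5
keep = torch.cumsum((labels == belief_id).long(), dim=1) > 0  # True from <|belief|> onward
print(labels.masked_fill(~keep, -100))
# tensor([[-100, -100, -100,    5,   21,   22]])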

pleomax0730 commented 3 years ago

I'll explain my thinking as clearly as possible.

GPT-2 is an autoregressive model. If you treat only the belief tokens as the labels, i.e. as what the model needs to learn, there is going to be a problem at inference time.

If you train your model on belief tokens only, the model can indeed generate belief tokens, but those generated tokens, which depend only on '<|belief|>' as the start token, do not take the history context into account.

What I'm trying to say is that our model still needs to learn the dialogue history, and the belief state it generates depends on that history. Given different contexts, the model then knows which belief states need to be generated.

By training on the dialogue history together with the belief state, the model's representation space learns which belief states are close to or far from different history contexts.

Sorry for saying that masking the dialogue history complicates the training process. We should train our model on the full input rather than masking the dialogue history.
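
For contrast, full-input training of the kind described here looks roughly like this (a sketch, not the repository's exact code; it assumes a HuggingFace GPT2LMHeadModel, and only a subset of the SimpleTOD special tokens is added for illustration): the labels are simply the input sequence itself, so context, belief, action and response tokens all contribute to the loss.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Register the special tokens so each one maps to a single id, then resize the embeddings.
tokenizer.add_tokens(['<|context|>', '<|endofcontext|>', '<|belief|>', '<|endofbelief|>'])
model.resize_token_embeddings(len(tokenizer))

text = '<|context|> i need a cheap hotel <|endofcontext|> <|belief|> hotel pricerange cheap <|endofbelief|>'
input_ids = torch.tensor([tokenizer.encode(text)])

outputs = model(input_ids, labels=input_ids)   # loss over every position, context included
print(outputs.loss)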

yuanzhaoz commented 3 years ago

Hi, I see what you mean. Maybe I was not clear when I said masking. I only want to mask the labels, that is, exclude the context tokens from the loss (by setting the corresponding label positions to -100), while still passing the unmasked sequence to the model as input. The generated belief state, etc. will therefore still be conditioned on the context, while the context does not contribute to the loss. I hope this is clearer.
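
A minimal sketch of that scheme (again assuming a HuggingFace GPT2LMHeadModel-style interface and reusing the hypothetical tokenizer, model and input_ids names from the sketches above): the model receives the complete, unmasked sequence, so the belief, action and response stay conditioned on the context through attention; only the loss skips the context positions.

import torch

belief_id = tokenizer.convert_tokens_to_ids('<|belief|>')
keep = torch.cumsum((input_ids == belief_id).long(), dim=1) > 0   # True from <|belief|> onward
masked_labels = input_ids.masked_fill(~keep, -100)                # context ignored by the loss

outputs = model(input_ids, labels=masked_labels)   # input itself is NOT masked: full conditioning on the context
print(outputs.loss)                                # loss covers only the tokens from <|belief|> onward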

pleomax0730 commented 3 years ago

I see... it's like UniLM from Microsoft. Can you let me know your results after training?

YovaKem commented 3 years ago

@ericZYZ, I'd be curious to know too: were you able to obtain any results, and how did they compare to the ones reported in the paper?

yuanzhaoz commented 3 years ago

Sorry, I didn't run it against MultiWOZ. I'm using this masked version on a private dialogue dataset, and it works pretty well.

pleomax0730 commented 3 years ago

So the masked version performs better on your private dataset than the default GPT-2 model does.