saahiluppal / catr

Image Captioning Using Transformer
Apache License 2.0

How to slice <eos> token with different sentence length #23

Open rshaojimmy opened 2 years ago

rshaojimmy commented 2 years ago

As I want the model to learn to predict the end token, I exclude it from the input to the model by simply slicing the <eos> token off the end of the sequence. Thus:

trg = [sos, x_1, x_2, x_3, eos]
trg[:-1] = [sos, x_1, x_2, x_3]

This is also the same as your implementation.

But in practice many datasets contain sentences of different lengths, and thus the last elements of the padded sequences are <pad> tokens, such as:

trg = [sos, x_1, x_2, x_3, eos, pad, pad, pad]
trg[:-1] = [sos, x_1, x_2, x_3, eos, pad, pad]

In such a case, I can't slice off the <eos> token this way. May I ask how I can solve this issue?

saahiluppal commented 2 years ago
# Assuming PAD and EOS are the integer ids of the <pad> and <eos> tokens:
while PAD in array:
    array.remove(PAD)    # strip every padding token
array.remove(EOS)        # then drop the <eos> token
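
For example, a runnable check with hypothetical integer ids (0 for <pad>, 2 for <eos>):

PAD, EOS = 0, 2
array = [1, 11, 12, 13, 2, 0, 0, 0]   # [sos, x_1, x_2, x_3, eos, pad, pad, pad]
while PAD in array:
    array.remove(PAD)
array.remove(EOS)
print(array)                          # [1, 11, 12, 13] -> [sos, x_1, x_2, x_3]
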
rshaojimmy commented 2 years ago

Thanks for your quick reply!

But if I remove <eos> from the array, how can the model learn to stop generating a sentence without encountering the <eos> token?

saahiluppal commented 2 years ago

The model itself will predict the <eos> token.

If the model doesn't predict the <eos> token and the entire sentence is gibberish, then the model isn't generalizing well or the data is insufficient.
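
At inference time this just means the decoding loop stops as soon as <eos> is emitted. A minimal greedy-decoding sketch, assuming the model(samples, caption, cap_mask) call used elsewhere in this thread and hypothetical sos_id / eos_id / pad_id values:

import torch

sos_id, eos_id, pad_id, max_len = 1, 2, 0, 20   # hypothetical ids and length limit

@torch.no_grad()
def greedy_caption(model, samples):
    caption = torch.full((1, max_len), pad_id, dtype=torch.long)
    cap_mask = torch.ones((1, max_len), dtype=torch.bool)    # True = not generated yet
    caption[0, 0], cap_mask[0, 0] = sos_id, False
    for i in range(1, max_len):
        logits = model(samples, caption, cap_mask)           # (1, max_len, vocab_size)
        next_token = logits[0, i - 1].argmax().item()
        caption[0, i], cap_mask[0, i] = next_token, False
        if next_token == eos_id:
            break          # the model has signalled the end of the sentence
    return caption[0, :i + 1]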

rshaojimmy commented 2 years ago

But we should let the target used for the loss, trg[1:], keep the eos token, right? Like this: trg[1:] = [x_1, x_2, x_3, eos, pad, pad] or trg[1:] = [x_1, x_2, x_3, eos]

saahiluppal commented 2 years ago

It depends on your training dataset.

If your dataset has special tokens like <sos> and <eos>, then yes, these should be considered in the loss.

<pad> tokens, on the other hand, do not contribute to the loss.
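
One common way to get this behaviour in PyTorch (a sketch, with pad_id as a hypothetical pad token id) is to pass ignore_index to the loss, so <pad> positions are skipped while <sos>/<eos> positions still count:

import torch

pad_id = 0                                          # hypothetical pad token id
criterion = torch.nn.CrossEntropyLoss(ignore_index=pad_id)

# target = caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad]
target = torch.tensor([[11, 12, 13, 2, 0, 0, 0]])   # eos = 2, pad = 0
logits = torch.randn(1, 7, 100)                     # (batch, seq_len, vocab_size)
loss = criterion(logits.permute(0, 2, 1), target)   # pad positions are ignored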

rshaojimmy commented 2 years ago

Thanks.

To sum up, I just want to create a dataset with sequences of different lengths. In such a dataset, I insert sos and eos at the beginning and end of each sequence as the ground truth, like this:

caps = [sos, x_1, x_2, x_3, eos]

In such a case,

caps[:, :-1] = [sos, x_1, x_2, x_3]
caps[:, 1:] = [x_1, x_2, x_3, eos]

This is what we want for the loss calculation.

outputs = model(samples, caps[:, :-1], cap_masks[:, :-1])
loss = criterion(outputs.permute(0, 2, 1), caps[:, 1:])

However, given the different lengths, I have to further insert pad tokens to make the lengths consistent, such as:

caps = [sos, x_1, x_2, x_3, eos, pad, pad, pad]

In this case,

caps[:, :-1] = [sos, x_1, x_2, x_3, eos, pad, pad]
caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad]

The input to the model (caps[:, :-1]) will now contain the eos token, which we want to remove.

Considering this, I simply replace the eos token with a pad token in the input, since pad tokens are not included in the loss, like this:

caps[:, :-1] = [sos, x_1, x_2, x_3, pad, pad, pad]

And I keep caps[:, 1:] as

caps[:, 1:] = [x_1, x_2, x_3, eos, pad, pad, pad]
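
Putting the whole preparation together, a rough sketch (with pad_id / eos_id as hypothetical token ids, and ignore_index so pad positions drop out of the loss):

import torch

pad_id, eos_id = 0, 2                      # hypothetical token ids
# caps: (batch, max_len) padded captions, e.g. [sos, x_1, x_2, x_3, eos, pad, pad, pad]
caps = torch.tensor([[1, 11, 12, 13, 2, 0, 0, 0]])

dec_input = caps[:, :-1].clone()
dec_input[dec_input == eos_id] = pad_id    # hide <eos> from the decoder input
target = caps[:, 1:]                       # keep <eos> in the loss target

criterion = torch.nn.CrossEntropyLoss(ignore_index=pad_id)
# outputs = model(samples, dec_input, cap_masks[:, :-1])
# loss = criterion(outputs.permute(0, 2, 1), target)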

May I ask whether this makes sense?

saahiluppal commented 2 years ago

You should consider the eos token in the loss, because you want your model to learn when to stop generating a sentence.
