Word-level weighting: end of sentence?

kpu commented 6 years ago

Word-level weights are specified for each token in the surface form of the sentence. But the framework is adding </s> to every sentence and one does not currently specify a weight for this. We are unclear whether Marian currently weights that as 0, 1, or undefined.
I think the user should be specifying the weight of </s> as well. This way sentence weighting will be equivalent to word-level weighting if all the weights are the same. And it forces the user to be aware of this issue.

cc @PinzhenChen @fiqas

snukky commented 6 years ago

Weights for EOS and for words in further positions (if the sentence is shorter than the longest one in the batch) are 1: https://github.com/marian-nmt/marian-dev/blob/master/src/data/corpus_base.h#L334

kpu commented 6 years ago

Is there error checking that the number of weights corresponds to the sentence length?

snukky commented 6 years ago

No. I think a warning would be reasonable in that case.

frankseide commented 6 years ago

Would one typically want a weight != 1 for EOS? I feel that the number of tokens in the main text and the weights should be the same. So if EOS is not explicitly included in the data, it should also not be included in the weights and default to 1. If users want a different weight, they would include EOS explicitly in the data as well. Would that make sense? Only arguing from consistency between data files.

kpu commented 6 years ago

When setting word-level weights, one would typically want to also set the end of sentence weight. I can't think of a use case for setting weights but leaving the end of sentence as 1.

Yes, one more column of weights than words is inconsistent. But I think being inconsistent about the word input format (i.e. sometimes you append </s> to the sentence) would be worse.

fiqas commented 6 years ago

I'm confused because when I was doing my experiments, I haven't seen </s>in a batch? When printing Ptr<data::CorpusBatch>, I don't see</s> added in it:

So my weights are only based on what's in the actual batch. Is </s>hardcoded somewhere?

Is it like first layer of 0 is EOF and the rest is padding? How is that treated?

frankseide commented 6 years ago

Note that the output above should be read vertically. </s> has a hard-coded value if you don't explicitly include it in the vocabulary:

const Word DEFAULT_EOS_ID = 0;
const Word DEFAULT_UNK_ID = 1;

fiqas commented 6 years ago

I know it's 0 in vocab, but the matrix is padded with zeroes too.

23.10.2018 17:58 "Frank Seide" notifications@github.com napisał(a):

Note that the output above should be read vertically. has a hard-coded value if you don't explicitly include it in the vocabulary:

const Word DEFAULT_EOS_ID = 0; const Word DEFAULT_UNK_ID = 1;

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/marian-nmt/marian-dev/issues/354#issuecomment-432304806, or mute the thread https://github.com/notifications/unsubscribe-auth/AItLJyczYabOvwsO8oLaOlIUPSxx4Nbeks5unzyRgaJpZM4X1PwR .