marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Word-level weighting: end of sentence? #354

Closed kpu closed 5 years ago

kpu commented 6 years ago

Word-level weights are specified for each token in the surface form of the sentence. But the framework appends </s> to every sentence, and one cannot currently specify a weight for it. It is unclear whether Marian currently weights it as 0, as 1, or leaves it undefined.
I think the user should specify the weight of </s> as well. That way sentence-level weighting is equivalent to word-level weighting when all the weights are the same, and it forces the user to be aware of this issue.
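For example (a hypothetical weights line under this proposal, not the current format), a two-token sentence would carry three weights, the last one for the implicit </s>:

    Hello world        <- sentence: two surface tokens; Marian appends </s>
    0.5 0.5 1.0        <- proposed weights: Hello, world, and </s>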

cc @PinzhenChen @fiqas

snukky commented 6 years ago

Weights for EOS and for words in further positions (if the sentence is shorter than the longest one in the batch) are 1: https://github.com/marian-nmt/marian-dev/blob/master/src/data/corpus_base.h#L334
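A minimal sketch of that default (illustrative names, not the actual Marian code): weight vectors shorter than the batch width are filled up with 1s, so EOS and padding positions receive a neutral weight:

    // Illustrative sketch only: pad a sentence's weight vector to the
    // batch width with 1s, so </s> and padding positions get weight 1.
    #include <vector>

    void padWeights(std::vector<float>& weights, size_t batchWidth) {
      weights.resize(batchWidth, 1.f); // unspecified positions default to 1
    }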

kpu commented 6 years ago

Is there error checking that the number of weights corresponds to the sentence length?

snukky commented 6 years ago

No. I think a warning would be reasonable in that case.

frankseide commented 6 years ago

Would one typically want a weight != 1 for EOS? I feel that the number of tokens in the main text and the weights should be the same. So if EOS is not explicitly included in the data, it should also not be included in the weights and default to 1. If users want a different weight, they would include EOS explicitly in the data as well. Would that make sense? I'm only arguing from consistency between the data files.

kpu commented 6 years ago

When setting word-level weights, one would typically want to also set the end of sentence weight. I can't think of a use case for setting weights but leaving the end of sentence as 1.

Yes, one more column of weights than words is inconsistent. But I think being inconsistent about the word input format (i.e. sometimes you append </s> to the sentence) would be worse.

fiqas commented 6 years ago

I'm confused, because when I was doing my experiments I didn't see </s> in a batch. When printing a Ptr<data::CorpusBatch>, I don't see </s> added to it:

[screenshot: printed contents of a data::CorpusBatch, 2018-10-23]

So my weights are only based on what's in the actual batch. Is </s> hardcoded somewhere?

Is it that the first row of 0s is the EOS and the rest is padding? How is that treated?

frankseide commented 6 years ago

Note that the output above should be read vertically. </s> has a hard-coded value if you don't explicitly include it in the vocabulary:

    const Word DEFAULT_EOS_ID = 0;
    const Word DEFAULT_UNK_ID = 1;

fiqas commented 6 years ago

I know it's 0 in vocab, but the matrix is padded with zeroes too.


emjotde commented 6 years ago

Think of it as Hello world </s> </s> </s> . I was always wondering whether we should use a special padding symbol.
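Concretely (a schematic, not actual Marian output), a batch of two sentences of different lengths, laid out time-major and read vertically as noted above, assuming the default </s> id 0:

    step   sent 1       sent 2        mask
    0      Hello        Hi            1  1
    1      world        0 (</s>)      1  1
    2      0 (</s>)     0 (padding)   1  0

The word id alone cannot distinguish the real </s> from padding; only the mask can.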

emjotde commented 6 years ago

@snukky Also, if padding the weights with 1s works, does simply adding one more weight work too, i.e. does it actually weight </s> and beyond?

frankseide commented 6 years ago

@emjotde, with special padding symbol, do you mean a vocab entry, or an index that is obviously invalid, such as -1?

emjotde commented 6 years ago

More philosophically (or empirically), whether Hello world </s> <pad> <pad> is in any way better than repeating </s>. On the other hand, the spurious </s> aren't getting any gradients due to masking anyway.

fiqas commented 6 years ago

That is a huge issue for me, because in my weighting I was counting word frequencies in batches as training progressed and treating 0 like any other word, assuming it would be zeroed out anyway because it's only padding. That means I was overcounting </s> and probably assigning a very small p(EOS)...

emjotde commented 6 years ago

@fiqas That would be a mistake in your counting though, not in our handling of </s>.

fiqas commented 6 years ago

I know, but distinguishing between </s> and padding would be a good idea.

emjotde commented 6 years ago

Not if the only reason for doing this is avoiding incorrect counting.

fiqas commented 6 years ago

So how would you check whether a 0 is EOS or padding now?


emjotde commented 6 years ago

Look at the mask. If in the batch the word symbol is 0, but the corresponding flag in the mask is 1, it’s a proper EOS. Batch and mask have the same number of elements laid out in the same order. So, you compare at the same indices.

A 0 in the mask indicates a non-word -> padding.
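As a sketch of that check (reusing the data() and mask() accessors from corpus_base.h; the variable names are illustrative, and sb stands for a sub-batch as in the snippet below):

    // Sketch: classify every 0-valued word id in a sub-batch as either a
    // real </s> or padding; mask() shares the index layout of data().
    size_t realEos = 0, padded = 0;
    for(size_t idx = 0; idx < sb->batchWidth() * sb->batchSize(); ++idx) {
      if(sb->data()[idx] == 0) {   // id 0: either </s> or padding
        if(sb->mask()[idx] != 0)
          ++realEos;               // mask 1 -> a genuine </s>
        else
          ++padded;                // mask 0 -> padding
      }
    }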

frankseide commented 6 years ago

Are you parsing the log output? Then you would need to make some small modifications. It would be here (corpus_base.h), where you could introduce a check for sb->mask()[idx]:

  for(size_t i = 0; i < sb->batchWidth(); i++) {
    std::cerr << "\t w: ";
    for(size_t j = 0; j < sb->batchSize(); j++) {
      size_t idx = i * sb->batchSize() + j;
      Word w = sb->data()[idx];
      if(sb->mask()[idx] == 0)   // the suggested check: 0 in the mask is padding
        continue;                // skip padding so only real tokens are printed
      if(vocab)
        std::cerr << (*vocab)[w] << " ";
      else
        std::cerr << w << " ";
    }
    std::cerr << std::endl;
  }

emjotde commented 6 years ago

Also, the number of real </s> is equal to the batch size (in terms of sentences). You don't even need to count them.

emjotde commented 6 years ago

@snukky What's the verdict here: does word-level weighting allow weighting </s> simply by making the weights string one item longer?

snukky commented 6 years ago

Yes, adding a weight for EOS works. I added a regression test for this.

Do we want a warning if a sentence with n tokens has fewer than n or more than n+1 weights specified?

emjotde commented 6 years ago

Yeah, let's warn with line number.
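Something along these lines (a sketch of the agreed warning, not the committed code; weights, numTokens, and lineNo are illustrative names, and the exact message of the LOG(warn, ...) call is made up):

    // Sketch: n tokens may carry n or n+1 weights (the extra one for </s>);
    // anything else gets a warning that points at the offending line.
    if(weights.size() != numTokens && weights.size() != numTokens + 1)
      LOG(warn,
          "Number of weights ({}) does not match the number of words ({}) in line {}",
          weights.size(), numTokens, lineNo);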