Weights for EOS and for words in further positions (if the sentence is shorter than the longest one in the batch) are 1: https://github.com/marian-nmt/marian-dev/blob/master/src/data/corpus_base.h#L334
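For illustration only (a made-up example, not taken from this issue): with word-level weighting there is one weight per surface token, and the implicit </s> as well as any padding positions default to 1:

    sentence:  Hello world .
    weights:   2.0 1.0 0.5     (one weight per surface token; </s> and padding default to 1)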
Is there error checking that the number of weights corresponds to the sentence length?
No. I think a warning would be reasonable in that case.
Would one typically want a weight != 1 for EOS? I feel that the number of tokens in the main text and the weights should be the same. So if EOS is not explicitly included in the data, it should also not be included in the weights and default to 1. If users want a different weight, they would include EOS explicitly in the data as well. Would that make sense? Only arguing from consistency between data files.
When setting word-level weights, one would typically want to also set the end of sentence weight. I can't think of a use case for setting weights but leaving the end of sentence as 1.
Yes, one more column of weights than words is inconsistent. But I think being inconsistent about the word input format (i.e. sometimes you append </s> to the sentence) would be worse.
I'm confused because, when I was doing my experiments, I didn't see </s> in a batch?
When printing a Ptr<data::CorpusBatch>, I don't see </s> added to it:
So my weights are only based on what's in the actual batch. Is </s> hardcoded somewhere?
Is it that the first layer of 0s is EOS and the rest is padding? How is that treated?
Note that the output above should be read vertically. </s> has a hard-coded value if you don't explicitly include it in the vocabulary:
const Word DEFAULT_EOS_ID = 0;
const Word DEFAULT_UNK_ID = 1;
I know it's 0 in vocab, but the matrix is padded with zeroes too.
Think of it as Hello world </s> </s> </s>.
I was always wondering whether we should use a special padding symbol.
@snukky Also, if padding weights to 1 works, does just adding one additional weight work too and actually weight </s> and beyond?
@emjotde, with special padding symbol, do you mean a vocab entry, or an index that is obviously invalid, such as -1?
More philosophically (or empirically): whether Hello world </s> <pad> <pad> is in any way better than repeating </s>. On the other hand, the spurious </s> aren't getting any gradients due to masking anyway.
That is a huge issue for me, because in my weighting I was counting frequencies of words in batches as training progresses and treating 0 as any other word, assuming that those positions would be zeroed anyway because it's only padding. That means I was overcounting </s> and probably assigning a very small p(EOS)...
@fiqas That would be a mistake in your counting though, not in our handling of </s>.
I know, but distinguishing between </s> and padding would be a good idea.
Not if the only reason for doing this is avoiding incorrect counting.
So how would you check if 0 is EOS or padding now?
Look at the mask. If in the batch the word symbol is 0, but the corresponding flag in the mask is 1, it’s a proper EOS. Batch and mask have the same number of elements laid out in the same order. So, you compare at the same indices.
A 0 in the mask indicates a non-word -> padding.
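As a made-up illustration (my example, not the actual printout above; read vertically, one column per sentence):

    position   sentence 1    sentence 2    mask 1   mask 2
    1          Hello         Good          1        1
    2          world         morning       1        1
    3          </s>          world         1        1
    4          0 (padding)   </s>          0        1

At position 4 of sentence 1 the word id is 0, the same id as </s>, but the mask is 0, so it is padding; at position 3 the id is also 0 and the mask is 1, so it is the real </s>.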
Are you parsing the log output? Then you would need to make some small modifications. It would be here (corpus_base.h), where you could introduce a check for sb->mask()[idx]:
// Debug print of the batch contents in corpus_base.h (excerpt):
for(size_t i = 0; i < sb->batchWidth(); i++) {
  std::cerr << "\t w: ";
  for(size_t j = 0; j < sb->batchSize(); j++) {
    // Batch is laid out time-major: idx walks over all sentences j at time step i.
    size_t idx = i * sb->batchSize() + j;
    Word w = sb->data()[idx];
    if (vocab)
      std::cerr << (*vocab)[w] << " ";
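For @fiqas's counting problem, a minimal sketch of that check (my own illustration, not code from the repo; it reuses sb and vocab from the excerpt above and assumes mask() is laid out in the same order as data()):

for(size_t i = 0; i < sb->batchWidth(); i++) {
  std::cerr << "\t w: ";
  for(size_t j = 0; j < sb->batchSize(); j++) {
    size_t idx = i * sb->batchSize() + j;
    Word w = sb->data()[idx];
    if (sb->mask()[idx] == 0) {
      // Mask is 0: this position is padding, not a real word, so don't count it.
      std::cerr << "<pad> ";              // display label only, not a vocab entry
    } else if (vocab) {
      // Mask is 1: a real token; word id 0 here is a genuine </s>.
      std::cerr << (*vocab)[w] << " ";
    } else {
      std::cerr << w << " ";
    }
  }
  std::cerr << std::endl;
}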
Also, the number of real </s> is equal to the batch size (in terms of sentences). You don't even need to count them.
@snukky What's the verdict here: does word-level weighting allow weighting </s> just by making the weights string longer by one item?
Yes, adding a weight for EOS works. I added a regression test for this.
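As a hypothetical illustration (my example, not taken from the regression test): for a sentence with three surface tokens, a fourth weight applies to the appended </s>:

    sentence:  Hello world .
    weights:   2.0 1.0 0.5 3.0     (the extra fourth weight applies to </s>)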
Do we want to have a warning if a sentence with n tokens has fewer than n or more than n+1 weights specified?
Yeah, let's warn with line number.
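A minimal sketch of such a check (hypothetical, with made-up names; not the actual marian-dev implementation):

#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: warn when a line's word-level weights cover neither its
// tokens (n weights) nor its tokens plus the implicit </s> (n+1 weights).
void checkWeightCount(const std::vector<std::string>& tokens,
                      const std::vector<float>& weights,
                      size_t lineNo) {
  size_t n = tokens.size();
  if (weights.size() < n || weights.size() > n + 1) {
    std::cerr << "[warning] line " << lineNo << ": " << n << " tokens but "
              << weights.size() << " weights given" << std::endl;
  }
}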
Word-level weights are specified for each token in the surface form of the sentence. But the framework is adding </s> to every sentence, and one does not currently specify a weight for this. We are unclear whether Marian currently weights that as 0, 1, or undefined. I think the user should be specifying the weight of </s> as well. This way sentence weighting will be equivalent to word-level weighting if all the weights are the same. And it forces the user to be aware of this issue.

cc @PinzhenChen @fiqas