timbmg / Sentence-VAE

PyTorch Re-Implementation of "Generating Sentences from a Continuous Space" by Bowman et al., 2015: https://arxiv.org/abs/1511.06349

Repeated content #7

Open nguyenvo09 opened 5 years ago

nguyenvo09 commented 5 years ago

I used your code and trained a model to generate new sentences. The problem is that there are many repeated tokens in the generated samples.

Any insight on how to deal with this?

For example, the <unk> token appears many times.

https://pastebin.com/caxz43CQ

timbmg commented 5 years ago

For how long did you train? What was your final KL/NLL loss? Also, with what min_occ did you train?

Also, looking at them, the samples actually don't look that bad. There is certainly an issue with <unk> tokens: they can be repeated many times before an <eos> token is finally produced. However, I think that is expected, since the network really does not know what <unk> stands for, so any number of <unk>'s can appear. If you move on to another dataset where the training and validation sets are more similar, you should get fewer <unk>'s.
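As a side note, a common workaround (not part of this repo) is to forbid <unk> at sampling time by masking its logit before drawing the next token. A minimal sketch, assuming logits of shape [batch, vocab] and that unk_idx is the vocabulary index of <unk>:

    import torch

    def sample_without_unk(logits: torch.Tensor, unk_idx: int, temperature: float = 1.0) -> torch.Tensor:
        """Sample next-token ids while forbidding the <unk> token.

        logits: [batch, vocab] unnormalized decoder outputs.
        unk_idx: vocabulary index of <unk> (assumed known from the dataset).
        """
        # Dividing by temperature returns a new tensor, so the caller's logits stay untouched.
        logits = logits / temperature
        # Setting the <unk> logit to -inf makes its probability zero after softmax.
        logits[:, unk_idx] = float('-inf')
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).squeeze(1)  # [batch]

This does not fix repetition in general, but it prevents the long runs of <unk> discussed above.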

preke commented 5 years ago

Is it a seq2seq-like model you want to implement? I have met the same problem. It seems that during training the decoder input also has to be sorted by length, while at inference time we have no prior knowledge of the lengths of the sentences we want to generate, so that information is essentially lost. Also, it seems that a seq2seq-like decoder can only be implemented as an RNN language model. Is that true? For example, in the code below:

            # Autoregressive sampling: feed the previously sampled token back in at every step.
            # outputs is assumed to be a pre-allocated [batch, max_sequence_length, vocab] tensor of log-probabilities.
            t = 0
            while t < self.max_sequence_length - 1:
                if t == 0:
                    # Start every sequence with the <sos> token.
                    # (Variable/volatile is pre-0.4 PyTorch style; today torch.no_grad() would be used instead.)
                    input_sequence = Variable(torch.LongTensor([self.sos_idx] * batch_size), volatile=True)
                    if torch.cuda.is_available():
                        input_sequence = input_sequence.cuda()
                        outputs        = outputs.cuda()

                input_sequence  = input_sequence.unsqueeze(1)                             # b x 1
                input_embedding = self.embedding(input_sequence)                          # b x 1 x e
                output, hidden  = self.decoder_rnn(input_embedding, hidden)               # b x 1 x h
                logits          = self.outputs2vocab(output)                              # b x 1 x v
                outputs[:, t, :] = nn.functional.log_softmax(logits, dim=-1).squeeze(1)   # b x v
                # _sample is expected to return the next token ids, shape b
                input_sequence  = self._sample(logits)
                t += 1

            outputs = outputs.view(batch_size, self.max_sequence_length, self.embedding.num_embeddings)
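On the length question: the decoder here is an RNN language model conditioned on the latent code, so at inference time no target lengths are needed; you decode token by token and stop each sequence once it emits <eos>. Below is a minimal sketch in current PyTorch of such a loop (the names decoder_rnn, embedding, outputs2vocab, sos_idx, eos_idx mirror the snippet above but are assumptions, not the repo's exact API, and the GRU is assumed to be batch_first):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def decode(decoder_rnn: nn.GRU,
               embedding: nn.Embedding,
               outputs2vocab: nn.Linear,
               hidden: torch.Tensor,
               sos_idx: int, eos_idx: int,
               max_len: int = 60) -> torch.Tensor:
        """Autoregressive sampling without knowing target lengths in advance.

        hidden: initial decoder state derived from the latent code z, shape [1, batch, hidden_size].
        Returns generated token ids, shape [batch, steps] with steps <= max_len.
        """
        batch_size = hidden.size(1)
        device = hidden.device
        input_sequence = torch.full((batch_size,), sos_idx, dtype=torch.long, device=device)
        finished = torch.zeros(batch_size, dtype=torch.bool, device=device)
        generated = []

        for _ in range(max_len):
            emb = embedding(input_sequence).unsqueeze(1)           # [batch, 1, emb]
            output, hidden = decoder_rnn(emb, hidden)              # [batch, 1, hidden_size]
            logits = outputs2vocab(output.squeeze(1))              # [batch, vocab]
            next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(1)
            # Sequences that already produced <eos> keep emitting <eos> (acts as padding).
            next_token = torch.where(finished, torch.full_like(next_token, eos_idx), next_token)
            generated.append(next_token)
            finished |= next_token.eq(eos_idx)
            if finished.all():
                break
            input_sequence = next_token

        return torch.stack(generated, dim=1)                       # [batch, steps]

So sorting by length only matters for packing padded batches during training; at generation time the length is decided on the fly, and this loop could be combined with the <unk>-masking sketch above.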