ryokamoi / dcnn_textvae

TensorFlow implementation of "Improved Variational Autoencoders for Text Modeling using Dilated Convolutions"
MIT License

Question about the processed data #1

Closed by ConanCui 6 years ago

ConanCui commented 6 years ago

Hi, I am a beginner in NLP. I saw the go_input function in batchloader.py, but I couldn't find where it is used. What is the 'GO' token used for in NLP applications, and why isn't it used in this code? Also, could you provide a complete project with some data to show how the code runs, or describe how you processed the data? I couldn't find details about that in your blog. I appreciate your help and look forward to your reply. Thanks!

ConanCui commented 6 years ago

I also have some doubts about the word embedding. Is your dataset big enough that you don't need pre-trained word embeddings? I learned from the deeplearning.ai class that pre-trained word embeddings are usually used to build a language model, with the embeddings kept fixed during training. I see that the embedding in your code is initialized randomly and trained along with the model. Does this difference influence the results?

ryokamoi commented 6 years ago

Hi Conan,

Thank you for your questions.

First, "go_input" function is used in the decoder. It is used as the first input for the decoder. Second, I am planing to publish a new repository with a usage with a open source dataset. Please wait a while. Finally, I agree that the pre-trained word embeddings may improve the result. However, it is sometimes reported that the pre-training of word embeddings does not have a major effect on final results. Thats the only reason I do not do it. It's worth trying it.

ConanCui commented 6 years ago

Hi, thanks for your detailed reply. Sorry to bother you again, this time with a question about the loss. Currently the model runs on both real and padded tokens, which means it predicts labels and computes the loss for the padded tokens as well. Could this affect the gradients, and the generation results too?

ryokamoi commented 6 years ago

Yes, that's true. "PAD" should be ignored, and I have already applied masking in my local repository. However, I found that masking does not have a major influence on the final results.

If you are worried, please add masking or wait for my future commit. Thanks!
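For anyone reading along, one common way to mask the loss looks like the sketch below. The names and sizes are illustrative, not the exact code from this repository (TF1-style API):

```python
import tensorflow as tf

PAD_ID = 0  # hypothetical pad id

logits = tf.placeholder(tf.float32, [None, None, 10000])  # [batch, time, vocab]
targets = tf.placeholder(tf.int32, [None, None])          # [batch, time]

# Per-token cross-entropy, then zero out PAD positions so padded
# steps contribute nothing to the loss or the gradients.
xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)
mask = tf.cast(tf.not_equal(targets, PAD_ID), tf.float32)
loss = tf.reduce_sum(xent * mask) / tf.reduce_sum(mask)
```

Dividing by the mask sum averages the loss over real tokens only, so sequences of different lengths are weighted consistently.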