Closed: NickShahML closed this issue 7 years ago
By Google corpus I'm guessing you mean the billion word benchmark?
We didn't include the billion word benchmark with VAE as a default since the model is pretty uninteresting. The billion word benchmark is only included in the paper as an additional perplexity experiment, and the VAE version (kappa=25 in table 1) does strictly worse than the straight seq2seq model (kappa=0), probably because there are far fewer edit pairs in billion word compared to Yelp, so it's harder to learn a good encoder/decoder pair.
We'll upload the data + corpus we used + LSH code sometime this weekend or next week.
Thanks @thashim. Yes, I was referring to the Google 1 billion corpus.
It makes sense to me that the 1 billion corpus model is just a plain vanilla seq2seq model (kappa=0). There would be so few edit pairs with a Jaccard distance of <0.5 that it would be much harder to learn how to appropriately handle the edit vector.
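For concreteness, Jaccard distance on token sets is just 1 minus intersection-over-union, and the <0.5 cutoff keeps only pairs that share most of their words. A quick sketch of that filter (the example sentences and the helper name are my own, not from the repo):

```python
def jaccard_distance(s1, s2):
    """1 - |A ∩ B| / |A ∪ B| over the two sentences' token sets."""
    a, b = set(s1.split()), set(s2.split())
    return 1.0 - len(a & b) / len(a | b)

# Made-up example pairs: keep only pairs close enough to count as "edits".
pairs = [
    ("the food was great", "the food was amazing"),
    ("the food was great", "terrible service and long waits"),
]
edit_pairs = [p for p in pairs if jaccard_distance(*p) < 0.5]
# Only the first pair survives the < 0.5 cutoff.
```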
Looking forward to the LSH code. Thanks again for uploading this.
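In the meantime, here's a rough MinHash-banding sketch of how LSH finds candidate edit pairs under Jaccard similarity. This is my own illustration of the general technique, not the code the authors will upload; all function names and parameters are assumptions:

```python
import hashlib

def token_hash(token, seed):
    """Deterministic 32-bit hash of a token under an integer seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def minhash_signature(tokens, num_hashes=8):
    # Each coordinate is the minimum hash under one seed; the chance that two
    # token sets agree on a coordinate equals their Jaccard similarity.
    return tuple(min(token_hash(t, seed) for t in tokens)
                 for seed in range(num_hashes))

def band_keys(signature, band_size=2):
    # LSH banding: sentences that share any band key land in the same bucket
    # and become candidate pairs, so only near-duplicates are compared exactly.
    return [(i, signature[i:i + band_size])
            for i in range(0, len(signature), band_size)]
```

Bucketing all sentences by their band keys and then checking exact Jaccard distance only within buckets avoids the quadratic all-pairs comparison.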
Hey @kelvinguu, big thanks for uploading this code. After reviewing the raw code, I'm confused as to why you set the `kill_edit` flag to True in the Google corpus config, but False in Yelp. With `kill_edit`, the entire set of edit vectors is set to zero, which essentially prevents any edit vector from being used during training. Is the reason you set this to True in the Google corpus that the phrases in the Google corpus have many more edits compared to the Yelp corpus?
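For intuition, here's a minimal sketch of what zeroing the edit vector does; the helper and shapes are illustrative assumptions, not the repo's actual implementation:

```python
def decoder_input(source_hidden, edit_vector, kill_edit=False):
    """Illustrative only: concatenate the encoder state with the edit vector.
    With kill_edit=True the edit vector is zeroed, so the decoder receives no
    edit information and the model reduces to a plain seq2seq baseline."""
    if kill_edit:
        edit_vector = [0.0] * len(edit_vector)
    return source_hidden + edit_vector  # plain-list concatenation

# With kill_edit the edit half of the decoder input is all zeros:
decoder_input([1.0, 2.0], [3.0, 4.0], kill_edit=True)  # [1.0, 2.0, 0.0, 0.0]
```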
Also, do you know when the `data` dir will be uploaded so we can test Neural Editor? Thanks!