Closed: NickShahML closed this issue 7 years ago
By Google corpus I'm guessing you mean the billion word benchmark?
We didn't include the billion word benchmark with VAE as a default since the model is pretty uninteresting. The billion word benchmark is only included in the paper as an additional perplexity experiment, and the VAE version (kappa=25 in table 1) does strictly worse than the straight seq2seq model (kappa=0), probably because there are far fewer edit pairs in billion word compared to Yelp, so it's harder to learn a good encoder/decoder pair.
We'll upload the data + corpus we used + LSH code sometime this weekend or next week.
Thanks @thashim. Yes, I was referring to the Google 1 billion corpus.
It makes sense to me that the 1 billion corpus model is just a plain vanilla seq2seq model (kappa=0). There would be so few edit pairs with a Jaccard distance of <0.5 that it would be much harder to learn how to appropriately handle the edit vector.
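For concreteness, Jaccard distance on token sets is just 1 minus intersection-over-union, and the <0.5 cutoff keeps only pairs that share most of their words. A quick sketch of that filter (the example sentences and the helper name are my own, not from the repo):

```python
def jaccard_distance(s1, s2):
    """1 - |A ∩ B| / |A ∪ B| over the two sentences' token sets."""
    a, b = set(s1.split()), set(s2.split())
    return 1.0 - len(a & b) / len(a | b)

# Made-up example pairs: keep only pairs close enough to count as "edits".
pairs = [
    ("the food was great", "the food was amazing"),
    ("the food was great", "terrible service and long waits"),
]
edit_pairs = [p for p in pairs if jaccard_distance(*p) < 0.5]
# Only the first pair survives the < 0.5 cutoff.
```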
Looking forward to the LSH code. Thanks again for uploading this.
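In the meantime, here's a rough MinHash-banding sketch of how LSH finds candidate edit pairs under Jaccard similarity. This is my own illustration of the general technique, not the code the authors will upload; all function names and parameters are assumptions:

```python
import hashlib

def token_hash(token, seed):
    """Deterministic 32-bit hash of a token under an integer seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def minhash_signature(tokens, num_hashes=8):
    # Each coordinate is the minimum hash under one seed; the chance that two
    # token sets agree on a coordinate equals their Jaccard similarity.
    return tuple(min(token_hash(t, seed) for t in tokens)
                 for seed in range(num_hashes))

def band_keys(signature, band_size=2):
    # LSH banding: sentences that share any band key land in the same bucket
    # and become candidate pairs, so only near-duplicates are compared exactly.
    return [(i, signature[i:i + band_size])
            for i in range(0, len(signature), band_size)]
```

Bucketing all sentences by their band keys and then checking exact Jaccard distance only within buckets avoids the quadratic all-pairs comparison.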
Hey @kelvinguu, big thanks for uploading this code. After reviewing the raw code, I'm confused as to why you set the `kill_edit` flag to True in the Google corpus config, but False in Yelp. With `kill_edit`, the entire set of edit vectors is set to zero, which essentially prevents any edit vector from being used during training. Is the reason you set this to True in the Google corpus that the phrases in the Google corpus have many more edits compared to the Yelp corpus?
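For intuition, here's a minimal sketch of what zeroing the edit vector does; the helper and shapes are illustrative assumptions, not the repo's actual implementation:

```python
def decoder_input(source_hidden, edit_vector, kill_edit=False):
    """Illustrative only: concatenate the encoder state with the edit vector.
    With kill_edit=True the edit vector is zeroed, so the decoder receives no
    edit information and the model reduces to a plain seq2seq baseline."""
    if kill_edit:
        edit_vector = [0.0] * len(edit_vector)
    return source_hidden + edit_vector  # plain-list concatenation

# With kill_edit the edit half of the decoder input is all zeros:
decoder_input([1.0, 2.0], [3.0, 4.0], kill_edit=True)  # [1.0, 2.0, 0.0, 0.0]
```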
Also, do you know when the `data` dir will be uploaded so we can test Neural Editor? Thanks!