Thank you for your interest in our work!
The negative samples are randomly selected from the answer pool, specifically the file "answers.label.token_idx". For each QA pair, we select a fixed number of negative samples (49 in our experiments), and we re-sample the negatives at every epoch.
The negative samples in the test set are the ones provided by the InsuranceQA dataset: https://github.com/shuzi/insuranceQA/tree/master/V1
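A minimal sketch of this sampling scheme (names are illustrative and not from the released code; `answer_pool` stands for the answer IDs in "answers.label.token_idx"):

```python
import random

NUM_NEG = 49  # negatives per QA pair, as in our experiments

def sample_negatives(gold_answer_ids, answer_pool, num_neg=NUM_NEG):
    """Randomly draw negatives from the whole answer pool,
    skipping any answer that is a gold answer for this question."""
    negatives = []
    while len(negatives) < num_neg:
        candidate = random.choice(answer_pool)
        if candidate not in gold_answer_ids:
            negatives.append(candidate)
    return negatives

# Re-sample at the start of every epoch:
# for epoch in range(num_epochs):
#     for question, gold_ids in train_pairs:
#         negatives = sample_negatives(gold_ids, answer_pool)
```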
Thanks for your reply. Did you try other good/bad ratios, and did they show a clear difference?
My good/bad ratio is 1:1, and I trained with two different losses: a margin loss (margin = 0.05) and a softmax loss (classifying 1/0). There is a huge difference between the scores: margin gives 0.656, but softmax gives 0.176.
My explanation is that the distinction between a good pair and a random bad pair may be small, so regressing to 1 or 0 may confuse the model and make the parameters oscillate.
This is also why I'm curious about your method for generating bad samples. After your reminder, I think the key factor may be the good/bad ratio. What do you think?
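To make the comparison concrete, here is a rough NumPy sketch of the two losses I mean, written per pair (the hinge form is my reading of "margin loss"; the function names are mine):

```python
import numpy as np

def margin_loss(score_pos, score_neg, margin=0.05):
    # pairwise hinge: push the good pair's score above the
    # bad pair's score by at least the margin
    return max(0.0, margin - score_pos + score_neg)

def binary_ce_loss(prob_pos, prob_neg):
    # pointwise cross-entropy: regress good pairs to 1, bad pairs to 0
    return -np.log(prob_pos) - np.log(1.0 - prob_neg)
```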
Yes, I tried 1:10 before and the performance got worse.
Another thing is that we didn't treat it as a binary classification problem, but as something like a 50-class classification problem, which behaves similarly to a margin loss. You may refer to Eqn. (10) in the paper for more details.
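Concretely, the 1 positive answer and the 49 sampled negatives are scored together, and the loss is a cross-entropy over the 50 candidates. A rough NumPy sketch (not the actual code; `scores` stands for the model's output for the 50 candidates):

```python
import numpy as np

def listwise_loss(scores, pos_index=0):
    """scores: length-50 array with one score per candidate
    (1 positive + 49 sampled negatives). The loss is the negative
    log-probability of the positive under a softmax over all 50."""
    scores = scores - scores.max()  # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[pos_index]
```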
Thanks for sharing your knowledge. To be honest, I thought the highest score was 0.71 before reading your paper. When I found your paper had pushed the score to 0.77, I felt hopeless. But I sincerely appreciate and admire your work. Best wishes to you!
@shuohangwang Sorry to bother you again. I notice that your SNLI and WikiQA models are different, especially in the model structure and the loss (one is cross-entropy, the other is KL divergence). To be honest, I implemented your WikiQA model in TensorFlow to handle InsuranceQA but couldn't match the score in your paper, so I suspect the InsuranceQA model also differs. Could you please share your InsuranceQA code, or the details of the model structure and parameter settings?
The structure is the same as for WikiQA, but you need to do negative sampling at each iteration, as I mentioned before.
The word embeddings are initialized from GloVe and are not updated during training. Words not found in GloVe are initialized as zero vectors.
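A sketch of that initialization (illustrative names, not the released code; `dim` is whichever GloVe dimension is used):

```python
import numpy as np

def build_embedding_matrix(vocab, glove, dim):
    """vocab: word -> row index; glove: word -> pre-trained vector.
    Rows for words found in GloVe are copied over; out-of-vocabulary
    words stay at zero. The matrix is kept frozen during training."""
    emb = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, idx in vocab.items():
        if word in glove:
            emb[idx] = glove[word]
    return emb
```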
The dimensionality l of the hidden layers is set to be 150. We use ADAMAX (Kingma & Ba, 2015) with the coefficients β1 = 0.9 and β2 = 0.999 to optimize the model. The batch size is set to be 30 and the learning rate is 0.002. We do not use L2-regularization. The hyper-parameter we tuned is the dropout on the embedding layer. The convolutional windows we used are [1,2,3,4,5].
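For reference, the same settings gathered into one (hypothetical) config dict:

```python
config = {
    "hidden_dim": 150,           # dimensionality l of the hidden layers
    "optimizer": "Adamax",       # Kingma & Ba (2015)
    "beta1": 0.9,
    "beta2": 0.999,
    "batch_size": 30,
    "learning_rate": 0.002,
    "l2_weight": 0.0,            # no L2-regularization
    "embedding_dropout": None,   # the one hyper-parameter we tuned
    "conv_windows": [1, 2, 3, 4, 5],
}
```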
Please let me know if you still can't reproduce it! Thanks!
As I understand it, the way you predict the correct answer in the paper "A Compare-Aggregate Model for Matching Text Sequences" is by classifying the probability of an answer being right or wrong. I know how to generate the correctly labeled pair samples, but I'm not sure how you generate the bad samples in your paper. In a traditional QA system, we pick a random answer from the answer set as the bad answer, but I think that method only suits a margin loss and is not suitable for cross-entropy. So please share your method for generating bad samples, thanks!