hufuss opened this issue 6 years ago
@hufuss I get similar results using Paragram:
It is really frustrating because the results are not even close to the ones reported in the ACL paper. I wonder whether you have found any way to improve the results.
@nmrksic I wonder if you could let us know how to reproduce the numbers in your ACL paper? It seems to me that some hyper-parameters specified in the code (for example, batch_size=512) are not consistent with what is described in your two publications [1] and [2], where you explicitly specify batch_size=256. I am not sure whether there are other inconsistencies, because not all parameters are described in these publications. Could you give us suggestions for reproducing the 81.8% reported in the ACL paper? Thank you!
[1] Fully Statistical Neural Belief Tracking: https://arxiv.org/pdf/1805.11350.pdf
[2] Neural Belief Tracker: Data-Driven Dialogue State Tracking: https://arxiv.org/pdf/1606.03777.pdf
Thanks for pointing this out, and sorry for leaving the codebase in such poor shape.
I ran a small experiment (reported in Footnote 2 of the ACL 2018 paper) and accidentally left that code change in place. Hence, all along, what was being trained was the baseline model, which has very weak performance (and that is what your experiments were reproducing). I have reverted the change and am testing to confirm that performance is back.
My Cambridge account was deactivated after I left, so the vectors were gone. I have moved the Paragram download link to Dropbox, which should persist indefinitely.
Batch size: this was part of the same set of changes I made to run the baseline. One thing worth pointing out: batch size has always had a very minor impact on performance, but it can affect training times (this also depends on the GPU used); see the toy sketch below.
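To make that concrete, here is a toy sketch (the training-set size is made up, for illustration only): halving the batch size roughly doubles the number of gradient updates per epoch, which mostly shows up in wall-clock training time rather than in final accuracy.

num_training_examples = 3000  # hypothetical; for illustration only

for batch_size in (256, 512):
    steps_per_epoch = -(-num_training_examples // batch_size)  # ceiling division
    print("batch_size=%d -> %d gradient updates per epoch"
          % (batch_size, steps_per_epoch))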
As for requests, the second NBT paper does not investigate or report them; this code evolved from the previous version of the model (ACL 2017). I am removing the print statement for requests.
We will run the updated codebase to make sure everything is back in shape. Leaving the issue open until we're sure that's the case.
Again, thanks for your interest in the NBT, and apologies for the inconvenience!
Nikola
@nmrksic Thanks so much!!
Great! Thanks @nmrksic for your fast response!
Performance seems to be back up (note that runs are stochastic, and the ACL 2018 paper reports the average of four runs). Closing the issue; thanks for letting me know about this problem, and let me know if you find any more.
@nmrksic thanks for revising the code.
I ran the revised code, and the joint-goal performance is now close to what is reported in your ACL 2018 paper. However, the new output (see below) does not include the "request" performance.
REQ - slot food 0.99 0.86 0.92
REQ - slot area 0.995 0.948 0.971
REQ - slot price range 0.983 0.953 0.968
{
"food": 0.86,
"joint": 0.835,
"price range": 0.953,
"area": 0.948
}
I've noticed that you added the following two lines of code in the evaluate_woz() function:
if "request" in slot_gj:
del slot_gj["request"]
Commenting out these lines, I do get an output for "request", but it is equal to zero.
REQ - slot food 0.99 0.86 0.92
REQ - slot area 0.995 0.948 0.971
REQ - slot price range 0.983 0.953 0.968
{
"food": 0.86,
"joint": 0.835,
"price range": 0.953,
"request": 0.0,
"area": 0.948
}
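For what it is worth, instead of deleting the metric unconditionally, the deletion inside evaluate_woz() could be gated behind a flag, so the request figure stays visible when wanted. A minimal sketch, using the same slot_gj dictionary as above (the flag name is hypothetical):

show_request_metric = True  # hypothetical flag; set to False to hide the metric again
if not show_request_metric and "request" in slot_gj:
    del slot_gj["request"]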
Can you please advise how I can get a valid performance figure for "request"? In the rule-based NBT tracker (http://aclweb.org/anthology/P17-1163) the "request" performance is 91.6%, and I am interested to see what the performance of the fully statistical NBT is.
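For reference, the definition I have in mind is turn-level exact match of the requested slots, which I believe is the usual DSTC2-style measure; a minimal self-contained sketch (function and variable names are mine):

def request_accuracy(predicted, gold):
    """Fraction of turns where the predicted set of requested slots
    exactly matches the gold set (order-insensitive)."""
    correct = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy example: two of three turns match exactly -> 0.667
print(request_accuracy([["food"], [], ["area", "phone"]],
                       [["food"], ["phone"], ["area", "phone"]]))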
Thanks!
@nmrksic Hi, could you please provide the DSTC2 data and dictionary mentioned in the Neural Belief Tracker paper? The link in the original paper is out of service (mi.eng.cam.ac.uk/~nm480/dstc2-clean.zip). Thanks a lot!
@hufuss Can you please advise how I can get a valid performance for the "request"? Thanks
I have modified the config file to use the GloVe 300-dimensional embeddings as input (since prefix_paragram.txt is missing). However, I get the following error related to the 'request' slot.
It seems that 'request' is missing from the GloVe embeddings. I can bypass this by initializing the 'request' slot manually.
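Roughly along these lines (a minimal sketch; word_vectors is a hypothetical name for the dictionary of loaded GloVe vectors, shown as an empty stub so the snippet runs, and the small uniform random backoff is just one common choice):

import numpy as np

# word_vectors: token -> np.ndarray mapping, as loaded from the GloVe file;
# an empty stub here so the snippet is self-contained.
word_vectors = {}
embedding_dim = 300  # matches the 300-dimensional GloVe vectors

# Back off to a small random vector for tokens absent from GloVe.
if "request" not in word_vectors:
    word_vectors["request"] = np.random.uniform(
        -0.25, 0.25, embedding_dim).astype("float32")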
I am not sure whether this is correct, though; training runs fine, but the results during testing are very low compared with the ones presented in the paper, and the performance on the 'request' slot is 0.0. Also, the performance on joint goals is 22.3%, far below the 81.8% reported in the ACL paper. See my test output on the WOZ data below:
Can you please provide more information on how to run this and get the numbers in the paper?