nmrksic / neural-belief-tracker

Fully Statistical Neural Belief Tracker (Mrkšić and Vulić, ACL 2018)
Apache License 2.0

Issue with running with glove embeddings and test results #2


hufuss commented 6 years ago

I have modified the config file to use the GloVe 300-dimensional embeddings as input (since prefix_paragram.txt is missing). However, I get the following error related to the 'request' slot:

Traceback (most recent call last):
  File "code/nbt.py", line 1950, in <module>
    main()              
  File "code/nbt.py", line 1897, in main
    NBT = NeuralBeliefTracker(config_filepath)
  File "code/nbt.py", line 1751, in __init__
    slot_vectors[value_idx, :] = word_vectors[slot]
KeyError: u'request'

It seems that the 'request' is missing from the glove embeddings. I can bypass this by initializing the 'request' slot as follows:


    slot_vectors = numpy.zeros((len(dialogue_ontology[slot]), 300), dtype="float32")
    value_vectors = numpy.zeros((len(dialogue_ontology[slot]), 300), dtype="float32")

    # START of added part to handle the request slot
    word_vectors[unicode(slot)] = xavier_vector(unicode(slot))
    print "-- Generating word vector for:", slot.encode("utf-8"), ":::", numpy.sum(word_vectors[slot])
    # END of added part to handle the request slot

    for value_idx, value in enumerate(dialogue_ontology[slot]):

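A slightly more general version of the same bypass (just a sketch; ensure_vector is a name I made up) would back off to xavier_vector for any ontology token missing from the loaded embeddings, not only 'request':

    def ensure_vector(token, word_vectors):
        # Return the embedding for token, generating a vector via the
        # xavier_vector helper used above when the token is missing.
        if token not in word_vectors:
            word_vectors[token] = xavier_vector(token)
        return word_vectors[token]

    # e.g. the line from the traceback would become:
    # slot_vectors[value_idx, :] = ensure_vector(slot, word_vectors)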
I am not sure if this bypass is correct, though; training runs fine, but the test results are very low compared with the ones presented in the paper, and the performance on the 'request' slot is 0.0. Also, the performance on joint goals is 22.3%, far below the 81.8% reported in the ACL paper. See my test output on the WOZ data below:

WOZ evaluation using language: english en
----------- Loading Model ./models/CNN_en_False_woz_food_woz_stat_update_1.0.ckpt  ----------------
----------- Loading Model ./models/CNN_en_False_woz_price range_woz_stat_update_1.0.ckpt  ----------------
----------- Loading Model ./models/CNN_en_False_woz_area_woz_stat_update_1.0.ckpt  ----------------
----------- Loading Model ./models/CNN_en_False_woz_request_woz_stat_update_1.0.ckpt  ----------------
0 / 400 done.
100 / 400 done.
200 / 400 done.
300 / 400 done.
REQ - slot food 0.977 0.27 0.423
REQ - slot area 0.937 0.378 0.538
REQ - slot price range 0.947 0.434 0.595
{
    "food": 0.27, 
    "joint": 0.223, 
    "price range": 0.434, 
    "request": 0.0, 
    "area": 0.378
}
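
(As an aside, the third number on each "REQ - slot" line appears to be the harmonic mean, i.e. the F1, of the first two; this is just my observation, not something documented. A quick check for the food line:)

    def harmonic_mean(p, r):
        # F1 is the harmonic mean of precision and recall.
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    print(round(harmonic_mean(0.977, 0.27), 3))  # 0.423, matching the food line up to rounding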

Can you please provide more information on how to run this and get the numbers in the paper?

liutianlin0121 commented 6 years ago

@hufuss I get similar results using paragram:

[screenshot: orig_paragram_result]

It is really frustrating that the results are not even close to the ones reported in the ACL paper. Have you found any way to improve the results?

@nmrksic I wonder if you could let us know how to reproduce the numbers in your ACL paper? It seems to me that some hyper-parameters specified in the code (for example, batch_size = 512) are not consistent with what is described in your two publications [1] and [2], where you explicitly specify batch_size = 256. I am not sure whether there are other inconsistencies, since not all parameters are described in these publications. Could you give us suggestions for reproducing the 81.8% reported in the ACL paper? Thank you! (A quick way to check what your config actually sets is sketched after the references.)

[1] https://arxiv.org/pdf/1805.11350.pdf
[2] https://arxiv.org/pdf/1606.03777.pdf
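
For anyone double-checking their own setup, the batch size comes from the experiment config file that nbt.py is pointed at. A minimal way to inspect it; the path, section, and key names here are my assumptions, so adjust them to the actual layout of your config file:

    import ConfigParser  # Python 2, matching the codebase

    config = ConfigParser.RawConfigParser()
    config.read("config/woz2.cfg")  # hypothetical path to your experiment config
    print(config.get("train", "batch_size"))  # assumed section/key names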

nmrksic commented 6 years ago

Thanks for pointing this out, and sorry for leaving the codebase in such poor shape.

  1. I ran a small experiment (reported in Footnote 2 of the ACL 2018 paper) and left the code change in place. Hence, all along, what was being trained was the baseline model, which has very weak performance (and which your experiments are reproducing). I have reverted the change and am testing to confirm that the performance is back.

  2. My Cambridge account was deactivated after I left, so the vectors were gone. I have moved the Paragram download link to Dropbox, which should persist indefinitely.

  3. Batch size: this was part of the same changes I made to run the baseline. One thing worth pointing out: batch size has always had a very minor impact on performance, but it can affect training time (this also depends on the GPU used).

  4. As for requests, the second NBT paper doesn't investigate or report them; this code evolved from the previous version of the model (ACL 2017). I am removing the print statement for requests.

We will run the updated codebase to make sure everything is back in shape. Leaving the issue open until we're sure that's the case.

Again, thanks for your interest in the NBT, and apologies for the inconvenience!

Nikola

liutianlin0121 commented 6 years ago

@nmrksic Thanks so much!!

hufuss commented 6 years ago

Great! Thanks @nmrksic for your fast response!

nmrksic commented 6 years ago

Performance seems to be back up (note that runs are stochastic, and the ACL 2018 paper reports the average of four runs). Closing the issue; thanks for letting me know about this problem, and let me know if you find any more.
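
For anyone comparing against the paper, average your own runs as well. A trivial sketch; the per-run values below are placeholders, not numbers from the paper:

    import numpy

    # Hypothetical joint-goal accuracies from four independent runs.
    joint = [0.835, 0.812, 0.828, 0.820]
    print("joint goals: %.3f +/- %.3f" % (numpy.mean(joint), numpy.std(joint)))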

hufuss commented 6 years ago

@nmrksic thanks for revising the code.

I ran the revised code, and the joint-goals performance is now close to what is reported in your ACL 2018 paper. However, the new output (see below) does not include the "request" performance.

REQ - slot food 0.99 0.86 0.92
REQ - slot area 0.995 0.948 0.971
REQ - slot price range 0.983 0.953 0.968
{
    "food": 0.86, 
    "joint": 0.835, 
    "price range": 0.953, 
    "area": 0.948
}

I've noticed that you added the following two lines of code in the evaluate_woz() function:

 if "request" in slot_gj:
        del slot_gj["request"]

Commenting out these lines, I get an output for "request", but it is equal to zero:

REQ - slot food 0.99 0.86 0.92
REQ - slot area 0.995 0.948 0.971
REQ - slot price range 0.983 0.953 0.968
{
    "food": 0.86, 
    "joint": 0.835, 
    "price range": 0.953, 
    "request": 0.0, 
    "area": 0.948
}
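
(For now I am toggling the deletion with a flag instead of commenting the lines in and out; just a sketch, where report_requests is a name I made up:)

    report_requests = True  # set to False to reproduce the current output
    if not report_requests and "request" in slot_gj:
        del slot_gj["request"]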

Can you please advise how I can get a valid performance figure for "request"? In the rule-based NBT tracker (http://aclweb.org/anthology/P17-1163) the "request" performance is 91.6%, and I'm interested to see what the performance of the fully statistical NBT is.

Thanks!

hydercps commented 5 years ago

@nmrksic Hi, could you please provide the DSTC2 data and dictionary mentioned in the Neural Belief Tracker paper? The link in the original paper is out of service (mi.eng.cam.ac.uk/~nm480/dstc2-clean.zip). Thanks a lot!

sc1054 commented 4 years ago

@hufuss Can you please advise how I can get a valid performance figure for the "request" slot? Thanks!