D3ST leave-one-out setting outputs descriptions from the left-out domain

WeixuanZ commented 1 year ago

I'm wondering why the descriptions of slots from all domains are included in the prompt when training in the leave-one-out setting: https://github.com/google-research/task-oriented-dialogue/blob/c19be7593fc67299c8ee15634f47bfba452d970d/state_tracking/d3st/create_multiwoz_schemaless_data.py#L126-L186

The blocked_domains argument seems only to skip turns entirely, without affecting the prompts in other turns: https://github.com/google-research/task-oriented-dialogue/blob/c19be7593fc67299c8ee15634f47bfba452d970d/state_tracking/d3st/create_multiwoz_schemaless_data.py#L246-L247

Using the following args inferred from the D3ST paper:

            "args": [
                "--multiwoz_dir", "data/raw/multiwoz/",
                "--output_dir", "tmp/",
                "--schema_file", "data/raw/multiwoz/schema.json",
                "--multiwoz_version", "2.4",
                "--description_type", "full_desc_with_domain",
                "--delimiter", "=",
                "--multiple_choice", "1a",
                "--blocked_domains", "hotel"
            ],

an example output for pmul1181.json turn 1 is

'0=attraction-name of the attraction 1=restaurant-price budget for the restaurant 1a) expensive 1b) cheap 1c) moderate 2=train-number of people booking for train 3=hotel-parking facility at the hotel 3a) yes 3b) free 3c) no 4=hotel-what is the type of the hotel 4a) guesthouse 4b) hotel 5=train-leaving time for the train 6=restaurant-time of the restaurant booking 7=train-day of the train 7a) sunday 7b) wednesday 7c) saturday 7d) tuesday 7e) thursday 7f) friday 7g) monday 8=restaurant-number of people booking the restaurant 9=taxi-leaving time of taxi 10=bus-destination of taxi 11=bus-day to use the bus tickets 11a) wednesday 12=train-destination of the train 12a) cambridge 12b) norwich 12c) london liverpool street 12d) kings lynn 12e) stansted airport 12f) peterborough 12g) ely 12h) birmingham new street 12i) broxbourne 12j) leicester 12k) bishops stortford 12l) stevenage 12m) london kings cross 13=restaurant-day of the restaurant booking 13a) tuesday 13b) saturday 13c) friday 13d) thursday 13e) monday 13f) wednesday 13g) sunday 14=restaurant-name of the restaurant 15=attraction-type of the attraction 15a) multiple sports 15b) park 15c) nightclub 15d) theatre 15e) cinema 15f) entertainment 15g) college 15h) boat 15i) swimmingpool 15j) museum 15k) concerthall 15l) architecture 16=hotel-number of people for the hotel booking 17=taxi-destination of taxi 18=hotel-star rating of the hotel 19=restaurant-area or place of the restaurant 19a) east 19b) centre 19c) north 19d) west 19e) south 20=taxi-arrival time of taxi 21=train-arrival time of the train 22=bus-leaving time of bus 23=hospital-name of hospital department 24=hotel-area or place of the hotel 24a) south 24b) north 24c) east 24d) centre 24e) west 25=hotel-internet option at the hotel 25a) no 25b) yes 25c) free 26=bus-departure location of bus 27=hotel-name of the hotel 28=attraction-area or place of the attraction 28a) north 28b) west 28c) south 28d) east 28e) centre 29=hotel-length of stay at the hotel 
30=hotel-day of the hotel booking 30a) tuesday 30b) sunday 30c) monday 30d) saturday 30e) thursday 30f) friday 30g) wednesday 31=taxi-departure location of taxi 32=hotel-price budget of the hotel 32a) expensive 32b) cheap 32c) moderate 33=restaurant-food type for the restaurant 34=train-departure location of the train 34a) peterborough 34b) london liverpool street 34c) bishops stortford 34d) london kings cross 34e) broxbourne 34f) ely 34g) stevenage 34h) norwich 34i) birmingham new street 34j) kings lynn 34k) stansted airport 34l) leicester 34m) cambridge [user] howdy, i need a train heading into cambridge.'

Notice that slot descriptions from the hotel domain are included, even though hotel is the blocked domain.

In my opinion, it's more logical not to train on these schema descriptions, since they wouldn't be available if a new service is introduced in the real world. Would appreciate some insight on this.
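A minimal sketch of the behaviour this would imply, assuming the schema has been flattened into (domain, description) pairs. The `build_prefix` helper and its arguments are hypothetical and are not taken from create_multiwoz_schemaless_data.py; multiple-choice options are omitted for brevity.

```python
def build_prefix(slots, blocked_domains, delimiter="="):
    """Build a D3ST-style prefix, dropping slots of blocked domains.

    `slots` is a list of (domain, description) pairs. This structure is
    an assumption made for illustration, not the script's data model.
    """
    kept = [(d, desc) for d, desc in slots if d not in blocked_domains]
    # Indices are assigned after filtering, so they stay contiguous.
    return " ".join(
        f"{i}{delimiter}{domain}-{desc}" for i, (domain, desc) in enumerate(kept)
    )


slots = [
    ("train", "destination of the train"),
    ("hotel", "parking facility at the hotel"),
    ("restaurant", "name of the restaurant"),
]
prefix = build_prefix(slots, blocked_domains={"hotel"})
print(prefix)
# 0=train-destination of the train 1=restaurant-name of the restaurant
```

With this kind of filtering, no hotel description reaches the model during training, which is closer to the situation where a genuinely new service is introduced.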

descrip commented 1 year ago

Oops, this is an issue with the data we generated --- good catch. I definitely agree that a domain's schema descriptions should be removed if that domain is blocked. I suspect that this might be hurting our performance in the leave-one-domain-out setting, since the model may learn that these slots should have empty predictions.

alexcoca commented 1 year ago

@descrip thanks so much for your reply!

@WeixuanZ and I successfully replicated the D3ST (T5-base) results on MultiWOZ 2.4 (Table 1a). However, we cannot replicate the results in Table 4a. The paper references works that do cross-domain evaluation but are iterative. It does not detail the prefix structure used in training/evaluation/testing, and we believe our inability to match your results arises from such differences and from differences in the metrics computation. We aim to elucidate them below.

For the sake of simplicity, let's assume that the entire MultiWOZ ontology contains 24 slots. We train on 4 domains, which we assume have 19 slots. The 5th domain, left out, has 5 slots.

Training logic:

The prefix, call it A, always contains 19 slots, indexed 0 - 18. We fixed the bug mentioned in this issue but otherwise used your pre-processing code to train our model with this prefix. Thus, we ignore the turns whose belief state annotations contain slots from the unseen domain.

Test logic:

There are two prefixes we can use:

B. The prefix is the entire ontology. This means that instead of 19, we have 24 slots in the prefix, indexed 0 - 23. The same prefix was used in training, eval & testing of the baseline model (Table 1a)

C. The prefix is a concatenation of the unseen domain's descriptions alone. This means we have 5 slots, indexed 0 - 4, achieved by blocking all domains except the left-out domain using your preprocessing script. This is similar to the logic used by Lin et al., which you cite.
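To make the three prefixes concrete, here is a small sketch using the simplified counts above (19 seen slots, 5 unseen). The per-domain slot counts, the placeholder slot names and the `prefix_entries` helper are all made up for illustration; they are not the real MultiWOZ ontology or the D3ST preprocessing code.

```python
# Placeholder counts chosen only so that the seen domains sum to 19 slots
# and the left-out domain has 5; these are NOT the real MultiWOZ counts.
SLOT_COUNTS = {"restaurant": 6, "attraction": 4, "taxi": 4, "hotel": 5, "train": 5}
ONTOLOGY = {
    domain: [f"slot{k}" for k in range(n)] for domain, n in SLOT_COUNTS.items()
}
LEFT_OUT = "train"


def prefix_entries(domains, delimiter="="):
    """Return indexed prefix entries for the slots of the given domains."""
    descs = [f"{d}-{s}" for d in ONTOLOGY if d in domains for s in ONTOLOGY[d]]
    return [f"{i}{delimiter}{desc}" for i, desc in enumerate(descs)]


seen = set(ONTOLOGY) - {LEFT_OUT}
prefix_a = prefix_entries(seen)           # A: seen domains only, indices 0-18
prefix_b = prefix_entries(set(ONTOLOGY))  # B: entire ontology, indices 0-23
prefix_c = prefix_entries({LEFT_OUT})     # C: left-out domain only, indices 0-4

print(len(prefix_a), len(prefix_b), len(prefix_c))  # 19 24 5
```

Note that the same slot generally receives a different index under A, B and C, so predictions must be decoded against the prefix that was actually used to prompt the model.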

Metrics computation

Two metrics of interest can be:

(1) joint accuracy with respect to all domains that were seen in training

(2) joint accuracy with respect to the unseen domain

For dialogues where both seen and unseen domains occur, some of the belief sequences will contain predictions from both seen and unseen domains:

To obtain (1), we could use either prefix A or B. If we use prefix B, we ignore all slots from the unseen domain and compare the joint predictions for the seen domains alone: all slots from all seen domains need to be correct for the turn to contribute to an increase of the JGA.
To obtain (2), if we use prefix B, we ignore any predictions from seen domains.

If we prompt with prefix C, then we run into the issue of whether to consider all turns in the test set or only those that have belief annotations from the unseen domain. With prefix C, the JGA will increase if the model predicts an empty state for turns whose belief annotations contain only seen domains. This does indicate that the model recognises when the prefix domain is unrelated to the dialogue domain, but it is not very indicative of its state tracking performance.
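The two metrics, and the effect of the turn-selection choice, can be sketched as follows. This is an illustrative reimplementation under the assumption that states are dicts keyed by "domain-slot"; it is not the evaluation code linked below, and the `restrict` and `joint_goal_accuracy` names are hypothetical.

```python
def restrict(state, domains):
    """Keep only the slots of `state` that belong to `domains`."""
    return {s: v for s, v in state.items() if s.split("-", 1)[0] in domains}


def joint_goal_accuracy(predictions, references, domains, require_active=False):
    """JGA over `domains`; optionally skip turns where none of them is annotated."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        ref_d = restrict(ref, domains)
        if require_active and not ref_d:
            continue  # no annotation from `domains` in this turn
        total += 1
        correct += int(restrict(pred, domains) == ref_d)
    return correct / total if total else 0.0


# Toy data: restaurant is seen, train is the left-out domain.
references = [
    {"restaurant-food": "italian"},
    {"restaurant-food": "italian", "train-day": "monday"},
    {"restaurant-food": "italian", "train-day": "monday",
     "train-destination": "cambridge"},
]
predictions = [
    {"restaurant-food": "italian"},
    {"restaurant-food": "italian", "train-day": "monday"},
    {"restaurant-food": "chinese", "train-day": "monday",
     "train-destination": "ely"},
]

seen_jga = joint_goal_accuracy(predictions, references, {"restaurant"})  # metric (1)
unseen_all = joint_goal_accuracy(predictions, references, {"train"})     # metric (2), all turns
unseen_active = joint_goal_accuracy(
    predictions, references, {"train"}, require_active=True
)  # metric (2), only turns with the unseen domain annotated
print(seen_jga, unseen_all, unseen_active)
```

In this toy example, counting the first turn (where the empty train state is trivially correct) raises metric (2) from 1/2 to 2/3, which is the inflation described above.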

Our implementation of the evaluation logic can be found at Tomiinek/MultiWOZ_Evaluation#18.

Our intuition is that:

you trained with a prefix containing the description of all seen domains (A above)
in testing, you prompted with prefix C and reported (2) in Table 4a. We are unsure whether you made predictions for a) all testing examples or b) only examples that contain the unseen domain in the annotations. I suppose b) (i.e., you blocked all domains except the unseen domain during testing).

descrip commented 1 year ago

Hi Alex and Weixuan,

Thanks for your work on reproducing these results; this is very detailed. I took a look at the generated data we used to evaluate on the cross-domain setting, and the prefix is C for both training and testing. During training, it looks like we only use the domains that are active for the current turn, and during testing we only evaluate on turns that have the left-out domain active. I copied some examples from the data below:

E.g. in the cross-domain setting, leaving out the train domain. A training example (for restaurant):

"0:restaurant-area or place of the restaurant 0a) west 0b) north 0c) east 0d) south 0e) centre 1:restaurant-number of people booking the restaurant 2:restaurant-time of the restaurant booking 3:attraction-name of the attraction 4:restaurant-price budget for the restaurant 4a) cheap 4b) expensive 4c) moderate 5:restaurant-name of the restaurant 6:attraction-area or place of the attraction 6a) east 6b) west 6c) centre 6d) south 6e) north 7:restaurant-day of the restaurant booking 7a) saturday 7b) tuesday 7c) wednesday 7d) monday 7e) friday 7f) thursday 7g) sunday 8:restaurant-food type for the restaurant 9:attraction-type of the attraction [user] i am looking for something to do in cambridge . i am interested in architecture , is there any interesting buildings around the city centre ? [system] i think i can help you with that . there are several historic colleges and churches to see . many offer free admission . [user] i would like to take a look at some of the older churches , what are some of the entrance fees for those ? [system] the churches are free . [user] please give me their phone number . i also need a restaurant that serves north american food [system] here are the numbers for 2 churches , all saints church01223452587 , holy trinity church01223355397 . there are 3 more if needed . a good restaurant is gourmet burger kitchen . [user] thank you goodbye [system] thank you for using our services ! [user] oh , wait ! if the gourmet burger kitchen is in the same area as the churches , could i get the phone number for that , too ? [system] certainly ! the phone number is: 01223312598 is there anything else we can assist with today ? [user] yes may i please have the address to gourmet burger kitchen and can you book reservations for 2 for 13:00

[states] 1:2 2:13:00 5:gourmet burger kitchen 6:6c 8:north american 9:architecture [intents] [req_slots]

A corresponding evaluation example on the train domain:

0:train-arrival time of the train 1:train-departure location of the train 2:train-number of people booking for train 3:train-leaving time for the train 4:train-day of the train 4a) wednesday 4b) friday 4c) sunday 4d) tuesday 4e) saturday 4f) thursday 4g) monday 5:train-destination of the train [user] yes , i need assistance with finding a train for my trip . [system] there are many options available to you . what destination did you have in mind ? [user] i need to get to cambridge and i'll be departing from ely on monday . [system] we have many options available to you . is there a certain time you are wanting to leave ? [user] i would like to leave any time after 9:15 on monday . [system] how about leaving at 9:35 ? [user] that would work , can i get the travel time on that ?

[states] 1:ely 3:09:15 4:4g 5:cambridge [intents] [req_slots]
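For reference, a target like the one above can be mapped back to slot descriptions and values along these lines. This is a rough sketch that assumes the ":" delimiter and the "4a) value" option format shown in the example; the regular expressions and helper names are hypothetical and are not part of the D3ST code. It would mis-parse a free-text value that itself contains a space followed by digits and a colon.

```python
import re

# "<index>:<text>" entries, each running up to the next " <index>:" or the end.
SLOT_RE = re.compile(r"(?:^|\s)(\d+):(.*?)(?=\s\d+:|$)")
# "<index><letter>) <value>" multiple-choice options inside a slot entry.
CHOICE_RE = re.compile(r"(\d+[a-z])\) (.*?)(?=\s\d+[a-z]\)|$)")


def parse_prefix(prefix):
    """Return ({index: description}, {option id: option value})."""
    slots, choices = {}, {}
    for index, body in SLOT_RE.findall(prefix):
        first = CHOICE_RE.search(body)
        slots[index] = (body[: first.start()] if first else body).strip()
        choices.update(CHOICE_RE.findall(body))
    return slots, choices


def decode_states(target, slots, choices):
    """Map the "[states]" section of a target back to descriptions and values."""
    states = target.split("[states]", 1)[1].split("[intents]", 1)[0].strip()
    return {
        slots[index]: choices.get(value, value)
        for index, value in SLOT_RE.findall(states)
        if index in slots  # ignore indices that are not in the prefix
    }


PREFIX = (
    "0:train-arrival time of the train "
    "1:train-departure location of the train "
    "2:train-number of people booking for train "
    "3:train-leaving time for the train "
    "4:train-day of the train 4a) wednesday 4b) friday 4c) sunday "
    "4d) tuesday 4e) saturday 4f) thursday 4g) monday "
    "5:train-destination of the train"
)
TARGET = "[states] 1:ely 3:09:15 4:4g 5:cambridge [intents] [req_slots]"

slots, choices = parse_prefix(PREFIX)
decoded = decode_states(TARGET, slots, choices)
print(decoded["train-day of the train"])  # monday
```

Because the indices are only meaningful relative to the prefix, decoding has to use the same prefix (A, B or C) that was used to prompt the model.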

It's possible that this is not what the checked-in data scripts do --- we may have hacked this in order to be in line with Lin et al. Actually, these data scripts were refactored before we checked them in, and it's also possible a bug was introduced. But I do remember that we paid special attention to make sure our setup was the same as that of Lin et al.

Please let me know if you're able to reproduce Table 4a with this setup. We also modified the JGA calculation for the train domain --- please see footnote 6. We used T5 1.1 Large. I'll also check with the person who ran these experiments to make sure.

alexcoca commented 1 year ago

Hi @descrip, thank you so much. We will run these experiments and report back. For fair comparison with other works we will use T5 1.1 base (and not large).
