katesanders9 opened 11 months ago
When the task is split into three separate prompts, each provided with the entire transcript, ChatGPT is able to generate NLI hypotheses and correctly label them with judgments and evidence.
DIALOGUE:
ROSS: Hey hey, we'll be fine. We're just gonna need a little time.
MONICA: I understand.
ALAN: Wow.
MONICA: I'm, I'm really sorry.
ALAN: Yeah, I'm sorry too. But, I gotta tell you, I am a little relieved.
MONICA: Relieved?
ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.
RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.
ROSS: Yeah. He could row like a viking.
MONICA: Hi.
ALL: Mmm.
ROSS: So how'd it go?
MONICA: Oh, y'know..
PHOEBE: Did he mention us?
MONICA: He said he's really gonna miss you guys.
ROSS: You had a rough day, huh.. c'mere.
CHANDLER: ...That's it. I'm getting cigarettes.
ALL: No no no!
CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: If you never smoke again I'll give you seven thousand dollars!
CHANDLER: Yeah, alright.
PROMPT:
Write a set of 9 natural language inference (NLI) test hypotheses using this transcript as the context. 3 of the hypotheses should be ENTAILED by the transcript, while 3 should be directly CONTRADICTED, and the last 3 should be NEITHER directly entailed nor contradicted. Your output format is a serialized json item on a single line with the following format: {"ENTAILMENT": ["<hypothesis>", ...], "CONTRADICTION": ["<hypothesis>"...], "NEITHER": ["<hypothesis>",...]} and nothing else.
OUTPUT:
{"ENTAILMENT": [
"Monica feels sorry for something that happened.",
"Alan had a good time with Monica and her friends.",
"Chandler promised not to smoke again after Phoebe's offer."
],
"CONTRADICTION": [
"Ross enjoyed rowing the boats in Central Park.",
"Alan likes all of Monica's friends.",
"Chandler successfully quit smoking."
],
"NEITHER": [
"Ross and Monica rented boats at the beach.",
"Phoebe is disappointed in Chandler's decision.",
"Rachel dislikes Central Park."
]
}
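Since the downstream passes consume this JSON directly, it is worth validating a first-pass response before use. A minimal sketch (the helper name and the strictness checks are mine, not from the pipeline above):

```python
import json

EXPECTED_LABELS = {"ENTAILMENT", "CONTRADICTION", "NEITHER"}

def parse_first_pass(raw: str) -> dict:
    """Parse the single-line JSON from the first-pass prompt and check
    that it contains exactly the three NLI labels, 3 hypotheses each."""
    data = json.loads(raw)
    if set(data) != EXPECTED_LABELS:
        raise ValueError(f"unexpected labels: {sorted(data)}")
    for label, hyps in data.items():
        if len(hyps) != 3:
            raise ValueError(f"{label}: expected 3 hypotheses, got {len(hyps)}")
    return data
```

Rejecting malformed generations up front avoids silently propagating them into the second and third passes.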
The correct labels for these hypotheses (I think?) are:
E - "Monica feels sorry for something that happened.",
E - "Alan had a good time with Monica and her friends.",
E - "Chandler promised not to smoke again after Phoebe's offer."
E - "Ross enjoyed rowing the boats in Central Park.",
C - "Alan likes all of Monica's friends.",
N - "Chandler successfully quit smoking."
C - "Ross and Monica rented boats at the beach.",
N - "Phoebe is disappointed in Chandler's decision.",
C - "Rachel dislikes Central Park."
SECOND PASS PROMPT:
Given the provided transcript, label each hypothesis with one of the following natural language inference judgments: ENTAILMENT, CONTRADICTION, or NEITHER. Your output format is a serialized json item on a single line with the following format: {"<hypothesis>": "<judgment>", "<hypothesis>": "<judgment>", ...} and nothing else.
OUTPUT:
{"Monica feels sorry for something that happened.": "ENTAILMENT",
"Alan had a good time with Monica and her friends.": "ENTAILMENT",
"Chandler promised not to smoke again after Phoebe's offer.": "ENTAILMENT",
"Ross enjoyed rowing the boats in Central Park.": "ENTAILMENT",
"Alan likes all of Monica's friends.": "CONTRADICTION",
"Chandler successfully quit smoking.": "NEITHER",
"Ross and Monica rented boats at the beach.": "CONTRADICTION",
"Phoebe is disappointed in Chandler's decision.": "NEITHER",
"Rachel dislikes Central Park.": "CONTRADICTION"}
Note: These have to be separated into "entailment" and "contradiction" prompts for the outputs to be appropriate.
THIRD PASS (ENTAILMENT) PROMPT:
Given the provided transcript, label each hypothesis with the transcript lines that make it true. Your output format is a serialized json item on a single line with the following format: {"<hypothesis>": ["<line 1>", "<line 2>", ... ], "<hypothesis>": ["<line 1>", "<line 2>", ... ], ...} and nothing else.
OUTPUT:
{
"Monica feels sorry for something that happened.": ["MONICA: I'm, I'm really sorry."],
"Alan had a good time with Monica and her friends.": ["ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends."],
"Chandler promised not to smoke again after Phoebe's offer.": ["CHANDLER: If you never smoke again I'll give you seven thousand dollars!", "CHANDLER: Yeah, alright."],
"Ross enjoyed rowing the boats in Central Park.": ["ROSS: Yeah. He could row like a viking."]
}
THIRD PASS (CONTRADICTION) PROMPT:
Given the provided transcript, label each hypothesis with the transcript lines that make it false. Your output format is a serialized json item on a single line with the following format: {"<hypothesis>": ["<line 1>", "<line 2>", ... ], "<hypothesis>": ["<line 1>", "<line 2>", ... ], ...} and nothing else.
OUTPUT:
{
"Alan likes all of Monica's friends.": ["Yeah, well, I had a great time with you.. I just can't stand your friends."],
"Ross and Monica rented boats at the beach.": ["Remember when we went to Central Park and rented boats?.. That was fun."],
"Rachel dislikes Central Park.": ["Remember when we went to Central Park and rented boats?.. That was fun."]
}
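Note that the contradiction-pass output drops the `SPEAKER:` prefixes from the quoted lines. If the evidence strings need to be mapped back to transcript line indices, a substring match handles both forms; a sketch (the function is my own, not part of the prompts above):

```python
def ground_evidence(transcript_lines, evidence):
    """Map each hypothesis's quoted evidence strings back to transcript
    line indices. Substring matching tolerates dropped speaker tags."""
    grounded = {}
    for hyp, quotes in evidence.items():
        grounded[hyp] = [
            i for i, line in enumerate(transcript_lines)
            if any(q in line for q in quotes)
        ]
    return grounded
```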
Label distribution after the second pass (the first pass produced an even split of 6 per category per show): 42% entailments, 33% neither, 25% contradictions.

| | E | N | C |
|---|---|---|---|
| Friends | 10 | 4 | 4 |
| TBBT | 7 | 7 | 4 |
| HIMYM | 5 | 8 | 5 |
| Greys | 7 | 6 | 5 |
| House | 10 | 4 | 4 |
| Castle | 6 | 7 | 5 |
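The 42/33/25 split can be reproduced from the per-show counts:

```python
# Per-show (E, N, C) hypothesis counts after the second pass.
counts = {
    "Friends": (10, 4, 4),
    "TBBT":    (7, 7, 4),
    "HIMYM":   (5, 8, 5),
    "Greys":   (7, 6, 5),
    "House":   (10, 4, 4),
    "Castle":  (6, 7, 5),
}
e, n, c = (sum(col) for col in zip(*counts.values()))
total = e + n + c
shares = [round(100 * x / total) for x in (e, n, c)]
```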
| | Friends | TBBT | HIMYM | Greys | House | Castle | Average |
|---|---|---|---|---|---|---|---|
| First pass | .61 | 1.0 | .89 | .78 | .61 | .94 | .81 |
| Second pass | .83 | .78 | .89 | .83 | .83 | .61 | .80 |
Vertical = GT label, Horizontal = Pred label.

First pass:

| | E | N | C |
|---|---|---|---|
| E | 32 | 3 | 2 |
| N | 2 | 24 | 3 |
| C | 2 | 9 | 31 |
Second pass:

| | E | N | C |
|---|---|---|---|
| E | 35 | 2 | 0 |
| N | 2 | 25 | 2 |
| C | 8 | 8 | 27 |
Generally, the model does poorly with contradictions (74% and 63% on the first and second passes, respectively) compared to entailments (86%, 95%) and neithers (82%, 86%). Trying to rephrase the second-pass prompt to improve scores here.
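These per-class numbers fall out of row-normalizing the confusion matrices above (helper name is mine):

```python
def per_class_accuracy(cm):
    """Diagonal of a row-normalized confusion matrix (rows = ground truth)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# E/N/C confusion matrices from the tables above (rows = GT, cols = pred).
first_pass  = [[32, 3, 2], [2, 24, 3], [2, 9, 31]]
second_pass = [[35, 2, 0], [2, 25, 2], [8, 8, 27]]
```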
Asking "are you sure?" (and variants) after initial prompt causes TBBT performance to drop slightly (e.g., from 78% to 72%).
For the TBBT transcripts, both 3-way and 5-way voting produced the same overall judgments. Mild variance among a few hypotheses (3 total), but generally the judgments were the same each time.
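The 3-way/5-way voting can be a simple majority over repeated judgments of the same hypothesis; a sketch (the NEITHER tie-break is my assumption, the notes don't specify one):

```python
from collections import Counter

def majority_vote(judgments):
    """Majority judgment across repeated runs of the same prompt.
    Ties fall back to NEITHER (assumed tie-breaking rule)."""
    (label, count), *rest = Counter(judgments).most_common()
    if rest and rest[0][1] == count:
        return "NEITHER"
    return label
```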
Textual entailment is defined as a directional relation between two text fragments, called text (t, the entailing text), and hypothesis (h, the entailed text), so that a human being, with common understanding of language and common background knowledge, can infer that h is most likely true on the basis of the content of t. Textual contradiction is a directional relation between two text fragments such that a human being can infer that h is most likely false on the basis of the content of t.
Prepending this to the TBBT experiments improves second-pass accuracy from 78% to 89%. Prepending this to the Castle experiments improves second-pass accuracy from 61% to 83%.
All other prepended definitions did not achieve as high performance on these two data subsets.
| | Friends | TBBT | HIMYM | Greys | House | Castle | Average |
|---|---|---|---|---|---|---|---|
| First pass | .61 | 1.0 | .89 | .78 | .61 | .94 | .81 |
| Second pass | .83 | .78 | .89 | .83 | .83 | .61 | .80 |
| New second pass | .83 | .83 | .78 | .83 | .83 | .78 | .81 |
New second pass confusion matrix (vertical is GT, horizontal is pred):

| | E | N | C |
|---|---|---|---|
| E | 37 | 0 | 0 |
| N | 3 | 24 | 2 |
| C | 7 | 8 | 27 |
| | TP | FP | FN | TN |
|---|---|---|---|---|
| First pass | 32 | 4 | 5 | 62 |
| Second pass | 35 | 10 | 2 | 59 |
| New second pass | 37 | 10 | 0 | 61 |
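Treating entailment as the positive class, the TP/FP/FN counts above translate directly into precision and recall (a small helper, names mine):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall, with entailment as the positive class."""
    return tp / (tp + fp), tp / (tp + fn)
```

For example, the second pass trades precision (35/45 ≈ .78) for recall (35/37 ≈ .95) relative to the first pass (32/36 ≈ .89, 32/37 ≈ .86), which is consistent with its higher false-positive count.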
Checking to see if a binary classification prompt improves the false-positive rate
Adding E/C/N definition to start of first-pass prompt.
| | Friends | TBBT | HIMYM | Greys | House | Castle | Average |
|---|---|---|---|---|---|---|---|
| First pass | .61 | 1.0 | .89 | .78 | .61 | .94 | .81 |
| Second pass | .83 | .78 | .89 | .83 | .83 | .61 | .80 |
| New second pass | .83 | .83 | .78 | .83 | .83 | .78 | .81 |
| New first pass | .78 | 1.0 | .89 | .83 | .94 | .83 | .88 |
New first pass confusion matrix (vertical is GT, horizontal is pred):

| | E | N | C |
|---|---|---|---|
| E | 34 | 1 | 1 |
| N | 2 | 25 | 9 |
| C | 1 | 0 | 35 |
False positives/negatives:

| | TP | FP | FN |
|---|---|---|---|
| First pass | 32 | 4 | 5 |
| Second pass | 35 | 10 | 2 |
| New second pass | 37 | 10 | 0 |
| New first pass | 34 | 3 | 2 |
Using ChatGPT (3.5):
Estimated dialogue tokens per transcript: ~320
Rest of first prompt: 251
Output: ~120
Cost per 9 hypotheses: 0.002/1000*(320+251+120)=0.001382
Cost for 100,000 hypotheses: 0.001382*100000/9=15.36
(Need 12,000 clips (TVQA has 21,800 clips total))
Using ChatGPT (3.5):
Estimated dialogue tokens per transcript: ~320
Rest of first prompt: 212
Output: ~136
Cost per 9 hypotheses: 0.002/1000*(320+212+136)=0.001336
Cost for 100,000 hypotheses: 0.001336*100000/9=14.84
For 100,000 hypotheses using ChatGPT: 14.84+15.36=$30.20
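The arithmetic above, as a reusable helper (assumes the $0.002 per 1K tokens gpt-3.5-turbo rate used in these notes):

```python
PRICE_PER_1K_TOKENS = 0.002  # USD; gpt-3.5-turbo rate assumed in the notes

def cost_for_100k_hypotheses(dialogue_tokens, prompt_tokens, output_tokens,
                             hyps_per_call=9):
    """Dollar cost to generate 100,000 hypotheses at 9 per API call."""
    per_call = PRICE_PER_1K_TOKENS / 1000 * (
        dialogue_tokens + prompt_tokens + output_tokens)
    return per_call / hyps_per_call * 100_000
```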
7-24 meeting notes
(transcript, hypothesis, judgment, (start-line, end-line))
`start-line` and `end-line` can shorten the transcript if it is longer than the context window of an encoder (these can be selected via GPT).
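One way to use the (start-line, end-line) span for shortening: keep the evidence window and grow it outward while the result fits a token budget. A rough sketch under my own assumptions, with a whitespace count standing in for a real tokenizer:

```python
def window_transcript(lines, start_line, end_line, max_tokens=512):
    """Slice a transcript to the [start_line, end_line) evidence window,
    widening symmetrically while the slice fits a rough token budget."""
    def n_tokens(chunk):
        # Crude whitespace estimate; swap in a real tokenizer for accuracy.
        return sum(len(line.split()) for line in chunk)

    lo, hi = start_line, end_line
    while lo > 0 or hi < len(lines):
        grown_lo, grown_hi = max(lo - 1, 0), min(hi + 1, len(lines))
        if n_tokens(lines[grown_lo:grown_hi]) > max_tokens:
            break
        lo, hi = grown_lo, grown_hi
    return lines[lo:hi]
```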