katesanders9 commented 11 months ago

7-24 meeting notes

Working with the entire transcript at each recursive level will likely improve accuracy as well as interpretability
Preliminary silver-annotated datapoint: (transcript, hypothesis, judgment, (start-line, end-line))
start-line and end-line can shorten the transcript if it is longer than the context window of an encoder (these can be selected via GPT)
Silver data could also be made from unlabeled transcripts to improve domain range, either now or later on
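
A minimal sketch of the datapoint tuple described in these notes, assuming 0-indexed, inclusive line offsets. The class name, field names, and the `window` helper are illustrative and do not come from the project code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SilverDatapoint:
    """One silver-annotated example; names here are hypothetical."""
    transcript: List[str]   # one dialogue line per entry
    hypothesis: str
    judgment: str           # "ENTAILMENT", "CONTRADICTION", or "NEITHER"
    span: Tuple[int, int]   # (start_line, end_line), 0-indexed, inclusive

    def window(self) -> List[str]:
        """Return only the transcript lines inside the (start, end) span."""
        start, end = self.span
        if not 0 <= start <= end < len(self.transcript):
            raise ValueError(f"span {self.span} is outside the transcript")
        return self.transcript[start : end + 1]


example = SilverDatapoint(
    transcript=[
        "MONICA: Relieved?",
        "ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.",
        "RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.",
    ],
    hypothesis="Alan likes all of Monica's friends.",
    judgment="CONTRADICTION",
    span=(1, 1),
)
print(example.window())
```

Keeping the full transcript next to the span means the shortened window can be recomputed for encoders with different context lengths.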

katesanders9 commented 11 months ago

Summary

When split into three separate prompts and provided with the entire transcript each time, ChatGPT is able to generate NLI hypotheses and correctly label them with judgments and evidence.

Data generation example

DIALOGUE:

ROSS: Hey hey, we'll be fine. We're just gonna need a little time.
MONICA: I understand.
ALAN: Wow.
MONICA: I'm, I'm really sorry.
ALAN: Yeah, I'm sorry too. But, I gotta tell you, I am a little relieved.
MONICA: Relieved?
ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.
RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.
ROSS: Yeah. He could row like a viking.
MONICA: Hi.
ALL: Mmm.
ROSS: So how'd it go?
MONICA: Oh, y'know..
PHOEBE: Did he mention us?
MONICA: He said he's really gonna miss you guys.
ROSS: You had a rough day, huh.. c'mere.
CHANDLER: ...That's it. I'm getting cigarettes.
ALL: No no no!
CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: If you never smoke again I'll give you seven thousand dollars!
CHANDLER: Yeah, alright.

PROMPT:

Write a set of 9 natural language inference (NLI) test hypotheses using this transcript as the context. 3 of the hypotheses should be ENTAILED by the transcript, while 3 should be directly CONTRADICTED, and the last 3 should be NEITHER directly entailed nor contradicted. Your output format is a serialized json item on a single line with the following format: {"ENTAILMENT": ["<hypothesis>", ...], "CONTRADICTION": ["<hypothesis>"...], "NEITHER": ["<hypothesis>",...]} and nothing else.

OUTPUT:

{"ENTAILMENT": [
"Monica feels sorry for something that happened.",
"Alan had a good time with Monica and her friends.",
"Chandler promised not to smoke again after Phoebe's offer."
],
"CONTRADICTION": [
"Ross enjoyed rowing the boats in Central Park.",
"Alan likes all of Monica's friends.",
"Chandler successfully quit smoking."
],
"NEITHER": [
"Ross and Monica rented boats at the beach.",
"Phoebe is disappointed in Chandler's decision.",
"Rachel dislikes Central Park."
]
}
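
The output above could be turned into (hypothesis, label) pairs roughly as follows. This is a sketch that assumes the model returned valid JSON with exactly the three keys requested in the prompt; `parse_first_pass` is a hypothetical helper name.

```python
import json
from typing import Dict, List, Tuple

LABELS = ("ENTAILMENT", "CONTRADICTION", "NEITHER")


def parse_first_pass(raw: str, per_label: int = 3) -> List[Tuple[str, str]]:
    """Flatten the first-pass JSON into (hypothesis, label) pairs."""
    data: Dict[str, List[str]] = json.loads(raw)
    pairs: List[Tuple[str, str]] = []
    for label in LABELS:
        hypotheses = data.get(label, [])
        # The prompt asks for a fixed number of hypotheses per label.
        if len(hypotheses) != per_label:
            raise ValueError(
                f"expected {per_label} {label} hypotheses, got {len(hypotheses)}"
            )
        pairs.extend((hypothesis, label) for hypothesis in hypotheses)
    return pairs


raw_output = json.dumps({
    "ENTAILMENT": [
        "Monica feels sorry for something that happened.",
        "Alan had a good time with Monica and her friends.",
        "Chandler promised not to smoke again after Phoebe's offer.",
    ],
    "CONTRADICTION": [
        "Ross enjoyed rowing the boats in Central Park.",
        "Alan likes all of Monica's friends.",
        "Chandler successfully quit smoking.",
    ],
    "NEITHER": [
        "Ross and Monica rented boats at the beach.",
        "Phoebe is disappointed in Chandler's decision.",
        "Rachel dislikes Central Park.",
    ],
})

for hypothesis, label in parse_first_pass(raw_output):
    print(label, "-", hypothesis)
```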

The correct labels for these hypotheses (I think?) are:

E - "Monica feels sorry for something that happened.",
E - "Alan had a good time with Monica and her friends.",
E - "Chandler promised not to smoke again after Phoebe's offer."

E - "Ross enjoyed rowing the boats in Central Park.",
C - "Alan likes all of Monica's friends.",
N - "Chandler successfully quit smoking."

C - "Ross and Monica rented boats at the beach.",
N - "Phoebe is disappointed in Chandler's decision.",
C - "Rachel dislikes Central Park."
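
For this one transcript, the labels the first pass assigned can be scored against the hand labels listed above. The sketch below assumes exact string matching on hypotheses; `accuracy` is an illustrative helper, not project code.

```python
from typing import Dict

E, C, N = "ENTAILMENT", "CONTRADICTION", "NEITHER"


def accuracy(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of hypotheses in `gold` whose predicted label matches."""
    if not gold:
        raise ValueError("gold labels are empty")
    correct = sum(pred.get(hypothesis) == label for hypothesis, label in gold.items())
    return correct / len(gold)


hypotheses = [
    "Monica feels sorry for something that happened.",
    "Alan had a good time with Monica and her friends.",
    "Chandler promised not to smoke again after Phoebe's offer.",
    "Ross enjoyed rowing the boats in Central Park.",
    "Alan likes all of Monica's friends.",
    "Chandler successfully quit smoking.",
    "Ross and Monica rented boats at the beach.",
    "Phoebe is disappointed in Chandler's decision.",
    "Rachel dislikes Central Park.",
]

# Labels implied by the first-pass output (3 per category, in order).
first_pass = dict(zip(hypotheses, [E, E, E, C, C, C, N, N, N]))
# Hand labels from the list above.
hand_labels = dict(zip(hypotheses, [E, E, E, E, C, N, C, N, C]))

print(f"{accuracy(first_pass, hand_labels):.2f}")  # 0.56 (5 of 9 agree)
```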

SECOND PASS PROMPT: Given the provided transcript, label each hypothesis with one of the following natural language inference judgments: ENTAILMENT, CONTRADICTION, or NEITHER. Your output format is a serialized json item on a single line with the following format: {"<hypothesis>": "<judgment>", "<hypothesis>": "<judgment>", ...} and nothing else.

OUTPUT:

{"Monica feels sorry for something that happened.": "ENTAILMENT",
"Alan had a good time with Monica and her friends.": "ENTAILMENT",
"Chandler promised not to smoke again after Phoebe's offer.": "ENTAILMENT",
"Ross enjoyed rowing the boats in Central Park.": "ENTAILMENT",
"Alan likes all of Monica's friends.": "CONTRADICTION",
"Chandler successfully quit smoking.": "NEITHER",
"Ross and Monica rented boats at the beach.": "CONTRADICTION",
"Phoebe is disappointed in Chandler's decision.": "NEITHER",
"Rachel dislikes Central Park.": "CONTRADICTION"}

Evidence generation prompts

Note: These have to be separated into "entailment" and "contradiction" prompts for the outputs to be appropriate.

THIRD PASS (ENTAILMENT) PROMPT: Given the provided transcript, label each hypothesis with the transcript lines that make it true. Your output format is a serialized json item on a single line with the following format: {"<hypothesis>": ["<line 1>", "<line 2>", ... ], "<hypothesis>": ["<line 1>", "<line 2>", ... ], ...} and nothing else.

OUTPUT:

{
  "Monica feels sorry for something that happened.": ["MONICA: I'm, I'm really sorry."],
  "Alan had a good time with Monica and her friends.": ["ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends."],
  "Chandler promised not to smoke again after Phoebe's offer.": ["CHANDLER: If you never smoke again I'll give you seven thousand dollars!", "CHANDLER: Yeah, alright."],
  "Ross enjoyed rowing the boats in Central Park.": ["ROSS: Yeah. He could row like a viking."]
}

THIRD PASS (CONTRADICTION) PROMPT: Given the provided transcript, label each hypothesis with the transcript lines that make it false. Your output format is a serialized json item on a single line with the following format: {"<hypothesis>": ["<line 1>", "<line 2>", ... ], "<hypothesis>": ["<line 1>", "<line 2>", ... ], ...} and nothing else.

OUTPUT:

{
"Alan likes all of Monica's friends.": ["Yeah, well, I had a great time with you.. I just can't stand your friends."],
"Ross and Monica rented boats at the beach.": ["Remember when we went to Central Park and rented boats?.. That was fun."],
"Rachel dislikes Central Park.": ["Remember when we went to Central Park and rented boats?.. That was fun."]
}
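
Since the evidence is returned as quoted lines, one cheap sanity check is to confirm that each line really occurs in the transcript. In the entailment output above, one line is attributed to CHANDLER although the transcript gives it to PHOEBE, and the contradiction output drops speaker names entirely. The sketch below flags the first case and tolerates the second; the matching rule and helper names are assumptions, not part of the original pipeline.

```python
import re
from typing import Dict, List

# A speaker prefix is an all-caps name followed by a colon, e.g. "ROSS: ".
SPEAKER = re.compile(r"^[A-Z][A-Z ]*:\s*")


def strip_speaker(line: str) -> str:
    """Drop a leading 'NAME: ' prefix and surrounding whitespace."""
    return SPEAKER.sub("", line.strip())


def ungrounded_evidence(
    evidence: Dict[str, List[str]], transcript: List[str]
) -> Dict[str, List[str]]:
    """Return, per hypothesis, the evidence lines that match no transcript line.

    A line with a speaker prefix must equal a transcript line exactly; a line
    without one must equal some transcript line's utterance.
    """
    full_lines = {line.strip() for line in transcript}
    utterances = {strip_speaker(line) for line in transcript}
    missing: Dict[str, List[str]] = {}
    for hypothesis, lines in evidence.items():
        bad = []
        for line in lines:
            text = line.strip()
            has_speaker = SPEAKER.match(text) is not None
            grounded = text in full_lines if has_speaker else text in utterances
            if not grounded:
                bad.append(line)
        if bad:
            missing[hypothesis] = bad
    return missing


transcript = [
    "ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.",
    "PHOEBE: If you never smoke again I'll give you seven thousand dollars!",
    "CHANDLER: Yeah, alright.",
]
evidence = {
    "Chandler promised not to smoke again after Phoebe's offer.": [
        "CHANDLER: If you never smoke again I'll give you seven thousand dollars!",
        "CHANDLER: Yeah, alright.",
    ],
    "Alan likes all of Monica's friends.": [
        "Yeah, well, I had a great time with you.. I just can't stand your friends.",
    ],
}
print(ungrounded_evidence(evidence, transcript))
```

Only the misattributed CHANDLER line is reported; the speakerless ALAN line is accepted because its utterance matches.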

katesanders9 commented 11 months ago

GPT Accuracy across 12 transcripts (108 hypotheses)

12 transcripts (2 per show)
Average of 22.8 lines per transcript
108 hypotheses (9 per transcript)

GT judgment distribution

Labels shown are those after the second pass; after the first pass the labels are even (6 of each per show).

42% entailments, 33% neither, 25% contradictions

	E	N	C
Friends	10	4	4
TBBT	7	7	4
HIMYM	5	8	5
Greys	7	6	5
House	10	4	4
Castle	6	7	5

Accuracy by show

	Friends	TBBT	HIMYM	Greys	House	Castle	Average
First pass	.61	1.0	.89	.78	.61	.94	.81
Second pass	.83	.78	.89	.83	.83	.61	.80

Accuracy by GT label

Vertical = GT label, Horizontal = Pred label

First pass:
	E	N	C
E	32	3	2
N	2	24	3
C	2	9	31

Second pass:
	E	N	C
E	35	2	0
N	2	25	2
C	8	8	27

katesanders9 commented 11 months ago

Generally, the model does worse on contradictions (74% first pass, 63% second pass) than on entailments (86%, 95%) and neithers (83%, 86%). Trying to rephrase the second-pass prompt to improve scores here.
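
The per-label percentages can be recomputed from the two confusion matrices by dividing each diagonal entry by its row total. A short sketch, assuming rows are GT labels and columns are predictions in E, N, C order; the function name is illustrative.

```python
from typing import Dict, List

LABELS = ["E", "N", "C"]


def per_label_accuracy(matrix: List[List[int]]) -> Dict[str, float]:
    """Row-wise accuracy: diagonal count divided by the row (GT) total."""
    result: Dict[str, float] = {}
    for i, label in enumerate(LABELS):
        total = sum(matrix[i])
        result[label] = matrix[i][i] / total if total else float("nan")
    return result


# Rows = GT label, columns = predicted label, both in E, N, C order.
first_pass = [[32, 3, 2], [2, 24, 3], [2, 9, 31]]
second_pass = [[35, 2, 0], [2, 25, 2], [8, 8, 27]]

for name, matrix in [("First pass", first_pass), ("Second pass", second_pass)]:
    scores = per_label_accuracy(matrix)
    print(name, {label: f"{score:.0%}" for label, score in scores.items()})
```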

katesanders9 commented 11 months ago

Potential improvements for second pass:

Ask "are you sure?"
Majority votes
Include E/N/C definitions in prompt

Are you sure

Asking "are you sure?" (and variants) after initial prompt causes TBBT performance to drop slightly (e.g., from 78% to 72%).

Majority votes

For the TBBT transcripts, both 3-way and 5-way voting produced the same overall judgments. Mild variance among a few hypotheses (3 total), but generally the judgments were the same each time.
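
A minimal sketch of the voting step, assuming each run returns a hypothesis-to-judgment mapping in the second-pass output format. The three runs below are made-up values for illustration, not the TBBT results.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(runs: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the most common judgment per hypothesis across runs.

    Ties resolve to the judgment seen first, since Counter.most_common
    keeps first-encountered order among equal counts.
    """
    if not runs:
        raise ValueError("need at least one run")
    voted: Dict[str, str] = {}
    for hypothesis in runs[0]:
        votes = Counter(run[hypothesis] for run in runs if hypothesis in run)
        voted[hypothesis] = votes.most_common(1)[0][0]
    return voted


# Hypothetical outputs of three second-pass runs on two hypotheses.
runs = [
    {"Chandler successfully quit smoking.": "NEITHER",
     "Rachel dislikes Central Park.": "CONTRADICTION"},
    {"Chandler successfully quit smoking.": "NEITHER",
     "Rachel dislikes Central Park.": "CONTRADICTION"},
    {"Chandler successfully quit smoking.": "CONTRADICTION",
     "Rachel dislikes Central Park.": "CONTRADICTION"},
]
print(majority_vote(runs))
```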

E/N/C definitions

Textual entailment is defined as a directional relation between two text fragments, called text (t, the entailing text), and hypothesis (h, the entailed text), so that a human being, with common understanding of language and common background knowledge, can infer that h is most likely true on the basis of the content of t. Textual contradiction is a directional relation between two text fragments such that a human being can infer that h is most likely false on the basis of the content of t.

Prepending this to the TBBT experiments improves second-pass accuracy from 78% to 89%. Prepending this to the Castle experiments improves second-pass accuracy from 61% to 83%.

All other prepended definitions did not achieve as high performance on these two data subsets.

	Friends	TBBT	HIMYM	Greys	House	Castle	Average
First pass	.61	1.0	.89	.78	.61	.94	.81
Second pass	.83	.78	.89	.83	.83	.61	.80
New second pass	.83	.83	.78	.83	.83	.78	.81

New second pass confusion matrix: Vertical is GT, horizontal is pred
	E	N	C
E	37	0	0
N	3	24	2
C	7	8	27

Comparing false positives/negatives (binary classification)

	TP	FP	FN	TN
First pass	32	4	5	62
Second pass	35	10	2	59
New second pass	37	10	0	61
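
The binary counts can be derived from a three-way confusion matrix by treating entailment as the positive class and merging N and C into the negative class. The sketch below assumes that reading and reproduces the "New second pass" row; `binary_counts` is an illustrative name.

```python
from typing import List, Tuple


def binary_counts(matrix: List[List[int]], positive: int = 0) -> Tuple[int, int, int, int]:
    """Collapse a square confusion matrix (rows = GT, columns = pred)
    into (TP, FP, FN, TN) for the class at index `positive`."""
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp               # positive GT, predicted otherwise
    fp = sum(row[positive] for row in matrix) - tp  # predicted positive, GT otherwise
    tn = sum(map(sum, matrix)) - tp - fn - fp
    return tp, fp, fn, tn


# New second pass matrix, rows and columns in E, N, C order.
new_second_pass = [[37, 0, 0], [3, 24, 2], [7, 8, 27]]
tp, fp, fn, tn = binary_counts(new_second_pass)
print(tp, fp, fn, tn)  # 37 10 0 61
print(f"false-positive rate: {fp / (fp + tn):.2f}")
```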

katesanders9 commented 11 months ago

Checking to see if a binary classification prompt improves the false-positive rate

katesanders9 commented 11 months ago

Adding E/C/N definition to start of first-pass prompt.

	Friends	TBBT	HIMYM	Greys	House	Castle	Average
First pass	.61	1.0	.89	.78	.61	.94	.81
Second pass	.83	.78	.89	.83	.83	.61	.80
New second pass	.83	.83	.78	.83	.83	.78	.81
New first pass	.78	1.0	.89	.83	.94	.83	.88

New first pass confusion matrix: Vertical is GT, horizontal is pred
	E	N	C
E	34	1	1
N	2	25	9
C	1	0	35

False positives/negatives
	TP	FP	FN
First pass	32	4	5
Second pass	35	10	2
New second pass	37	10	0
New first pass	34	3	2

katesanders9 commented 11 months ago

Dataset cost

First pass:

Using ChatGPT (3.5):
Estimated dialogue tokens per transcript: ~320
Rest of first prompt: 251
Output: ~120
Cost per 9 hypotheses: 0.002/1000*(320+251+120) = $0.001382
Cost for 100,000 hypotheses: 0.001382*100000/9 = $15.36
(Need 12,000 clips; TVQA has 21,800 clips total)

Second pass:

Using ChatGPT (3.5):
Estimated dialogue tokens per transcript: ~320
Rest of second-pass prompt: 212
Output: ~136
Cost per 9 hypotheses: 0.002/1000*(320+212+136) = $0.001336
Cost for 100,000 hypotheses: 0.001336*100000/9 = $14.84

Total cost

For 100,000 hypotheses using ChatGPT: $14.84 + $15.36 = $30.20
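
The estimate can be recomputed from the token counts with a few lines of Python. This sketch assumes the flat rate of $0.002 per 1,000 tokens used in the calculations above and one API call per 9 hypotheses; the function names are illustrative.

```python
def cost_per_call(dialogue_tokens: int, prompt_tokens: int, output_tokens: int,
                  usd_per_1k_tokens: float = 0.002) -> float:
    """Cost in USD of one call, charging input and output at the same rate."""
    total_tokens = dialogue_tokens + prompt_tokens + output_tokens
    return usd_per_1k_tokens / 1000 * total_tokens


def cost_for(n_hypotheses: int, per_call: float, hypotheses_per_call: int = 9) -> float:
    """Cost in USD of generating or labeling `n_hypotheses` hypotheses."""
    return per_call * n_hypotheses / hypotheses_per_call


first = cost_per_call(320, 251, 120)    # first pass: generate hypotheses
second = cost_per_call(320, 212, 136)   # second pass: relabel hypotheses

first_total = cost_for(100_000, first)
second_total = cost_for(100_000, second)
print(f"first pass:  ${first_total:.2f}")
print(f"second pass: ${second_total:.2f}")
print(f"total:       ${first_total + second_total:.2f}")
```

Passing a different `usd_per_1k_tokens` or `hypotheses_per_call` makes it easy to re-estimate the budget if pricing or the prompt format changes.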

katesanders9 / multimodal-proofs

Silver entailment dataset generation - With context #4
