Goal

To generate a natural language inference dataset using dialogue from TV show scripts. For example:

(BURKE): I'd like to confirm those results.
(...): You just did.
(BURKE): In person. I want to confirm them in person.

entails Burke wants to confirm the results in person.

This dataset can be used either to train a filter for dialogue proof tree generation, or for general evaluation of existing filter models.

Text pre-processing

For example, we can use the following dialogue from the TVQA dataset:

(Ross:)We're just gonna need a little time.
(Monica:)I understand.
(Alan:)- Wow.
- I'm really sorry.
Yeah, I mean, I'm sorry too.
- But I'm a little relieved.
- Relieved?
Yeah, well, I mean,
I had a great time with you.
(Alan:)I just can't stand your friends.
(Rachel:)Remember when we went to Central Park
and rented boats?
(Rachel:)That was fun.
(Ross:)He could row like a Viking.
(Ross:)So how'd it go?
(Phoebe:)You know.
(Phoebe:)Did he mention us?
He says he's really gonna
miss you guys.
(Ross:)- Rough day, huh?
- You have no idea.
Come here.
(Chandler:)- That's it. I'm getting cigarettes.
- No!
I don't care! I'm weak!
(Chandler:)I've gotta have the smoke!
(Phoebe:)If you never smoke again,
I'll give you $7000.
Yeah, all right.

This passage contains substantial ambiguity in terms of who is saying what, but doesn't exactly line up with other transcripts so names can't be automatically added to each line. So, for now, I've mapped these transcripts to less ambiguous transcripts (for Friends, I'm pulling from http://friends.tktv.net/). When cleaned, this transcript looks like this:

ROSS: Hey hey, we'll be fine. We're just gonna need a little time.
MONICA: I understand.
ALAN: Wow.
MONICA: I'm, I'm really sorry.
ALAN: Yeah, I'm sorry too. But, I gotta tell you, I am a little relieved.
MONICA: Relieved?
ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.
RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.
ROSS: Yeah. He could row like a viking.
MONICA: Hi.
ALL: Mmm.
ROSS: So how'd it go?
MONICA: Oh, y'know..
PHOEBE: Did he mention us?
MONICA: He said he's really gonna miss you guys.
ROSS: You had a rough day, huh.. c'mere.
CHANDLER: ...That's it. I'm getting cigarettes.
ALL: No no no!
CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: If you never smoke again I'll give you seven thousand dollars!
CHANDLER: Yeah, alright.

The code to map and generate the cleaned alternate transcript is in tools/clean_text.py.

ChatGPT prompting

Two options for generating the dataset are (A) generating hypotheses with FLAN and then calculating judgments via ChatGPT and (B) doing both at once via ChatGPT. For simplicity, I think that going with option B might be preferable.

Baseline

Starting with Nathaniel's initial prompt:

Write a set of 9 natural language inference (NLI) test hypotheses using this transcript as the context. 3 of the hypotheses should be ENTAILED by the transcript, while 3 should be directly CONTRADICTED, and the last 3 should be NEITHER directly entailed nor contradicted. Your output format is a serialized json item on a single line with the following format: {"ENTAILMENT": ["<hypothesis>", ...], "CONTRADICTION": ["<hypothesis>"...], "NEITHER": ["<hypothesis>",...]} and nothing else.

The output for this is:

{"ENTAILMENT": ["Monica had a rough day.", "Chandler wants to smoke.", "Alan enjoyed spending time with Monica."], "CONTRADICTION": ["Monica's friends are likable.", "Chandler is determined to quit smoking.", "Alan didn't mention Monica's friends."], "NEITHER": ["Ross and Monica had a conversation.", "Rachel and Ross rented boats in Central Park.", "Phoebe offered Chandler money to quit smoking."]}

Single line prompting

Using a simpler prompt involving only one line of dialogue (ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.):

Write a set of three natural language inference (NLI) test hypotheses using this dialogue as the context. One hypothesis should be ENTAILED by the dialogue, one should be directly CONTRADICTED, and one should be NEITHER directly entailed nor contradicted. Your output format is a serialized json item on a single line with the following format: {"ENTAILMENT": "<hypothesis>", "CONTRADICTION": "<hypothesis>", "NEITHER": "<hypothesis>"} and nothing else.

The output is

{"ENTAILMENT": "Alan enjoyed spending time with the person despite disliking their friends.", "CONTRADICTION": "Alan enjoyed spending time with the person and liked their friends.", "NEITHER": "Alan's feelings about the person's friends are unknown."}

which produces correctly labeled hypotheses. Another two examples:

INPUT: PHOEBE: If you never smoke again I'll give you seven thousand dollars! OUTPUT: {"ENTAILMENT": "Phoebe promised to give someone seven thousand dollars if they never smoke again.", "CONTRADICTION": "Phoebe did not offer any monetary incentive for not smoking again.", "NEITHER": "Phoebe made a bet with someone to quit smoking."}

INPUT: MONICA: He said he's really gonna miss you guys. OUTPUT: {"ENTAILMENT": "Monica's friends are going to miss someone.", "CONTRADICTION": "Monica's friends are not going to miss anyone.", "NEITHER": "Monica's friends are going to throw a going-away party for someone."}

This last example includes additional inference that isn't grounded in the dialogue (the mention of "friends"). This is unfortunate and is worth keeping in mind, but most likely wouldn't have a particularly strong negative impact on training or evaluation.

Triplet prompting

Here are examples demonstrating how ChatGPT handles inference generation using dialogue triplets:

INPUT:

CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: If you never smoke again I'll give you seven thousand dollars!
CHANDLER: Yeah, alright.

OUTPUT: {"ENTAILMENT": "Chandler agrees to quit smoking in exchange for seven thousand dollars.", "CONTRADICTION": "Chandler continues to smoke despite Phoebe's offer of seven thousand dollars.", "NEITHER": "Phoebe gives Chandler seven thousand dollars."}

INPUT:

ALAN: Yeah, I'm sorry too. But, I gotta tell you, I am a little relieved.
MONICA: Relieved?
ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.

OUTPUT: {"ENTAILMENT": "Alan enjoyed spending time with Monica's friends.", "CONTRADICTION": "Alan mentioned that he had a great time with Monica's friends.", "NEITHER": "Alan expressed relief because Monica's friends were not present during their time together."}

INPUT:

PHOEBE: Did he mention us?
MONICA: He said he's really gonna miss you guys.
ROSS: You had a rough day, huh.. c'mere.

OUTPUT: {"ENTAILMENT": "Phoebe is going to miss someone.", "CONTRADICTION": "Ross had a smooth day.", "NEITHER": "Monica is upset."}

These inferences generally are of lower quality than the single-dialogue options.

Other prompting considerations

Adversarial triplets

In addition to regular issues with prompting with triplets, scene changes can occur without any note in the transcript. For example,

INPUT:

ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.
RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.
ROSS: Yeah. He could row like a viking.

OUTPUT: {"ENTAILMENT": "Alan enjoyed his time with Rachel despite disliking her friends.", "CONTRADICTION": "Alan had a great time with Rachel and liked her friends.", "NEITHER": "Rachel and Ross rented boats in Central Park."}

Okay well this one went fine but the case still stands that the model has no way of knowing that Alan wasn't involved in the conversation with Rachel and Ross.

Audio cues

The prompting setup does not consider tone or sarcasm, but neither does the overall architecture. For example, the passage "He said he's really going to miss you guys" is said sarcastically in the TV show, but this is not conveyed through the text. This is something to consider in the future, as audio is necessary here to identify the impact that delivery has on the meaning of the sentence.

Pricing

According to this website, ChatGPT token cost is $0.002/1k and a token is approx. 4 characters. Using the single-line prompt, unoptimized, a prompt for three inferences is ~550 characters and the output is ~250 characters.

This would be about ($0.002 / 1000) * (550 + 250) / 4 = $0.0004 for three hypotheses, or $40 for 300,000 hypotheses.

We could pull dialogue from any shows and use them for the dataset, so collecting 100,000 TV show dialogue lines shouldn't be difficult. If we did want to stick to the shows used in TVQA: The dataset consists of 21,800 video clips with approx. 22.5 dialogue lines per clip, or ~490,500 dialogue lines. Taking around 20% of this data to use as training data would also be reasonable.

We could also go with fewer hypotheses than this - this is just what I assume would be an appropriate upper-bound for dataset size if we are interested in training as well as evaluation.

One question to consider is whether to run some sort of bootstrapping procedure in order to cater the space of hypotheses towards those that will be generated as queries during proof search. i.e. If we know the nature of the recursive generations that FLAN or otherwise would be considering at each time step, we can generate some number of those and collect 'Gold' ChatGPT or human annotations for them. You can then the gold annotations as in-context exemplars when generating the rest of the dataset.

This would increase the cost per prompt by a bit, but might be worthwhile wrt the end task performance. Or you could try a little bit of both, do half non-decomposition-generator specific and half generator specific.

But anyways, i think this is something to bang out for pretty cheap. Perhaps you can run a couple studies comparing the quality of data and of judgements by ChatGPT vs GPT-4 (vs trained baselines, etc) for the sake of comparison. And run a turking task to verify that ChatGPT reaches something akin to human performance. Then wrap it into a workshop paper for EMNLP before using it for the TVQA system.

Sounds good. First step will be to refine proof search query generation enough that we can assess the value of collecting sample hypotheses for ChatGPT annotation collection vs. generating hypotheses directly via ChatGPT during annotation. Specifically, this means getting FLAN/GPT to output sub-queries given an initial hypothesis and a line of dialogue as partial evidence. ChatGPT was successful with very simple prompts, but I'm now collecting a more varied set of example data that requires more sophisticated branching.

Note: We might want to use Boyuan's coreference model for parsing dialogue. OR, we could provide additional dialogue as "background context" (samples at the bottom).

Proposed decomposition algorithm:

Input a hypothesis and a transcript
Select the dialogue line with the highest entailment score.
If entailment score above cutoff threshold, return.
Otherwise, generate a second hypothesis that would complete the proof given the original hypothesis and the candidate dialogue.
Recurse on the new hypothesis.

With this algorithm, the number of recursive steps depend on the number of dialogue lines necessary to prove the hypothesis.

Example decompositions

Base case

HYPOTHESIS: Wilson is uncomfortable with violating a dead colleague's dignity.
D1: (Wilson:) I'm uncomfortable violating a dead colleague's dignity.

Recursion level 1

HYPOTHESIS: The headmaster suggests Castle and Beckett talk to Donny's friends when they're inquiring about Donny.
D1: (Headmaster:)If you want to know the rest, you should probably talk to his friends.
H2: Castle and Beckett are inquiring about Donny.
> D2: (Castle:)Any idea what he would have been doing at Central Park at night?

Recursion level 2

HYPOTHESIS: House is talking to himself about others.
D1: (House:)on account of the fact that she's getting worse.
H2: House is talking to himself.
> D2: (House:)- Can you still hear me?
> H3: Someone says "no".
>> D3: - No.

HYPOTHESIS: There are only 5 things that could cause the problem and those 5 have been eliminated.
D1: (Chase:)The body only has so many ways
of screwing with the kidneys.
H2: There are five things that could cause the kidney problem and those five have been eliminated.
> D2: (Cameron:) That's it. Has to be one of those five.
> H3: The five things have been eliminated.
>> D3: (House:) She's 0 for five.

Recursion level 3

I ran through about 75 hypotheses and none of them required three levels of recursion.

Examples with ChatGPT

User:

Given a hypothesis and line of dialogue, write a fact that, alongside the dialogue, would make the hypothesis true.

HYPOTHESIS: Bob is talking to himself about his patient.
DIALOGUE: (BOB) In fact, she's getting worse.
FACT: Bob is talking to himself.

Given a hypothesis and line of dialogue, write a fact that, alongside the dialogue, would make the hypothesis true.

HYPOTHESIS: The headmaster suggests Castle and Beckett talk to Donny's friends when they're inquiring about Donny.
DIALOGUE: (HEADMASTER) If you want to know the rest, you should probably talk to Donny's friends.

ChatGPT:

 FACT: Castle and Beckett are inquiring about Donny.

User:

Given a hypothesis and line of dialogue, write a fact that, alongside the dialogue, would make the hypothesis true.

HYPOTHESIS: There are five things that could cause the kidney problem and those five have been eliminated.
DIALOGUE: (CAMERON) That's it. Has to be one of those five.

ChatGPT:

FACT: Five potential causes of the kidney problem were identified and eliminated through investigation.

User:

Given a hypothesis and line of dialogue, write a fact that, alongside the dialogue, would make the hypothesis true.

HYPOTHESIS: Frank said he was selling access on the side when explaining his security job to Ryan.
DIALOGUE: (FRANK) So, I made some cash on the side by selling access.

ChatGPT:

FACT: Frank is explaining his security job to Ryan.

User:

Given a hypothesis and line of dialogue, write a fact that, alongside the dialogue, would make the hypothesis true.

HYPOTHESIS: Cameron thought the Senator may have had a stroke when he was intubated because blood clots are common in paralyzed patients.
DIALOGUE: (CAMERON) He could've suffered a stroke when he was intubated.

ChatGPT:

FACT: The Senator was intubated, and blood clots are common in paralyzed patients.

ChatGPT coreference resolution

User:

Here is a transcript: 

ROSS: Hey hey, we'll be fine. We're just gonna need a little time.
MONICA: I understand.
ALAN: Wow.
MONICA: I'm, I'm really sorry.
ALAN: Yeah, I'm sorry too. But, I gotta tell you, I am a little relieved.
MONICA: Relieved?
ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.
RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.
ROSS: Yeah. He could row like a viking.
MONICA: Hi.
ALL: Mmm.
ROSS: So how'd it go?
MONICA: Oh, y'know..
PHOEBE: Did he mention us?
MONICA: He said he's really gonna miss you guys.
ROSS: You had a rough day, huh.. c'mere.
CHANDLER: ...That's it. I'm getting cigarettes.
ALL: No no no!
CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: If you never smoke again I'll give you seven thousand dollars!
CHANDLER: Yeah, alright.

Please rewrite the transcript with pronouns replaced by the name they refer to.

ChatGPT:

ROSS: Hey hey, Monica, we'll be fine. We're just gonna need a little time.
MONICA: I understand.
ALAN: Wow.
MONICA: I'm, I'm really sorry.
ALAN: Yeah, Monica, I'm sorry too. But, Monica, I gotta tell you, I am a little relieved.
MONICA: Relieved?
ALAN: Yeah, well, Monica, I had a great time with you.. I just can't stand your friends.
RACHEL: Monica, remember when we went to Central Park and rented boats? That was fun.
ROSS: Yeah, Monica. Alan could row like a viking.
MONICA: Hi, everyone.
ALL: Mmm.
ROSS: So how'd it go, Monica?
MONICA: Oh, y'know..
PHOEBE: Monica, did he mention us?
MONICA: Alan said he's really gonna miss you guys.
ROSS: Monica, you had a rough day, huh.. c'mere.
CHANDLER: ...That's it. I'm getting cigarettes.
ALL: No no no!
CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: Chandler, if you never smoke again I'll give you seven thousand dollars!
CHANDLER: Yeah, alright.

This actually does an excellent job of removing ambiguity, but it would be very costly to compute for the entire dataset. Worth keeping in mind if coreference resolution proves to be a significant problem.

Now comparing these sorts of generated hypotheses vs. on-the-fly inference hypotheses for dataset construction purposes

1

Note: We might want to use Boyuan's coreference model for parsing dialogue.

Input a hypothesis and a transcript Select the dialogue line with the highest entailment score. If entailment score above cutoff threshold, return. Otherwise, generate a second hypothesis that would complete the proof given the original hypothesis and the candidate dialogue. Recurse on the new hypothesis.

"select the dialogue line" is doing a lot of heavy lifting here. Is this meant to be a language-only baseline? Are you hoping to find single utterances that entail the hypothesis, or maybe do you want to keep the whole transcript snippet as grounding? I think the question of doing coref (which you're right, ChatGPT seems to be doing decently, though some yous were not replaced) depends on whether you're going to extract lines out of their contexts before recurring...I guess if you're using FLAN for generation, then it would indeed be helpful to decontextualize the statement prior to decomposition-- which model did you use for the recursion examples above?

Along these lines: once you retrieve D1, i wonder if you want to generate "H1" == the premise that got entailed by D1, i.e. the half of the composition not covered by H2, before generating H2. That might help generate good H2s and will increase interpretability. Maybe this would help avoid coref issues?

I ran through about 75 hypotheses and none of them required three levels of recursion.

So you got 75 good-looking trees? awesome! could you upload them to the repo for us to poke around at them?

Notes for proof branching & dialogue-hypothesis pair generation

Starting with dialogue chunk size d = 5 lines

My current toy dataset is 2034 hypothesis - dialogue pairs. Dialogue stats (lines per hypothesis by percentile):	Min	10th	25th	50th	75th	90th	Max
2	10	16	22	29	34	57

Note: Explicitly cutting up the transcript into individual chunks cuts up the transcript meaning pretty badly, e.g.

-----------------
ROSS: Hey hey, we'll be fine. We're just gonna need a little time.
MONICA: I understand.
ALAN: Wow.
MONICA: I'm, I'm really sorry.
ALAN: Yeah, I'm sorry too. But, I gotta tell you, I am a little relieved.
-----------------
MONICA: Relieved?
ALAN: Yeah, well, I had a great time with you.. I just can't stand your friends.
RACHEL: Remember when we went to Central Park and rented boats?.. That was fun.
ROSS: Yeah. He could row like a viking.
MONICA: Hi.
-----------------
ALL: Mmm.
ROSS: So how'd it go?
MONICA: Oh, y'know..
PHOEBE: Did he mention us?
MONICA: He said he's really gonna miss you guys.
-----------------
ROSS: You had a rough day, huh.. c'mere.
CHANDLER: ...That's it. I'm getting cigarettes.
ALL: No no no!
CHANDLER: I don't care, I don't care! Game's over! I'm weak! I've gotta smoke! I've gotta have the smoke!
PHOEBE: If you never smoke again I'll give you seven thousand dollars!
-----------------
CHANDLER: Yeah, alright.
-----------------

Example

For proof construction, currently extracting the supporting dialogue lines manually and then prompting ChatGPT to produce all of the downstream proof hypotheses.

Below is a (relatively) difficult dialogue-only TVQA example.

Hypothesis: Foreman rescinds his diagnosis.

CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy. So, start him on blood thinners, and if he has another stroke, then we’ll schedule the surgery.

Manual proof

H0: Foreman rescinds his diagnosis.

    > H1: Foreman thought the patient had bacterial endocarditis.
    >> D1: "HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

    > H2: Foreman indicates he no longer thinks the patient has bacterial endocarditis.
        >> H2A: Chase thinks the patient has an aneurysm.
        >>> D2A: "CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it."

        >> H2B: Foreman says he thinks Chase is right.
        >>> D2B: "FOREMAN: I think Chase is right."

GPT-generated proof: Depth 0

Dialogue is manually retrieved. This example is to observe GPT generation. All GPT-generated hypotheses should be passed through entailment filters to ensure that (1) no sub-hypothesis directly entails its parent hypothesis and (2) in conjunction, every pair of sub-hypotheses entail their parent hypothesis.

Current proof:

H0: Foreman rescinds his diagnosis.

GPT-generated proof: Depth 1 - H1 generation

Retrieved dialogue:

"HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

Prompt:

Write a set of statements that are entailed by the starred dialogue line in JSON format, i.e. {'1': '<answer here>', '2': '<answer here>', ...}.

TRANSCRIPT:
CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
***HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy.

ENTAILMENTS:

ChatGPT:

{
'1': 'Foreman thinks it might be bacterial endocarditis.',
'2': 'If the patient has bacterial endocarditis, blood thinners might be a viable treatment option.',
'3': 'Chase's suggestion is to perform vascular surgery to find and repair an aneurysm.',
'4': 'Chase and Foreman agree on the potential diagnosis and treatment plan.',
'5': 'House is concerned that if they choose surgery and Foreman's initial diagnosis of blood clots was correct, it could be fatal for the patient.',
}

Current proof:

H0: Foreman rescinds his diagnosis.
    > H1: Foreman thinks it might be bacterial endocarditis.
    > D1: "HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

GPT-generated proof: Depth 1 - H2 generation

Prompt:

Write entailments from the transcript that each make the hypothesis true when paired with the provided fact. Write your entailment in json format, i.e. {'1': <entailment 1>, '2': <entailment 1>, ...}.

HYPOTHESIS: Foreman rescinds his diagnosis.
FACT: Foreman thinks it might be bacterial endocarditis.
TRANSCRIPT:

CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy.

ENTAILMENTS:

ChatGPT:

{
'1': "Foreman diagnosed the patient with bacterial endocarditis.",
'2': "Foreman's previous diagnosis was about blood clots.",
'3': "Foreman's current opinion aligns with Chase's suggestion for vascular surgery.",
'4': "The patient might have blood clots.",
'5': "Chase's recommendation is to put the patient on blood thinners.",
'6': "Choosing the surgery route would likely result in the patient's death.",
'7': "Foreman agrees with Chase's assessment."
}

Current proof:

H0: Foreman rescinds his diagnosis.
    > H1: Foreman thinks it might be bacterial endocarditis.
    > D1: "HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

    > H2: Foreman's current opinion aligns with Chase's suggestion for vascular surgery.

GPT-generated proof: Depth 2 - H3 generation

Retrieved dialogue:

CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.

Prompt:

Write a set of statements that are entailed by the starred dialogue line in JSON format, i.e. {'1': '<answer here>', '2': '<answer here>', ...}.

TRANSCRIPT:
***CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy.

ENTAILMENTS:

ChatGPT:

{
'1': 'Chase suggests scheduling the patient for vascular surgery.',
'2': 'The surgery would involve going into the carotids to find and repair an aneurysm.',
'3': 'Cameron expresses concern that putting the patient on blood thinners might lead to excessive bleeding.',
'4': 'House considers the possibility of bacterial endocarditis as the underlying condition.',
'5': 'Foreman agrees with Chase's suggestion about vascular surgery.',
'6': 'House acknowledges that if Foreman's previous idea about blood clots was correct and they choose the surgery route, the patient might die.',
'7': 'There is a potential disagreement between Chase and House on one side and Foreman on the other regarding the best course of action for the patient.'
}

Current proof:

H0: Foreman rescinds his diagnosis.
    > H1: Foreman thinks it might be bacterial endocarditis.
    > D1: "HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

    > H2: Foreman's current opinion aligns with Chase's suggestion for vascular surgery.
        > H3: Chase suggests scheduling the patient for vascular surgery.
        > D3: "CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it."

GPT-generated proof: Depth 2 - H4 generation

Prompt:

Write entailments from the transcript that each make the hypothesis true when paired with the provided fact. Write your entailment in json format, i.e. {'1': <entailment 1>, '2': <entailment 1>, ...}.

HYPOTHESIS: Foreman's current opinion aligns with Chase's suggestion for vascular surgery.
FACT: Chase suggests scheduling the patient for vascular surgery.
TRANSCRIPT:

CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy.

ENTAILMENTS:

ChatGPT:

{
'1': "Foreman agrees with Chase's recommendation for vascular surgery.",
'2': "Foreman supports Chase's proposal for vascular surgery.",
'3': "Foreman is in favor of scheduling the patient for vascular surgery, as suggested by Chase.",
'4': "Foreman concurs with Chase's idea of going ahead with vascular surgery.",
'5': "Foreman believes that scheduling the patient for vascular surgery, as proposed by Chase, is the right course of action."
}

Current proof:

H0: Foreman rescinds his diagnosis.
    > H1: Foreman thinks it might be bacterial endocarditis.
    > D1: "HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

    > H2: Foreman's current opinion aligns with Chase's suggestion for vascular surgery.
        > H3: Chase suggests scheduling the patient for vascular surgery.
        > D3: "CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it."

        > H4: Foreman agrees with Chase's recommendation for vascular surgery.

GPT-generated proof: Depth 3 - Proof done

Retrieved dialogue: "FOREMAN: I think Chase is right."

Current proof:

H0: Foreman rescinds his diagnosis.
    > H1: Foreman thinks it might be bacterial endocarditis.
    > D1: "HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – "

    > H2: Foreman's current opinion aligns with Chase's suggestion for vascular surgery.
        > H3: Chase suggests scheduling the patient for vascular surgery.
        > D3: "CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it."

        > H4: Foreman agrees with Chase's recommendation for vascular surgery.
        > D4: "FOREMAN: I think Chase is right."

This proof isn't perfect. Technically doesn't work unless it is established that Foreman thought it might be bacterial endocarditis, and now agrees with Chase's suggestion. And, the hypothesis "Chase suggests scheduling the patient for vascular surgery." should really be "Chase believes the patient has an aneurysm". However, I think it's a decent starting point for the generation system given the complexity of the necessary inferences?

Case study takeaways

With the two prompts used in this example, ChatGPT seems capable of generating appropriate hypotheses for high-level dialogue-centric proofs.
However, these hypotheses need to be passed through entailment filters to ensure that they are logical.
Branching factor of un-filtered hypotheses seems to be around ~7.
Branching factor of filtered hypotheses will ideally be smaller, possibly 2~3.
The generation prompts used in this exercise work for dialogue-only proofs, but not necessarily proofs that involve visual content.

Next step: Exploring automatic ways to generate NLI for these generated dialogue-hypothesis pairs, and improving the outputs of ChatGPT - some of the prompts result in a set of 5-7 very similar outputs.

GPT tends to pull too much information from the related context and not generate enough (if any) hypotheses that are entailed by the highlighted dialogue line alone. I think that this will be a problem in that it will oversimplify most of the TVQA proofs and only show that "GPT can answer these questions". I'm also looking into ways to avoid this without just feeding single lines into ChatGPT without any context.

Update: Seeing if "dialogue triplets" is a better way to generate contextually-aware and explicitly grounded hypotheses.

Using dialogue triplets

Using the same example TVQA datapoint as last time.

Prompt:

Write five statements that are entailed by the dialogue in JSON format, i.e. {'1': '<answer here>', '2': '<answer here>', ...}.

TRANSCRIPT:
CAMERON: We put him on blood thinners, he might bleed out.
HOUSE: But if Foreman's right about it being bacterial endocarditis and we...
FOREMAN: I think Chase is right.

ENTAILMENTS:

ChatGPT:

{
'1': 'Cameron expressed concerns about putting the patient on blood thinners due to the risk of bleeding out.',
'2': 'Foreman agrees with Chase's opinion on the diagnosis of bacterial endocarditis.',
'3': 'House is skeptical about using blood thinners on the patient.',
'4': 'Chase's diagnosis is supported by Foreman.',
'5': 'There is a disagreement among the characters regarding the treatment plan for the patient.'
}

Ground-truth inferences for these:		1	2	3	4	5
NLI	E	N	N	E	E

Cheating and going with 4:

Prompt:

Write entailments from the transcript that each make the hypothesis true when paired with the provided fact. Write your entailment in json format, i.e. {'1': <entailment 1>, '2': <entailment 1>, ...}.

HYPOTHESIS: Foreman rescinds his diagnosis.
FACT: Chase's diagnosis is supported by Foreman.

TRANSCRIPT:
CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy.

ENTAILMENTS:

ChatGPT:

{
'1': 'Foreman initially thought it was bacterial endocarditis.',
'2': 'Foreman's previous diagnosis was about blood clots.',
'3': 'Foreman agreed with Chase on his diagnosis.',
'4': 'Foreman supported Chase's diagnosis of vascular surgery.',
'5': 'Foreman reevaluated his diagnosis and now agrees with Chase.'
}

1 here would complete the proof better than the original proof example did and is also entailed by the original triplet. I think that working with dialogue triplets will simplify the problem described above.

For reference, here are the hypotheses produced by GPT in step 2 when using hypothesis 1: Cameron expressed concerns about putting the patient on blood thinners due to the risk of bleeding out. and hypothesis 5: There is a disagreement among the characters regarding the treatment plan for the patient.:

Hypothesis 1 output:

{
'1': 'Foreman previously diagnosed the patient with blood clots.',
'2': 'Foreman's diagnosis was considered to be bacterial endocarditis.',
'3': 'Chase's suggestion of scheduling vascular surgery was based on the assumption that Foreman was right about the diagnosis.',
'4': 'Putting the patient on blood thinners could lead to potential bleeding out, according to Cameron.',
'5': 'The surgery route is risky and may result in the patient's death if Foreman's diagnosis of bacterial endocarditis is correct.'
}

Hypothesis 5 output:

{
'1': 'Foreman made a diagnosis that he now disagrees with.',
'2': 'Foreman's diagnosis was related to bacterial endocarditis.',
'3': 'Chase's recommendation is to schedule the patient for vascular surgery.',
'4': 'Using blood thinners might cause the patient to bleed out.',
'5': 'The characters are discussing the risks of different treatment plans.',
'6': 'House is concerned that the surgery route might result in the patient's death.'
}

Each of these outputs either (1) do not entail the main hypothesis in conjunction with the original sub-hypothesis or (2) in the case of 'Foreman made a diagnosis that he now disagrees with.', restates the main hypothesis and would extend the recursion depth of the proof.

One aspect of the dialogue that isn't considered through this GPT prompting approach is the relative time at which each dialogue triplet occurs in the text. I'm not sure how this aspect of the transcript can be incorporated into the proof system without building an additional time-based logic system (which could be interesting).

Nice example for motivation

Prompt:

Use the transcript to prove the hypothesis.

HYPOTHESIS: Foreman rescinds his diagnosis.

TRANSCRIPT:
CHASE: Then we should schedule him for vascular surgery. Go into the carotids, find the aneurysm, repair it.
CAMERON: If we put him on blood thinners, he might bleed out.
HOUSE: But if Foreman’s right about it being bacterial endocarditis, and we – 
FOREMAN: I think Chase is right.
HOUSE: Okay, if Foreman used to be right about it being blood clots, and we take the surgery route, then we’ll probably kill the guy.

PROOF:

ChatGPT:

The transcript does not provide sufficient evidence to prove the hypothesis that "Foreman rescinds his diagnosis." The conversation in the transcript indicates a medical discussion among the characters Chase, Cameron, House, and Foreman regarding the patient's condition and treatment options. Foreman's statement, "I think Chase is right," does not necessarily imply that he is rescinding his diagnosis. It only suggests that he agrees with Chase's perspective on the matter at that particular moment.

Filtering generated hypotheses

As shown in the previous examples, there is currently nothing checking to ensure that the branching hypotheses are logically sound.

Using hypotheses generated in the "Using dialogue triplets" comment as toy test data for NLI models. The produced H2 hypotheses in this section are a mixture of entailments, neutral statements, and contradictions when considering the original hypothesis, H0.

RoBERTa model trained on SNLI and MultiNLI: Model fails on most of the inputs - given "Foreman rescinds his diagnosis" as the first sentence, it considers anything involving "Foreman agrees with Chase" to be a contradiction, and labels everything else as "neutral".

This may be an issue with the medical jargon present in the example, so went with another data point, but still strong "neutral" bias, e.g., "Phoebe promises to give Chandler money if he stops smoking." and "Phoebe offers Chandler $7000 to quit smoking." results in a Neutral entailment output.

It looks like narrative-centric sentences might be too much of a domain shift for models trained on SNLI/MultiNLI.

"Time logic" notes

For simplicity, currently limiting the input data to the subsections labeled in TVQA as the primary evidence for each QA pair. This can be extended later on to increase the difficulty of the task.

Initially, it was planned to incorporate "time logic" based on the time-related keywords present in the generated hypothesis. However, upon further inspection, a bit over half of the hypotheses in the hypothesis set use the word "where" and a very small portion (~1%) use "before" or "after". Therefore, it would make the most sense to ignore this and focus on retrieving the relevant dialogue from a localized section of the video.

Silver entailment data

Generating (hypothesis, dialogue triplet) pairs for entailment generation

Will use the hypothesis training set (~2000 hypotheses)
Use SBERT to retrieve relevant dialogue chunks from the "evidence" section of the TV clip
With this data, retrieve sub-hypotheses via ChatGPT (branch factor of 5)
Retrieve dialogue with SBERT for the sub-hypotheses

Running 5 examples to see data quality (putting results below). If good selection of E vs. N vs. C, can collect full set. If retrieving ~3 dialogue lines per hypothesis (and going with n=5 branching for hypothesis generation), can generate (1 x 3) + (1 x 3 x 5 x 3) = 48 hypothesis-dialogue pairs per original hypothesis, or 98,064 pairs in total. This would be much cheaper than $50, but still need a way to annotate it

katesanders9 / multimodal-proofs

Silver entailment dataset generation #3

Goal

Text pre-processing

ChatGPT prompting

Baseline

Single line prompting

Triplet prompting

Other prompting considerations

Adversarial triplets

Audio cues

Pricing

Proposed decomposition algorithm:

Example decompositions

Base case

Recursion level 1

Recursion level 2

Recursion level 3

Examples with ChatGPT

ChatGPT coreference resolution

1

Notes for proof branching & dialogue-hypothesis pair generation

Example

Manual proof

GPT-generated proof: Depth 0

GPT-generated proof: Depth 1 - H1 generation

GPT-generated proof: Depth 1 - H2 generation

GPT-generated proof: Depth 2 - H3 generation

GPT-generated proof: Depth 2 - H4 generation

GPT-generated proof: Depth 3 - Proof done

Case study takeaways

Using dialogue triplets

Nice example for motivation

Filtering generated hypotheses

"Time logic" notes

Silver entailment data