katesanders9 / multimodal-proofs

Code for multimodal neuro-symbolic proof generation for TV shows

Dialogue retrieval index construction #1

Open katesanders9 opened 1 year ago

katesanders9 commented 1 year ago

Overview

TBD

Progress

Evaluation

Each question in TVQA is paired with a timestamp annotation marking the video data that answers the question. As all transcript annotations are marked with timestamps, the relevant dialogue can be directly mapped to each TVQA question.

Each TVQA question can be divided into an interrogative clause and a temporal clause. Each temporal clause is preceded by a time-centric keyword ("when", "after", "before") indicating where the question-relevant video portion exists in relation to the temporal clause.

Initial evaluation will be done on TVQA questions that include the "when" keyword, because these questions (contd, TBD)
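As a concrete illustration, here is a minimal sketch (not project code) of splitting a question on these temporal keywords; the keyword list and the regex-based split are assumptions based on the description above.

```python
import re

# Assumed keyword list, per the description above.
TEMPORAL_KEYWORDS = ("when", "after", "before")

def split_question(question: str):
    """Split a TVQA question into (interrogative clause, keyword, temporal clause)."""
    pattern = r"\b(" + "|".join(TEMPORAL_KEYWORDS) + r")\b"
    match = re.search(pattern, question, flags=re.IGNORECASE)
    if match is None:
        return question.strip(), None, None  # no temporal clause found
    interrogative = question[:match.start()].strip()
    temporal = question[match.end():].strip(" ?")
    return interrogative, match.group(1).lower(), temporal

print(split_question("What room was Wilson breaking into when House found him?"))
# ('What room was Wilson breaking into', 'when', 'House found him')
```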

nweir127 commented 1 year ago

step 4: https://github.com/nweir127/guided_inference/blob/main/src/t5_qa2d.py

I can point you to the right checkpoint path on brtx.

Or, if you want to use FLAN to do ICL over some of your own handwritten examples, that might be better since it'll be more domain-specific.
flan.zip

katesanders9 commented 1 year ago

Hypothesis generation notes

Hypotheses are generated from TVQA question-answer pairs using FLAN-T5-XL.

Zero-shot learning

First, FLAN-T5 was evaluated on TVQA data without in-context learning. It successfully generated full hypotheses when the question was worded in grammatically correct English and the answer was not a full sentence, e.g.:

>>> Combine the question/answer pair into a single declarative statement:
Q: What room was Wilson breaking into when House found him?
A: The kitchen.
<<< Wilson was breaking into the kitchen when House found him.
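For reference, a minimal sketch of running such a zero-shot query with Hugging Face transformers; the google/flan-t5-xl checkpoint name and generation settings are assumptions, and the prompt mirrors the example above.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl").to("cuda")

prompt = (
    "Combine the question/answer pair into a single declarative statement:\n"
    "Q: What room was Wilson breaking into when House found him?\n"
    "A: The kitchen."
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# per the example above: "Wilson was breaking into the kitchen when House found him."
```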

Sometimes, the model also produces the correct output when the answer is a complete sentence:

>>> Combine the question/answer pair into a single declarative statement:
Q: What instrument is Raj playing when Raj and Howard have their show?
A: Raj is playing guitar.
<<< Raj is playing guitar when Raj and Howard have their show.

Notably, this query does not succeed when "declarative statement" is replaced with "hypothesis":

>>> Combine the question/answer pair into a single hypothesis:
Q: What instrument is Raj playing when Raj and Howard have their show?
A: Raj is playing guitar.
<<< It is not possible to tell.

However, for other TVQA pairs where the answer is a complete sentence, the model leaves out important details from the question when constructing the declarative statement:

>>> Combine the question/answer pair into a single declarative statement:
Q: How does Burke want to confirm the results after talking to the doctors?
A: Burke wants to confirm the results in person.
<<< Burke wants to confirm the results in person.

In contrast, if the answer is written as a sentence fragment, the model includes all the relevant information:

>>> Combine the question/answer pair into a single declarative statement:
Q: How does Burke want to confirm the results after talking to the doctors?
A: In person.
<<< Burke wants to confirm the results in person after talking to the doctors.

In-context learning

In an attempt to correct this, FLAN-T5-XL was then given the query with example inputs and outputs prepended. This produced the correct output for the difficult query above:

>>> Combine the question/answer pair into a single declarative statement:
Q: Why is Castle vexed after he reads the note?
A: Castle believed he will see blood in the kitchen.
Castle is vexed after he reads the note because Castle believed he will see blood in the kitchen.

Combine the question/answer pair into a single declarative statement:
Q: How does Burke want to confirm the results after talking to the doctors?
A: Burke wants to confirm the results in person.

<<< Burke wants to confirm the results in person after talking to the doctors.
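A small sketch of how the two-example ICL prompt format above could be assembled; build_icl_prompt is a hypothetical helper, and the example texts are copied from this thread.

```python
INSTRUCTION = "Combine the question/answer pair into a single declarative statement:"

# In-context examples taken verbatim from this thread.
ICL_EXAMPLES = [
    ("Why is Castle vexed after he reads the note?",
     "Castle believed he will see blood in the kitchen.",
     "Castle is vexed after he reads the note because Castle believed he will see blood in the kitchen."),
    ("What is Robin holding in her hand when she is talking to Ted about Zoey?",
     "A beer bottle",
     "Robin is holding a beer bottle in her hand when she is talking to Ted about Zoey."),
]

def build_icl_prompt(question: str, answer: str) -> str:
    blocks = [f"{INSTRUCTION}\nQ: {q}\nA: {a}\n{h}" for q, a, h in ICL_EXAMPLES]
    blocks.append(f"{INSTRUCTION}\nQ: {question}\nA: {answer}\n")
    return "\n\n".join(blocks)
```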

Adversarial inputs

In TVQA, sometimes the questions and answers are not written in grammatically correct English, such as

Q: How Burke wants to confirm the resulst after end talking to the doctors?
A: Burke wants to confirm the results in person.

The example question and answer used above were prepended to this input, but in this case the model still produced the original problematic output:

<<< Burke wants to confirm the results in person.

To address this, a second example input was prepended to the grammatically incorrect query:

>>> Combine the question/answer pair into a single declarative statement:
Q: Why is Castle vexed after he reads the note?
A: Castle believed he will see blood in the kitchen.
Castle is vexed after he reads the note because Castle believed he will see blood in the kitchen.

Combine the question/answer pair into a single declarative statement:
Q: What is Robin holding in her hand when she is talking to Ted about Zoey?
A: A beer bottle
Robin is holding a beer bottle in her hand when she is talking to Ted about Zoey.

Combine the question/answer pair into a single declarative statement:
Q: How Burke wants to confirm the resulst after end talking to the doctors?
A: Burke wants to confirm the results in person.

But the output was the same. Out of curiosity, I tried the following input, again with the same examples prepended:

>>> Combine the question/answer pair into a single declarative statement:
Q: How Burke wants to confirm the resulst after end talking to the doctors?
A: In person
<<< Burke wants to confirm the resulst after end talking to the doctors in person

I'm not sure whether this output could be reliably interpreted by a downstream language model. Furthermore, it may be impossible to get FLAN-T5 to rewrite inputs in grammatically correct English when the input itself is difficult to read. It is probably best to ignore questions written in poor English and focus on the ones that are legible and FLAN-interpretable.

nweir127 commented 1 year ago

I think the LM will be OK even with a typo in the output. You could try a correction ICL prompt with (original, corrected) pairs if you really want, but for now it is fine to have the typo in there. The entailment model probably won't care too much. (You can always flag this particular example and track the reasoning system's behavior on it later down the road as you add modules; see if it gets tripped up just because of the typo, and if it does we can revisit.)

katesanders9 commented 1 year ago

Data preprocessing notes

TVQA has 122,039 training questions and 15,253 validation questions. 74,032 training questions and 9,321 validation questions have the temporal keyword "when" (not at the start of the question). I'd like to isolate these questions because for them it is more likely that the answer evidence is directly located in the timeframe specified by the question's temporal clause.

The number of dialogue lines in the answer timeframe for each "when" question is:

min: 0.0
max: 43.0
median: 3.0
mean: 3.92

This results in an average of 4.92 dialogue "chunks" per question, or ~410,152 dialogue chunks in total for "when" questions in the dataset.
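A rough sketch of the filtering and counting described above; file paths and field names ("q", "ts", "vid_name", "start", "end") are assumptions about the TVQA release rather than verified against the actual files.

```python
import json
import statistics

def is_when_question(question: str) -> bool:
    tokens = question.lower().rstrip("?").split()
    return "when" in tokens[1:]  # keyword present, but not at the start

def count_lines_in_timeframe(subtitles, start, end):
    return sum(1 for s in subtitles if s["end"] >= start and s["start"] <= end)

with open("tvqa_train.jsonl") as f:          # hypothetical path
    questions = [json.loads(line) for line in f]
with open("tvqa_subtitles.json") as f:       # hypothetical path; vid_name -> list of lines
    subtitles_by_clip = json.load(f)

when_qs = [q for q in questions if is_when_question(q["q"])]
counts = []
for q in when_qs:
    start, end = (float(x) for x in q["ts"].split("-"))  # assumed "start-end" timestamp format
    counts.append(count_lines_in_timeframe(subtitles_by_clip[q["vid_name"]], start, end))

print(f"n={len(counts)} min={min(counts)} max={max(counts)} "
      f"median={statistics.median(counts)} mean={statistics.mean(counts):.2f}")
```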

Data example

Example "when" QA pair:

Q: Who does the headmaster suggest Castle and Beckett talk to when they're inquiring about Donny?
A: The headmaster suggests Castle and Beckett talk to Donny's friends.

and the corresponding dialogue:

(Headmaster:) The truth is, all these kids are icebergs. We only see the tip.
(Headmaster:) If you want to know the rest, you should probably talk to his friends.
(Headmaster:) Amanda, Romy, Brandon, Spencer and Max.

As shown above, it might eventually be useful to include additional dialogue surrounding the explicitly annotated time segment. In this example, that dialogue would include:

(Headmaster:) The family had been very generous in the past,
(Headmaster:) and Donny was one of our brightest. We thought he'd do great things.
(Castle:) Any idea what he would have been doing at Central Park at night?

<explicitly annotated dialogue section>

(Headmaster:) It's strange seeing them without Donny.
(Beckett:) Thank you. Thanks.

However, for now, using this extended dialogue sample as SBERT training data might result in poorer embeddings, since there is less certainty that each dialogue line is relevant to the question-answer pair.

SBERT training data

For fine-tuning SBERT for the FAISS search index, it might make sense to start with about 10,000 dialogue chunks for maybe <10 epochs. It might be good to sample this data such that the number of data points per TV show is proportional to the distribution of all the questions. This training dataset will consist of approx. 2,032 questions that need to be converted into hypotheses (a sampling sketch follows the table):

Show               Distribution    # of Qs in SBERT training
Big Bang Theory    0.193           392
Friends            0.245           498
HIMYM              0.070           142
Grey's Anatomy     0.065           132
House              0.212           431
Castle             0.216           439
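A minimal sketch of the proportional sampling step; the show labels are taken from the table, but the actual show-name values in the TVQA annotations may differ, and sample_by_show is a hypothetical helper.

```python
import random

# Per-show target counts from the table above.
TARGET_COUNTS = {
    "Big Bang Theory": 392, "Friends": 498, "HIMYM": 142,
    "Grey's Anatomy": 132, "House": 431, "Castle": 439,
}

def sample_by_show(when_questions, show_field="show_name", seed=0):
    rng = random.Random(seed)
    sampled = []
    for show, n in TARGET_COUNTS.items():
        pool = [q for q in when_questions if q[show_field] == show]
        sampled.extend(rng.sample(pool, min(n, len(pool))))
    return sampled
```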

SBERT is initially loaded as the all-mpnet-base-v2 model, and the text chunk mappings are made between hypotheses and their corresponding dialogue lines. Initial training parameters are set at 5 epochs and a batch size of 32. Triplet loss is used.
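A sketch of this setup with the sentence-transformers API (all-mpnet-base-v2, triplet loss, 5 epochs, batch size 32). The triplet construction (anchor = hypothesis, positive = dialogue line from its timeframe, negative = a line from elsewhere) is an assumption about how the hypothesis-to-dialogue mappings are used, and the example triplet below is illustrative only.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")

# Illustrative triplet only; real triplets come from the hypothesis/dialogue mappings.
triplets = [
    ("The headmaster suggests Castle and Beckett talk to Donny's friends.",      # anchor (hypothesis)
     "If you want to know the rest, you should probably talk to his friends.",   # positive dialogue line
     "Raj is playing guitar when Raj and Howard have their show."),              # negative (unrelated text)
]
train_examples = [InputExample(texts=[a, p, n]) for a, p, n in triplets]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5)
model.save("sbert-tvqa-dialogue")  # hypothetical output path
```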

File locations:

SBERT QA data: /srv/local2/ksande25/NS_data/TVQA/sbert_hypothesis_data.jsonl
SBERT hypothesis data: /srv/local2/ksande25/NS_data/TVQA/h_{}.jsonl
SBERT dialogue data: /srv/local2/ksande25/NS_data/TVQA/sbert_dialogue_data.jsonl
Indexing dialogue data: /srv/local2/ksande25/NS_data/TVQA/dialogue_data.jsonl

The actual number of questions is 2,034.
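For the retrieval index itself, a hedged sketch of encoding the dialogue chunks with the fine-tuned model and building a FAISS index; the "text" field inside dialogue_data.jsonl and the saved model path are assumptions.

```python
import json
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbert-tvqa-dialogue")  # hypothetical fine-tuned model path

with open("/srv/local2/ksande25/NS_data/TVQA/dialogue_data.jsonl") as f:
    chunks = [json.loads(line) for line in f]

embeddings = model.encode([c["text"] for c in chunks], batch_size=128,
                          convert_to_numpy=True, show_progress_bar=True)
faiss.normalize_L2(embeddings)                  # so inner product == cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "dialogue.index")

# Retrieval: embed a generated hypothesis and look up the closest dialogue chunks.
query = model.encode(["Wilson was breaking into the kitchen when House found him."],
                     convert_to_numpy=True)
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
```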

Hypothesis generation compute time

With a batch size of 128, two ICL examples per query, and 1 GPU, FLAN-T5-XL seems to be able to generate approximately 1,240 hypotheses per hour.

nweir127 commented 1 year ago

You can reduce compute time for hypothesis generation by chunking the data and using multiple GPUs (initialize a different FLAN instance on each).
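A rough sketch of that suggestion (one FLAN-T5-XL instance per GPU, each handling its own chunk of prompts); the GPU count, batch size, and prompt list are placeholders.

```python
import torch.multiprocessing as mp
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def run_chunk(rank, prompts, out_queue):
    device = f"cuda:{rank}"
    tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl").to(device)
    results = []
    for i in range(0, len(prompts), 128):
        batch = prompts[i:i + 128]
        inputs = tok(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        out = model.generate(**inputs, max_new_tokens=64)
        results.extend(tok.batch_decode(out, skip_special_tokens=True))
    out_queue.put((rank, results))

if __name__ == "__main__":
    n_gpus = 2                       # placeholder
    prompts = [                      # in practice, the full list of ICL prompts
        "Combine the question/answer pair into a single declarative statement:\n"
        "Q: What room was Wilson breaking into when House found him?\nA: The kitchen."
    ]
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    chunks = [prompts[i::n_gpus] for i in range(n_gpus)]
    procs = [ctx.Process(target=run_chunk, args=(r, chunks[r], queue)) for r in range(n_gpus)]
    for p in procs:
        p.start()
    results_by_rank = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
```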

