Open katesanders9 opened 1 year ago
Presently, the architectures are limited to SBERT (for cosine-similarity analysis) and CrossEncoder classifiers, both from the SentenceTransformers package.
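A minimal sketch of the cosine-similarity filtering step, just to pin down what the SBERT filter would compute. In practice the vectors would come from SBERT (e.g. `model.encode(...)` in SentenceTransformers); here plain NumPy arrays stand in, and the function name is hypothetical.

```python
import numpy as np

def rank_dialogue_lines(question_vec, line_vecs, top_k=3):
    """Rank dialogue-line embeddings by cosine similarity to a question embedding.

    question_vec: (d,) array, e.g. SBERT's model.encode(question)
    line_vecs:    (n, d) array, one embedding per dialogue line
    Returns (indices, scores) for the top_k most similar lines.
    """
    q = question_vec / np.linalg.norm(question_vec)
    lines = line_vecs / np.linalg.norm(line_vecs, axis=1, keepdims=True)
    scores = lines @ q                      # cosine similarity per line
    order = np.argsort(-scores)[:top_k]     # highest-scoring lines first
    return order.tolist(), scores[order].tolist()
```

A CrossEncoder filter would differ only in scoring: it would call `CrossEncoder.predict` on (question, line) pairs instead of comparing precomputed embeddings.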
Note: Here is a paper on transforming QA datasets into NLI datasets.
Ideally, the datasets used to train the filters will follow the general paradigm of (question, evidence dialogue) pairs corresponding to a larger document of dialogue exchanges. However, few datasets fall into this category. Relevant QA datasets are listed below:
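For concreteness, one (question, evidence dialogue) training pair might look like the following. The field names are hypothetical and not taken from any of the datasets listed here.

```python
# Hypothetical schema for a single (question, evidence dialogue) training pair.
example_pair = {
    "question": "Why is the character leaving?",
    "document": [                 # the larger dialogue exchange
        "A: I can't stay here anymore.",
        "B: Where will you go?",
        "A: Anywhere but this town.",
    ],
    "evidence": [0],              # indices of lines that answer the question
}
```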
Note: The domain shift between SQuAD, QuAC, and CoQA is notable, so code to convert data between the three formats has been published.
Note: Here's another repository that converts data between several more of the above datasets.
Other datasets are dialogue-centric but, instead of including QA pairs answerable by a specific line of dialogue, annotate the dialogue lines themselves for various attributes. These could feasibly be preprocessed with T5 or a similar model to turn them into dialogue-centric QA datasets.
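A rough sketch of that preprocessing, assuming a hypothetical dataset where each dialogue line carries attribute annotations. Templates stand in for the model here; a T5-style text2text model could generate more natural questions from the same inputs.

```python
def dialogue_to_qa(lines):
    """Turn attribute-annotated dialogue lines into (question, evidence) pairs.

    lines: list of dicts like
        {"speaker": "A", "text": "...", "attrs": {"emotion": "frustration"}}
    (a hypothetical annotation format). Each annotated attribute becomes one
    templated question whose answer is the line that carries the annotation.
    """
    qa = []
    for i, item in enumerate(lines):
        for attr, value in item.get("attrs", {}).items():
            qa.append({
                "question": f"Which dialogue line expresses {attr} '{value}'?",
                "answer_line": i,   # index of the evidence line
                "evidence": f"{item['speaker']}: {item['text']}",
            })
    return qa
```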
Overview
Goal: Write a program that generates entailment trees for TVQA using dialogue only, along with evaluation scripts to assess its performance.
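One possible representation for the generated trees, purely as a sketch: each node holds a natural-language claim, leaves point at supporting dialogue lines, and an internal node is entailed by its children. Nothing here is fixed by the issue; the names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EntailmentNode:
    claim: str                            # natural-language statement
    children: List["EntailmentNode"] = field(default_factory=list)
    dialogue_line: Optional[int] = None   # set on leaves: index into the dialogue

    def is_leaf(self):
        return not self.children

def leaf_lines(node):
    """Collect the dialogue-line indices that support the root claim."""
    if node.is_leaf():
        return [node.dialogue_line] if node.dialogue_line is not None else []
    out = []
    for child in node.children:
        out.extend(leaf_lines(child))
    return out
```

An evaluation script could then compare `leaf_lines(root)` against gold evidence-line annotations, alongside any structural scoring of the tree itself.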
Progress
Filters
Search
Evaluation
TBD