Closed WenzhengZhang closed 1 year ago
Sure thing. Which task(s) specifically are you interested in first? I've already migrated HotPotQA for other people, from the original internal DSP code we used for the paper to the released DSPv1 framework, so it's going to be the easiest.
We are making a v2 release of the framework in a little bit but I can make time for a separate branch with original v1 paper reproduction.
Thanks for your reply! I'm interested in the HotPotQA task first.
Okay, I've shared the following with another person who'd asked a couple of months ago.
The main thing not discussed below is that you need a ColBERTv2 index over the Wikipedia 2017 abstracts dataset from HotPotQA. Do you have this data/index? (You can't use the 2018 full index we're hosting if you want to do a reproduction; it's quite different. You can use it for some quick testing but expect different results if you use a different index.)
Also note that this isn't identical to the paper's runs, because the framework (v1) evolved a lot past what's discussed in the Dec 2022 paper. Lastly, you'll need to make sure to run this from v1 code even after we release v2 (later today!).
We'll maintain v1 on the v1 branch on GitHub.
1) Importing and setting up.
import dsp
davinci_002 = dsp.GPT3(model='text-davinci-002')
rm = dsp.ColBERTv2(port=2017, post_requests=True) # my colbert index of wiki 2017 abstracts
dsp.settings.configure(lm=davinci_002, rm=rm)
2) Preparing the dataset. Please download the official HotPotQA dataset and put the path below.
import random
from dsp import Example
class Dataset:
    def __init__(self, train_seed=0, train_size=None, eval_seed=0, dev_size=None, test_size=None):
        self.train_size = train_size
        self.train_seed = train_seed
        self.dev_size = dev_size
        self.dev_seed = eval_seed
        self.test_size = test_size
        self.test_seed = eval_seed
        self.name = self.__class__.__name__
    @property
    def train(self):
        if not hasattr(self, '_train_'):
            self._train_ = self._shuffle_and_sample('train', self._train, self.train_size, self.train_seed)
        return self._train_
    @property
    def dev(self):
        if not hasattr(self, '_dev_'):
            self._dev_ = self._shuffle_and_sample('dev', self._dev, self.dev_size, self.dev_seed)
        return self._dev_
    @property
    def test(self):
        if not hasattr(self, '_test_'):
            self._test_ = self._shuffle_and_sample('test', self._test, self.test_size, self.test_seed)
        return self._test_
    def _shuffle_and_sample(self, split, data, size, seed=0):
        '''
        The setting (seed=s, size=N) is always a subset
        of the setting (seed=s, size=M) for N < M.
        '''
        data = list(data)
        # Shuffle the data irrespective of the requested size.
        base_rng = random.Random(seed)
        base_rng.shuffle(data)
        data = data[:size]
        output = []
        for example in data:
            output.append(Example(**example, split=split))
        # rng = random.Random(seed)
        # rng.shuffle(data)
        return output
import os
import ujson
hotpotqa_root = '/future/u/okhattab/data/HotPotQA'
class HotPotQA(Dataset):
    def __init__(self, *args, only_hard_examples=True, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        assert only_hard_examples
        with open(os.path.join(hotpotqa_root, 'raw/hotpot_train_v1.1.json')) as g:
            hotpotqa_raw = ujson.load(g)
        official_train = []
        for qid, raw_example in enumerate(hotpotqa_raw):
            if raw_example['level'] == 'hard':
                official_train.append({'qid': qid, 'question': raw_example['question'], 'answers': [raw_example['answer']]})
        rng = random.Random(0)
        rng.shuffle(official_train)
        self._train = official_train[:len(official_train)*90//100]
        self._dev = official_train[len(official_train)*90//100:]
        self._test = []
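The docstring's subset property and the 90/10 split can be checked without dsp or the HotPotQA files. The sketch below is only an illustration: `shuffle_and_sample` is a standalone copy of the method's logic, and `pool` is a made-up stand-in for the list of hard training examples.

```python
import random

def shuffle_and_sample(data, size, seed=0):
    """Shuffle a copy of data with a fixed seed, then keep the first `size` items."""
    data = list(data)
    random.Random(seed).shuffle(data)
    return data[:size]  # size=None keeps the whole shuffled list

pool = list(range(100))  # hypothetical stand-in for the hard examples

small = shuffle_and_sample(pool, size=16, seed=1)
large = shuffle_and_sample(pool, size=50, seed=1)

# With the same seed, the smaller sample is a prefix of the larger one.
assert large[:16] == small

# The 90/10 split uses integer arithmetic on the same cut point,
# so the train and dev parts are disjoint and cover the whole pool.
cut = len(pool) * 90 // 100
train_part, dev_part = pool[:cut], pool[cut:]
assert len(train_part) == 90 and len(dev_part) == 10
```

Because the shuffle happens before truncation, growing `train_size` or `dev_size` later only appends examples to the ones already used.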
3) Further prepare the data seeds.
from dsp.utils import dotdict, EM, F1
examples_per_seed = 200
num_seeds = 5
data_args = dotdict(train_seed=1, train_size=16, eval_seed=2023, dev_size=examples_per_seed * num_seeds, test_size=0)
dataset = HotPotQA(**data_args)
eval_set = dataset.dev
eval_offset = 0
eval_sets = []
train_sets = []
for train_seed in range(1, 1+num_seeds):
    data_args.train_seed = train_seed
    dataset = HotPotQA(**data_args)
    train_set = dataset.train
    eval_sets.append(eval_set[eval_offset:eval_offset+examples_per_seed])
    train_sets.append(train_set)
    assert len(eval_sets[-1]) == examples_per_seed, len(eval_sets[-1])
    assert len(train_sets[-1]) == 16
    eval_offset += examples_per_seed
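In other words, the dev sample is drawn once (with eval_seed=2023) and cut into five consecutive, non-overlapping slices of 200, while each slice is paired with a different 16-example training sample (train seeds 1 to 5). A minimal sketch of just the slicing, using integers as a stand-in for the dev examples:

```python
examples_per_seed = 200
num_seeds = 5

# Stand-in for dataset.dev: 1000 placeholder items instead of real examples.
eval_set = list(range(examples_per_seed * num_seeds))

eval_sets = [
    eval_set[i * examples_per_seed:(i + 1) * examples_per_seed]
    for i in range(num_seeds)
]

# Every slice has 200 items and no item appears in two slices.
assert all(len(s) == examples_per_seed for s in eval_sets)
assert len({x for s in eval_sets for x in s}) == len(eval_set)
```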
4) Check that the data matches ours.
eval_sets[3][-2]
# Should print:
# {'qid': 41087, 'question': 'Which one of the Beatles died before the release of Somewhere in England in 1981?', 'answers': ['John Lennon'], 'split': 'dev'}
5) Use the following program. This isn't identical to the templates used in the DSP program from the paper but it's very close. It's basically the multi-hop DSP program from the compiler notebook. I checked it gets 50.3% EM. First, the Predict stage.
Question = dsp.Type(prefix="Question:", desc="${the question to be answered}")
Context = dsp.Type(prefix="Context:\n", desc="${sources that may contain relevant content}", format=dsp.passages2text)
Answer = dsp.Type(prefix="Answer:", desc="${a short factoid answer, often between 1 word and 5 words}", format=dsp.format_answers)
Rationale = dsp.Type(prefix="Rationale: Let's think step by step.", desc="${reasoning}")
qa_template = dsp.Template(instructions="Answer questions with short factoid answers.", question=Question(), answer=Answer())
qa_template_with_CoT = dsp.Template(instructions=qa_template.instructions, context=Context(), question=Question(), rationale=Rationale(), answer=Answer())
@dsp.transformation
def QA_predict(example: dsp.Example, CoT=False, num_preds=1) -> dsp.Example:
    template = qa_template_with_CoT
    if num_preds > 1:
        assert CoT, CoT
        example, completions = dsp.generate(template, n=num_preds, temperature=0.7)(example, stage='qa')
        example.completions = completions
        completions = dsp.majority(completions)
    else:
        example, completions = dsp.generate(template)(example, stage='qa')
    return example.copy(answer=completions.answer)
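When num_preds > 1, the stage samples several chain-of-thought completions at temperature 0.7 and keeps the majority answer. The sketch below shows the voting idea in plain Python. It is not dsp.majority itself: the normalization (strip and lowercase) and the tie-breaking rule are assumptions made for the illustration, and `majority_answer` is a hypothetical name.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most frequent answer after light normalization.

    Ties go to the answer that was sampled first. Returns None for no input.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    for original, norm in zip(answers, normalized):
        if counts[norm] == best:
            return original

# Four sampled answers; three agree once case and whitespace are ignored.
sampled = ["John Lennon", "john lennon", "George Harrison", "John Lennon "]
print(majority_answer(sampled))
```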
6) Then Search.
SearchQuery = dsp.Type(prefix="Search Query:", desc="${a simple question for seeking missing information}")
SearchRationale = dsp.Type(prefix="Rationale: Let's think step by step. To answer this question, we first need to find out", desc="${the missing information}")
CondenseRationale = dsp.Type(prefix="Rationale: Let's think step by step. Based on the context, we have learned the following.", desc="${information from the context that provides useful clues}")
rewrite_template = dsp.Template(instructions="Write a search query that will help answer a complex question.", question=Question(), rationale=SearchRationale(), query=SearchQuery())
hop_template = dsp.Template(instructions=rewrite_template.instructions, context=Context(), question=Question(), rationale=CondenseRationale(), query=SearchQuery())
from dsp.utils import deduplicate
@dsp.transformation
def multihop_search(example: dsp.Example, max_hops=2, num_queries=10, k=7) -> dsp.Example:
    example.background = []
    example.context = []
    for hop in range(max_hops):
        # Generate queries
        template = rewrite_template if hop == 0 else hop_template
        if num_queries == 1:
            example, completions = dsp.generate(template)(example, stage=f'h{hop}')
            passages = dsp.retrieve(completions.query, k=k)
        else:
            num_queries = int(num_queries)
            temperature = 0.7 if num_queries > 1 else 0.0
            example, completions = dsp.generate(template, n=num_queries, temperature=temperature)(example, stage=f'h{hop}')
            queries = [c.query for c in completions] + [example.question]
            passages = dsp.retrieveEnsemble(queries, k=k)
        # Arrange the passages for the next hop
        example.context = ([completions[0].rationale] if hop > 0 else []) + example.background + passages
        example.context = deduplicate(example.context)[:k]
        example.background = deduplicate(example.background + passages)[:(hop+1)]
    return example
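The bookkeeping at the end of each hop is easier to follow on toy data. The sketch below replays only the "arrange the passages" lines with strings in place of retrieved passages. It assumes deduplicate keeps the first occurrence of each item in order; `arrange` is a hypothetical helper that does not exist in dsp.

```python
def deduplicate(items):
    """Drop repeats while keeping first-seen order (assumed behavior)."""
    return list(dict.fromkeys(items))

def arrange(background, passages, hop, k, rationale=None):
    """Replay the context/background update done at the end of one hop."""
    context = ([rationale] if hop > 0 else []) + background + passages
    context = deduplicate(context)[:k]
    background = deduplicate(background + passages)[:hop + 1]
    return context, background

# Hop 0: the context is just the retrieved passages; background keeps the top one.
context, background = arrange([], ["p1", "p2", "p3"], hop=0, k=7)
assert context == ["p1", "p2", "p3"] and background == ["p1"]

# Hop 1: the condensed rationale goes first, then background, then new passages.
context, background = arrange(background, ["p2", "p1", "p4"], hop=1, k=7, rationale="r")
assert context == ["r", "p1", "p2", "p4"] and background == ["p1", "p2"]
```

So the background grows by at most one passage per hop, and the final context handed to the Predict stage is capped at k items.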
7) Then Demonstrate, together with the full program.
def multihop_QA_v3(x, num_queries=1, num_preds=1, num_passages=7, demos_=None):
    demos = dsp.sample(x.train, k=16)
    @dsp.transformation
    def attempt(d: dsp.Example) -> dsp.Example:
        x = dsp.Example(question=d.question, demos=dsp.all_but(demos, d))
        x = multihop_search(x, num_queries=1, k=min(4, num_passages))
        if not dsp.passage_match(x.context, d.answer): return None
        x = QA_predict(x, CoT=True, num_preds=1)
        if not dsp.answer_match(x.answer, d.answer): return None
        return d.copy(**x)
    x.demos = dsp.annotate(attempt)(demos, k=3, return_all=True) if demos_ is None else demos_
    x = multihop_search(x, num_queries=num_queries, k=num_passages)
    x = QA_predict(x, CoT=True, num_preds=num_preds)
    return x
8) Then evaluation: 5x200 examples.
min_seed_to_run = 0
max_seeds_to_run = 5
min_examples_per_seed = 0
max_examples_per_seed = 200
finals = {}
def base_n10_demos(x):
    return multihop_QA_v3(x, num_queries=10, num_preds=20, num_passages=7)
program = base_n10_demos
total = 0
correct = 0
for seed_idx in range(min_seed_to_run, max_seeds_to_run):
    this_train_set, this_eval_set = train_sets[seed_idx], eval_sets[seed_idx][min_examples_per_seed:max_examples_per_seed]
    this_train_set = [dsp.Example(qid=x['qid'], question=x['question'], answer=x['answers']) for x in this_train_set]
    this_eval_set = [dsp.Example(qid=x['qid'], question=x['question'], answer=x['answers']) for x in this_eval_set]
    # Evaluate here with this training set and this eval set.
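The loop above leaves the evaluation step as a comment. One way to fill it in is sketched below as standalone Python. It is an assumption-laden illustration, not the code behind the reported numbers: `normalize` follows the common SQuAD-style recipe rather than calling dsp.utils.EM, `evaluate` and `toy_program` are hypothetical names, and the toy program returns a fixed string instead of running the pipeline.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def evaluate(program, eval_set):
    """Run program on every example and return (correct, total)."""
    correct = 0
    for example in eval_set:
        prediction = program(example)
        correct += exact_match(prediction, example["answers"])
    return correct, len(eval_set)

def toy_program(example):
    # Hypothetical stand-in for the real pipeline: always gives the same answer.
    return "The John Lennon."

toy_eval = [
    {"question": "Which one of the Beatles died before the release of Somewhere in England in 1981?",
     "answers": ["John Lennon"]},
    {"question": "A made-up second question", "answers": ["Paul McCartney"]},
]

correct, total = evaluate(toy_program, toy_eval)
print(f"EM: {100 * correct / total:.1f}%")
```

In the real loop, the counts from each seed would be added to the running `total` and `correct` so that the final score covers all evaluated examples.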
9) Before you run on 5x200 examples, which can be expensive, please run with max_seeds_to_run=1 and max_examples_per_seed=50. This runs only the first 50 of the 1000 examples. The EM score we get on this tiny subset is 36%. The overall score (on all 1000 examples) is 50%.
Thank you so much for providing the details! I don't have a ColBERTv2 index over the Wikipedia 2017 abstracts. I'll use your hosted index for a quick test first.
Sure, but FYI the 2018 passages are considerably longer and otherwise misaligned with HotPotQA (in terms of date and scope). Don't rely on them too much.
Sorry, I ran this code, but it raises these errors:
AttributeError: module 'dsp' has no attribute 'annotate'
AttributeError: module 'dsp' has no attribute 'transformation'
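Given the earlier note that this code must run from the v1 branch, one thing worth ruling out is that Python is importing a different dsp than intended. This is only a guess at the cause. The sketch below reports where a module was loaded from and which of the attributes used in this thread it lacks; `missing_attributes` is a hypothetical helper, and the `REQUIRED` list is taken from the calls in the code above.

```python
import importlib

# Attributes of the dsp module that the code in this thread calls.
REQUIRED = ["transformation", "annotate", "generate", "retrieve",
            "retrieveEnsemble", "sample", "majority"]

def missing_attributes(module_name, names):
    """Return (file path of the imported module, names it does not define)."""
    module = importlib.import_module(module_name)
    path = getattr(module, "__file__", None)
    return path, [n for n in names if not hasattr(module, n)]

try:
    path, missing = missing_attributes("dsp", REQUIRED)
    print("dsp imported from:", path)
    print("missing attributes:", missing or "none")
except ImportError:
    print("dsp is not importable in this environment")
```

If the printed path points somewhere other than a checkout of the v1 branch, or the missing list is non-empty, the environment is probably not running the v1 code.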
Hi, thanks for the amazing work! Could you please share the data you used in the paper and provide more details about how to replicate the results in the paper if possible?