stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

How to replicate the results in the paper #85

Closed: WenzhengZhang closed this issue 1 year ago

WenzhengZhang commented 1 year ago

Hi, thanks for the amazing work! Could you please share the data you used in the paper and provide more details about how to replicate the results in the paper if possible?

okhat commented 1 year ago

Sure thing, which task(s) specifically are you interested in first? I'd already migrated HotPotQA from our original internal DSP (the code we used for the paper) to the released DSP v1 framework for other people, so it's going to be the easiest.

We are making a v2 release of the framework in a little bit, but I can make time for a separate branch with the original v1 paper reproduction.

WenzhengZhang commented 1 year ago

> Sure thing, which task(s) specifically are you interested in first? I'd already migrated HotPotQA for other people so it's going to be the easiest.
>
> We are making a v2 release of the framework in a little bit but I can make time for a separate branch with original v1 paper reproduction.

Thanks for your reply! I'm interested in the HotPotQA task first.

okhat commented 1 year ago

Okay, I've shared the following with another person who'd asked a couple of months ago.

The main thing not discussed below is that you need a ColBERTv2 index over the Wikipedia 2017 abstracts dataset from HotPotQA. Do you have this data/index? (You can't use the 2018 full index we're hosting if you want to do a reproduction; it's quite different. You can use it for some quick testing but expect different results if you use a different index.)

Also note that this isn't identical to the paper's runs, because the framework (v1) evolved a lot past what's discussed in the Dec 2022 paper. Lastly, you'll need to make sure to run this from v1 code even after we release v2 (later today!).

We'll maintain v1 on the v1 branch on GitHub.

1) Importing and setting up.

import dsp

davinci_002 = dsp.GPT3(model='text-davinci-002')
rm = dsp.ColBERTv2(port=2017, post_requests=True) # my colbert index of wiki 2017 abstracts
dsp.settings.configure(lm=davinci_002, rm=rm)
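
(Optional) A quick, hedged sanity check that the ColBERTv2 server is reachable before any long run; dsp.retrieve is the same call the program uses later, and the question here is just a placeholder:

# Hedged sanity check: confirm the retrieval server answers.
passages = dsp.retrieve("Who wrote Hamlet?", k=1)
print(passages[0])  # expect one Wikipedia abstract passage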

2) Preparing the dataset. Please download the official HotPotQA dataset and put the path below.
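
If you don't yet have the file, here is a hedged sketch of fetching it (the URL is the official HotPotQA training-set link at the time of writing; adjust the target path to whatever you set hotpotqa_root to below):

import os
import urllib.request

target = '/path/to/HotPotQA/raw/hotpot_train_v1.1.json'  # match hotpotqa_root below
os.makedirs(os.path.dirname(target), exist_ok=True)
urllib.request.urlretrieve('http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json', target)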

import random
from dsp import Example

class Dataset:
    def __init__(self, train_seed=0, train_size=None, eval_seed=0, dev_size=None, test_size=None):
        self.train_size = train_size
        self.train_seed = train_seed
        self.dev_size = dev_size
        self.dev_seed = eval_seed
        self.test_size = test_size
        self.test_seed = eval_seed

        self.name = self.__class__.__name__

    @property
    def train(self):
        if not hasattr(self, '_train_'):
            self._train_ = self._shuffle_and_sample('train', self._train, self.train_size, self.train_seed)

        return self._train_

    @property
    def dev(self):
        if not hasattr(self, '_dev_'):
            self._dev_ = self._shuffle_and_sample('dev', self._dev, self.dev_size, self.dev_seed)

        return self._dev_

    @property
    def test(self):
        if not hasattr(self, '_test_'):
            self._test_ = self._shuffle_and_sample('test', self._test, self.test_size, self.test_seed)

        return self._test_

    def _shuffle_and_sample(self, split, data, size, seed=0):
        '''
            The setting (seed=s, size=N) is always a subset
            of the setting (seed=s, size=M) for N < M.
        '''

        data = list(data)

        # Shuffle the data irrespective of the requested size.
        base_rng = random.Random(seed)
        base_rng.shuffle(data)

        data = data[:size]
        output = []

        for example in data:
            output.append(Example(**example, split=split))

        return output

import os
import ujson

hotpotqa_root = '/future/u/okhattab/data/HotPotQA'  # set this to your local HotPotQA directory

class HotPotQA(Dataset):
    def __init__(self, *args, only_hard_examples=True, **kwargs) -> None:
        super().__init__(*args, **kwargs)

        assert only_hard_examples

        with open(os.path.join(hotpotqa_root, 'raw/hotpot_train_v1.1.json')) as g:
            hotpotqa_raw = ujson.load(g)

        official_train = []
        for qid, raw_example in enumerate(hotpotqa_raw):
            if raw_example['level'] == 'hard':
                official_train.append({'qid': qid, 'question': raw_example['question'], 'answers': [raw_example['answer']]})

        rng = random.Random(0)
        rng.shuffle(official_train)

        self._train = official_train[:len(official_train)*90//100]
        self._dev = official_train[len(official_train)*90//100:]

        self._test = []
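
As an aside, the subset invariant documented in _shuffle_and_sample is easy to verify with a toy subclass (_Toy below is hypothetical, purely for illustration):

# Hypothetical toy dataset: with a fixed seed, a smaller sample is always
# a prefix of a larger one.
class _Toy(Dataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._train = [{'qid': i, 'question': f'q{i}', 'answers': [str(i)]} for i in range(10)]
        self._dev, self._test = [], []

small = _Toy(train_seed=0, train_size=3).train
large = _Toy(train_seed=0, train_size=6).train
assert [e.qid for e in small] == [e.qid for e in large][:3]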

3) Further prepare the data seeds.

from dsp.utils import dotdict, EM, F1

examples_per_seed = 200
num_seeds = 5

data_args = dotdict(train_seed=1, train_size=16, eval_seed=2023, dev_size=examples_per_seed * num_seeds, test_size=0)
dataset = HotPotQA(**data_args)
eval_set = dataset.dev
eval_offset = 0

eval_sets = []
train_sets = []

for train_seed in range(1, 1+num_seeds):
    data_args.train_seed = train_seed
    dataset = HotPotQA(**data_args)
    train_set = dataset.train

    eval_sets.append(eval_set[eval_offset:eval_offset+examples_per_seed])
    train_sets.append(train_set)

    assert len(eval_sets[-1]) == examples_per_seed, len(eval_sets[-1])
    assert len(train_sets[-1]) == 16

    eval_offset += examples_per_seed

4) Check that the data matches ours.

eval_sets[3][-2]

# Should print:
# {'qid': 41087,  'question': 'Which one of the Beatles died before the release of Somewhere in England in 1981?',  'answers': ['John Lennon'],  'split': 'dev'}

5) Use the following program. This isn't identical to the templates used in the DSP program from the paper, but it's very close: it's essentially the multi-hop DSP program from the compiler notebook. I checked that it gets 50.3% EM. First, the Predict stage.

Question = dsp.Type(prefix="Question:", desc="${the question to be answered}")
Context = dsp.Type(prefix="Context:\n", desc="${sources that may contain relevant content}", format=dsp.passages2text)

Answer = dsp.Type(prefix="Answer:", desc="${a short factoid answer, often between 1 word and 5 words}", format=dsp.format_answers)
Rationale = dsp.Type(prefix="Rationale: Let's think step by step.", desc="${reasoning}")

qa_template = dsp.Template(instructions="Answer questions with short factoid answers.", question=Question(), answer=Answer())
qa_template_with_CoT = dsp.Template(instructions=qa_template.instructions, context=Context(), question=Question(), rationale=Rationale(), answer=Answer())

@dsp.transformation
def QA_predict(example: dsp.Example, CoT=False, num_preds=1) -> dsp.Example:
    # Use the chain-of-thought template only when CoT is requested.
    template = qa_template_with_CoT if CoT else qa_template

    if num_preds > 1:
        assert CoT, CoT

        example, completions = dsp.generate(template, n=num_preds, temperature=0.7)(example, stage='qa')
        example.completions = completions
        completions = dsp.majority(completions)

    else:
        example, completions = dsp.generate(template)(example, stage='qa')

    return example.copy(answer=completions.answer)
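
For reference, a hedged sketch of this stage in isolation (hypothetical question, empty context and demos, so this runs zero-shot and answer quality will be poor; in the full program the context comes from the search stage):

x = dsp.Example(question="Who wrote Hamlet?", context=[], demos=[])
x = QA_predict(x, CoT=True)
print(x.answer)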

6) Then Search.

SearchQuery = dsp.Type(prefix="Search Query:", desc="${a simple question for seeking missing information}")
SearchRationale = dsp.Type(prefix="Rationale: Let's think step by step. To answer this question, we first need to find out", desc="${the missing information}")
CondenseRationale = dsp.Type(prefix="Rationale: Let's think step by step. Based on the context, we have learned the following.", desc="${information from the context that provides useful clues}")

rewrite_template = dsp.Template(instructions="Write a search query that will help answer a complex question.", question=Question(), rationale=SearchRationale(), query=SearchQuery())
hop_template = dsp.Template(instructions=rewrite_template.instructions, context=Context(), question=Question(), rationale=CondenseRationale(), query=SearchQuery())

from dsp.utils import deduplicate

@dsp.transformation
def multihop_search(example: dsp.Example, max_hops=2, num_queries=10, k=7) -> dsp.Example:
    example.background = []
    example.context = []

    for hop in range(max_hops):
        # Generate queries
        template = rewrite_template if hop == 0 else hop_template

        if num_queries == 1:
            example, completions = dsp.generate(template)(example, stage=f'h{hop}')
            passages = dsp.retrieve(completions.query, k=k)

        else:
            num_queries = int(num_queries)
            temperature = 0.7 if num_queries > 1 else 0.0

            example, completions = dsp.generate(template, n=num_queries, temperature=temperature)(example, stage=f'h{hop}')
            queries = [c.query for c in completions] + [example.question]
            passages = dsp.retrieveEnsemble(queries, k=k)

        # Arrange the context for the next hop: the latest rationale (after the
        # first hop), then background passages kept from earlier hops, then the
        # freshly retrieved passages.
        example.context = ([completions[0].rationale] if hop > 0 else []) + example.background + passages

        example.context = deduplicate(example.context)[:k]
        # Keep one accumulated passage per completed hop as persistent background.
        example.background = deduplicate(example.background + passages)[:(hop+1)]

    return example
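
And a hedged sketch of running the search stage alone (the question is a made-up two-hop example):

x = dsp.Example(question="In which city was the author of Hamlet born?", demos=[])
x = multihop_search(x, num_queries=1, k=7)
print(len(x.context), len(x.background))  # up to k context entries; one background passage per hop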

7) Then Demonstrate, with the full program.

def multihop_QA_v3(x, num_queries=1, num_preds=1, num_passages=7, demos_=None):
    demos = dsp.sample(x.train, k=16)

    @dsp.transformation
    def attempt(d: dsp.Example) -> dsp.Example:
        x = dsp.Example(question=d.question, demos=dsp.all_but(demos, d))
        x = multihop_search(x, num_queries=1, k=min(4, num_passages))

        if not dsp.passage_match(x.context, d.answer): return None
        x = QA_predict(x, CoT=True, num_preds=1)

        if not dsp.answer_match(x.answer, d.answer): return None
        return d.copy(**x)

    x.demos = dsp.annotate(attempt)(demos, k=3, return_all=True) if demos_ is None else demos_

    x = multihop_search(x, num_queries=num_queries, k=num_passages)
    x = QA_predict(x, CoT=True, num_preds=num_preds)
    return x

8) Then evaluation: 5x200 examples.

min_seed_to_run = 0
max_seeds_to_run = 5
min_examples_per_seed = 0
max_examples_per_seed = 200
finals = {}

def base_n10_demos(x):
    return multihop_QA_v3(x, num_queries=10, num_preds=20, num_passages=7)

program = base_n10_demos
total = 0
correct = 0

for seed_idx in range(min_seed_to_run, max_seeds_to_run):
    this_train_set, this_eval_set = train_sets[seed_idx], eval_sets[seed_idx][min_examples_per_seed:max_examples_per_seed]

    this_train_set = [dsp.Example(qid=x['qid'], question=x['question'], answer=x['answers']) for x in this_train_set]
    this_eval_set = [dsp.Example(qid=x['qid'], question=x['question'], answer=x['answers']) for x in this_eval_set]

    # Evaluate here with this training set and this eval set.
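
The evaluation body itself is elided above. Here is a minimal hedged reconstruction of what goes inside that loop, scoring with the EM metric imported in step 3 (my sketch, not the original code):

    # Hedged reconstruction of the elided evaluation body.
    for ex in this_eval_set:
        ex.train = this_train_set  # multihop_QA_v3 samples demos from x.train
        prediction = program(ex)
        correct += int(EM(prediction.answer, ex.answer))
        total += 1

print(f'EM over {total} examples: {100.0 * correct / total:.1f}%')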

9) Before you run on the full 5x200 examples, which can be expensive, please run with max_seeds_to_run=1 and max_examples_per_seed=50. This runs on 50 of the 1,000 examples. The EM score we get on this tiny subset is 36%; the overall score (on all 1,000 examples) is 50%.

WenzhengZhang commented 1 year ago

Thank you so much for providing the details! I don't have the ColBERTv2 index over the 2017 Wikipedia abstracts. I'll use your hosted index for a quick test first.

okhat commented 1 year ago

Sure, but FYI the 2018 passages are considerably longer and otherwise misaligned with HotPotQA (in terms of date and scope). Don't rely on them too much.

xschen-beb commented 6 days ago

Sorry, I ran this code, but it shows these errors:

AttributeError: module 'dsp' has no attribute 'annotate'
AttributeError: module 'dsp' has no attribute 'transformation'
