sail-sg / lorahub

[COLM 2024] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

Request for Guidance on Reproducing Experiments for BigBenchHard #8

Closed: sunyuhan19981208 closed this issue 1 year ago

sunyuhan19981208 commented 1 year ago

Description

I am currently attempting to reproduce the experiment results outlined in your paper using the Flan-t5-large model on the BBH task. I've encountered a deviation in the results and I'm seeking assistance in addressing this issue.

Brief summary of my situation

Experiment Details

I used the evaluation script from the instruct-eval GitHub repository to run the evaluation.

Discrepancy

The score I obtained for Flan-T5-large on the BBH task was 31.82, whereas your paper reports 27 for the same model-task combination.

Disabling CoT

To ensure consistency, I made sure to set ntrain=0 as instructed.

Request

I kindly ask for your guidance in reproducing the experiment results as presented in your paper. It would be immensely helpful if you could provide some insights, suggestions, or clarifications to assist me in achieving consistent results.

I greatly appreciate your time and consideration in helping me address this matter. Your expertise will go a long way in resolving this discrepancy and advancing my understanding of the experiment.

Thank you in advance for your assistance!

Log of Evaluation

[2023-08-15 14:55:45,077] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
{'data_dir': 'lukaemon/bbh', 'ntrain': 0, 'kwargs': {'model_name': 'seq_to_seq', 'model_path': '/data/sdb/sunyuhan/exp/flan-t5-large'}, 'args': Namespace(data_dir='lukaemon/bbh', ntrain=0, kwargs={'model_name': 'seq_to_seq', 'model_path': '/data/sdb/sunyuhan/exp/flan-t5-large'}), 'model': SeqToSeqModel(model_path='/data/sdb/sunyuhan/exp/flan-t5-large', max_input_length=2048, max_output_length=32, model=None, tokenizer=None, lora_path='', device='cuda', load_8bit=False, do_sample=False)}
{'prompt': 'not ( True ) and ( True ) is\nAnswer:', 'label': 'False', 'pred': 'not'}
{'name': 'boolean_expressions', 'score': 0.504}
{'prompt': 'How would a typical person answer each of the following questions about causation?\nA machine is set up in such a way that it will short circuit if both the black wire and the red wire touch the battery at the same time. The machine will not short circuit if just one of these wires touches the battery. The black wire is designated as the one that is supposed to touch the battery, while the red wire is supposed to remain in some other part of the machine. One day, the black wire and the red wire both end up touching the battery at the same time. There is a short circuit. Did the black wire cause the short circuit?\nOptions:\n- Yes\n- No\nAnswer:', 'label': 'No', 'pred': 'Yes'}
{'name': 'causal_judgement', 'score': 0.5989304812834224}
{'prompt': 'Today is Christmas Eve of 1937. What is the date tomorrow in MM/DD/YYYY?\nOptions:\n(A) 12/11/1937\n(B) 12/25/1937\n(C) 01/04/1938\n(D) 12/04/1937\n(E) 12/25/2006\n(F) 07/25/1937\nAnswer:', 'label': '(B)', 'pred': '(A)'}
{'name': 'date_understanding', 'score': 0.22}
{'prompt': 'In the following sentences, explain the antecedent of the pronoun (which thing the pronoun refers to), or state that it is ambiguous.\nSentence: The patient was referred to the specialist because he had a rare skin condition.\nOptions:\n(A) The patient had a skin condition\n(B) The specialist had a skin condition\n(C) Ambiguous\nAnswer:', 'label': '(A)', 'pred': 'The patient had a skin condition'}
{'name': 'disambiguation_qa', 'score': 0.0}
{'prompt': 'Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: [ [\nAnswer:', 'label': '] ]', 'pred': ']'}
{'name': 'dyck_languages', 'score': 0.0}
{'prompt': '"Here comes a perfectly valid argument: First of all, whoever is a schoolmate of Sondra is not a stepsister of Pricilla. In consequence, whoever is not a stepsister of Pricilla is a schoolmate of Sondra."\nIs the argument, given the explicitly stated premises, deductively valid or invalid?\nOptions:\n- valid \n- invalid\nAnswer:', 'label': 'invalid', 'pred': 'valid'}
{'name': 'formal_fallacies', 'score': 0.572}
{'prompt': 'This SVG path element <path d="M 55.57,80.69 L 57.38,65.80 M 57.38,65.80 L 48.90,57.46 M 48.90,57.46 L 45.58,47.78 M 45.58,47.78 L 53.25,36.07 L 66.29,48.90 L 78.69,61.09 L 55.57,80.69"/> draws a\nOptions:\n(A) circle\n(B) heptagon\n(C) hexagon\n(D) kite\n(E) line\n(F) octagon\n(G) pentagon\n(H) rectangle\n(I) sector\n(J) triangle\nAnswer:', 'label': '(B)', 'pred': '(A)'}
{'name': 'geometric_shapes', 'score': 0.172}
{'prompt': 'Which sentence has the correct adjective order:\nOptions:\n(A) midsize old grey Brazilian sweater\n(B) midsize grey Brazilian old sweater\nAnswer:', 'label': '(A)', 'pred': '(A) old, grey, sweater'}
{'name': 'hyperbaton', 'score': 0.596}
{'prompt': 'The following paragraphs each describe a set of five objects arranged in a fixed order. The statements are logically consistent within each paragraph. On a branch, there are five birds: a quail, an owl, a raven, a falcon, and a robin. The owl is the leftmost. The robin is to the left of the raven. The quail is the rightmost. The raven is the third from the left.\nOptions:\n(A) The quail is the rightmost\n(B) The owl is the rightmost\n(C) The raven is the rightmost\n(D) The falcon is the rightmost\n(E) The robin is the rightmost\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'logical_deduction_five_objects', 'score': 0.452}
{'prompt': 'The following paragraphs each describe a set of seven objects arranged in a fixed order. The statements are logically consistent within each paragraph. In a golf tournament, there were seven golfers: Ana, Eve, Ada, Dan, Rob, Amy, and Joe. Dan finished third. Ana finished above Ada. Amy finished last. Dan finished below Rob. Eve finished below Ada. Rob finished below Joe.\nOptions:\n(A) Ana finished third\n(B) Eve finished third\n(C) Ada finished third\n(D) Dan finished third\n(E) Rob finished third\n(F) Amy finished third\n(G) Joe finished third\nAnswer:', 'label': '(D)', 'pred': '(D)'}
{'name': 'logical_deduction_seven_objects', 'score': 0.456}
{'prompt': 'The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically consistent within each paragraph. On a branch, there are three birds: a blue jay, a quail, and a falcon. The falcon is to the right of the blue jay. The blue jay is to the right of the quail.\nOptions:\n(A) The blue jay is the second from the left\n(B) The quail is the second from the left\n(C) The falcon is the second from the left\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'logical_deduction_three_objects', 'score': 0.516}
{'prompt': 'Find a movie similar to Batman, The Mask, The Fugitive, Pretty Woman:\nOptions:\n(A) The Front Page\n(B) Maelstrom\n(C) The Lion King\n(D) Lamerica\nAnswer:', 'label': '(C)', 'pred': '(A)'}
{'name': 'movie_recommendation', 'score': 0.56}
{'prompt': '((-1 + 2 + 9 * 5) - (-2 + -4 + -4 * -7)) =\nAnswer:', 'label': '24', 'pred': '((-1 + 3) + 9 * 5) - (-4 + -6) + -23 * -7))'}
{'name': 'multistep_arithmetic_two', 'score': 0.0}
{'prompt': 'If you follow these instructions, do you return to the starting point? Always face forward. Take 1 step backward. Take 9 steps left. Take 2 steps backward. Take 6 steps forward. Take 4 steps forward. Take 4 steps backward. Take 3 steps right.\nOptions:\n- Yes\n- No\nAnswer:', 'label': 'No', 'pred': 'No'}
{'name': 'navigate', 'score': 0.576}
{'prompt': 'I have a flute, a piano, a trombone, four stoves, a violin, an accordion, a clarinet, a drum, two lamps, and a trumpet. How many musical instruments do I have?\nAnswer:', 'label': '8', 'pred': '20'}
{'name': 'object_counting', 'score': 0.288}
{'prompt': 'Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We now add a penguin to the table:\nJames, 12, 90, 12\nAnd here is a similar table, but listing giraffes:\nname, age, height (cm), weight (kg)\nJody, 5, 430, 620\nGladys, 10, 420, 590\nMarian, 2, 310, 410\nDonna, 9, 440, 650\nWhich is the oldest penguin?\nOptions:\n(A) Louis\n(B) Bernard\n(C) Vincent\n(D) Gwen\n(E) James\nAnswer:', 'label': '(E)', 'pred': '(E)'}
{'name': 'penguins_in_a_table', 'score': 0.3424657534246575}
{'prompt': 'On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, three grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?\nOptions:\n(A) zero\n(B) one\n(C) two\n(D) three\n(E) four\n(F) five\n(G) six\n(H) seven\n(I) eight\n(J) nine\n(K) ten\n(L) eleven\n(M) twelve\n(N) thirteen\n(O) fourteen\n(P) fifteen\n(Q) sixteen\nAnswer:', 'label': '(D)', 'pred': 'There are 1 mauve cat toy, 2 purple cat toys, 3 grey cat toys, 2 grey notebooks, 3 burgundy cat'}
{'name': 'reasoning_about_colored_objects', 'score': 0.292}
{'prompt': "Which of the following is a humorous edit of this artist or movie name: 'rain man'?\nOptions:\n(A) ruin man\n(B) rains man\n(C) rain men\n(D) rainmman\nAnswer:", 'label': '(A)', 'pred': '(D)'}
{'name': 'ruin_names', 'score': 0.236}
{'prompt': 'The following translations from German to English contain a particular error. That error will be one of the following types: Named Entities: An entity (names, places, locations, etc.) is changed to a different entity. Numerical Values: Numerical values (ordinals or cardinals), dates, and/or units are changed. Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are changed. Negation or Antonyms: Introduce or remove a negation or change comparatives to their antonyms. Facts: Trivial factual errors not pertaining to the above classes are introduced in the translations. Dropped Content: A significant clause in the translation is removed. Please identify that error.  Source: Karl Borromäus Joseph Fürst von Liechtenstein war ein kaiserlicher Feldmarschall.\nTranslation: Charles Borromeo Joseph Prince of Liechtenstein was an judicial field marshal.\nThe translation contains an error pertaining to\nOptions:\n(A) Modifiers or Adjectives\n(B) Numerical Values\n(C) Negation or Antonyms\n(D) Named Entities\n(E) Dropped Content\n(F) Facts\nAnswer:', 'label': '(A)', 'pred': '(C)'}
{'name': 'salient_translation_error_detection', 'score': 0.352}
{'prompt': "Which statement is sarcastic?\nOptions:\n(A) He's a generous person, trying to promote a charity stream that has raised millions to help kids in need\n(B) He's a terrible person, trying to promote a charity stream that has raised millions to help kids in need\nAnswer:", 'label': '(B)', 'pred': '(B)'}
{'name': 'snarks', 'score': 0.550561797752809}
{'prompt': 'Is the following sentence plausible? "Elias Lindholm beat the buzzer."\nAnswer:', 'label': 'no', 'pred': 'yes'}
{'name': 'sports_understanding', 'score': 0.552}
{'prompt': 'Today, Susan went to the coffee shop. Between what times could they have gone?\nWe know that:\nSusan woke up at 7am.\nLinda saw Susan driving to the water park from 7am to 11am.\nJohn saw Susan buying clothes at the mall from 11am to 12pm.\nJessica saw Susan taking photos near the Eiffel Tower from 12pm to 1pm.\nSteven saw Susan buying lunch at the deli from 1pm to 2pm.\nThomas saw Susan reading at the library from 2pm to 6pm.\nThe coffee shop was closed after 9pm.\nBetween what times could Susan have gone to the coffee shop?\nOptions:\n(A) 6pm to 9pm\n(B) 7am to 11am\n(C) 1pm to 2pm\n(D) 2pm to 6pm\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'temporal_sequences', 'score': 0.22}
{'prompt': 'Alice, Bob, Claire, Dave, and Eve are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Patrick, Bob is dancing with Sam, Claire is dancing with Jamie, Dave is dancing with Lola, and Eve is dancing with Melissa.\nThroughout the song, the dancers often trade partners. First, Dave and Eve switch partners. Then, Dave and Alice switch partners. Then, Eve and Alice switch partners. Then, Claire and Bob switch partners. Finally, Dave and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Patrick\n(B) Sam\n(C) Jamie\n(D) Lola\n(E) Melissa\nAnswer:', 'label': '(A)', 'pred': '(A)'}
{'name': 'tracking_shuffled_objects_five_objects', 'score': 0.12}
{'prompt': 'Alice, Bob, Claire, Dave, Eve, Fred, and Gertrude are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing striker, Bob is playing right winger, Claire is playing left winger, Dave is playing benchwarmer, Eve is playing goalkeeper, Fred is playing center midfielder, and Gertrude is playing cheerleader.\nAs the game progresses, pairs of players occasionally swap positions. First, Eve and Claire trade positions. Then, Gertrude and Alice trade positions. Then, Fred and Bob trade positions. Then, Dave and Fred trade positions. Then, Fred and Bob trade positions. Then, Bob and Eve trade positions. Finally, Claire and Alice trade positions. At the end of the match, Gertrude is playing\nOptions:\n(A) striker\n(B) right winger\n(C) left winger\n(D) benchwarmer\n(E) goalkeeper\n(F) center midfielder\n(G) cheerleader\nAnswer:', 'label': '(A)', 'pred': '(G)'}
{'name': 'tracking_shuffled_objects_seven_objects', 'score': 0.084}
{'prompt': 'Alice, Bob, and Claire are friends and avid readers who occasionally trade books. At the start of the semester, they each buy one new book: Alice gets Ulysses, Bob gets Frankenstein, and Claire gets Lolita.\nAs the semester proceeds, they start trading around the new books. First, Claire and Bob swap books. Then, Bob and Alice swap books. Finally, Claire and Bob swap books. At the end of the semester, Bob has\nOptions:\n(A) Ulysses\n(B) Frankenstein\n(C) Lolita\nAnswer:', 'label': '(B)', 'pred': '(B)'}
{'name': 'tracking_shuffled_objects_three_objects', 'score': 0.332}
{'prompt': 'Question: Sherrie tells the truth. Vernell says Sherrie tells the truth. Alexis says Vernell lies. Michaela says Alexis tells the truth. Elanor says Michaela tells the truth. Does Elanor tell the truth?\nAnswer:', 'label': 'No', 'pred': 'yes'}
{'name': 'web_of_lies', 'score': 0.0}
{'prompt': 'Sort the following words alphabetically: List: syndrome therefrom\nAnswer:', 'label': 'syndrome therefrom', 'pred': 'therefrom'}
{'name': 'word_sorting', 'score': 0.0}
{'average': 0.31822066786892184}
{'bbh': 31.82}
bbh: 31.82
SivilTaram commented 1 year ago

@sunyuhan19981208 Hello, thanks for reporting the evaluation inconsistency! I'm not sure why the performance differs so much. I have checked the results per task and observed gaps on several of them.

I'm not sure if these gaps are caused by different evaluation methods. In our evaluation script, we simply generate the output and score it with exact match (for all models), without any further post-processing (e.g., removing redundant explanations).

One thing that may make a difference is that we split the BBH benchmark into two sets to study the effect of the number of examples on BBH performance in Figure 3. That is to say, our BBH evaluation set contains 150 examples per task (the original benchmark has 250 per task), and we evaluate all models on the same 150 examples for a fair comparison.
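
For reference, a minimal sketch of building such a per-task subset from the lukaemon/bbh dataset shown in your log (the exact indices of our split may differ, so treat this as purely illustrative):

from datasets import load_dataset

# Illustrative only: keep the first 150 examples of each task for evaluation;
# the split actually used in the paper may use different indices.
BBH_TASKS = ["boolean_expressions", "causal_judgement", "date_understanding"]  # truncated list

eval_subsets = {}
for task in BBH_TASKS:
    ds = load_dataset("lukaemon/bbh", task, split="test")
    eval_subsets[task] = ds.select(range(150))
    print(task, len(eval_subsets[task]))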

That said, I do not think the smaller evaluation set alone would cause such a big difference. I will update the reproduction procedure soon for your reference.

SivilTaram commented 1 year ago

@sunyuhan19981208 Please follow reproduce_bbh.py to obtain the zero-shot results of FLAN-T5-Large reported in our paper. Note that we follow the FLAN paper and add a task description before each task input, so the prompt differs slightly from the one in your evaluation script. Also, I'm not sure whether the repo you shared uses fuzzy matching to compare the ground truth and the prediction?
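
For illustration, the prompt difference is roughly the following; the description string below is made up, and the real descriptions come from the templates used in reproduce_bbh.py:

task_input = "not ( True ) and ( True ) is"

# Prompt style of the instruct-eval script (no task description):
prompt_plain = f"{task_input}\nAnswer:"

# FLAN-style prompt with a task description prepended (description text is illustrative):
task_description = "Evaluate the result of a random Boolean expression."
prompt_flan = f"{task_description}\n\n{task_input}\nAnswer:"

print(prompt_plain)
print("---")
print(prompt_flan)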

sunyuhan19981208 commented 1 year ago

@sunyuhan19981208 Please follow reproduce_bbh.py to obtain the zero-shot results of FLAN-T5-Large reported in our paper. Note that we follow the FLAN paper and add a task description before each task input, so the prompt differs slightly from the one in your evaluation script. Also, I'm not sure whether the repo you shared uses fuzzy matching to compare the ground truth and the prediction?

Thanks a lot! Below is the evaluation code from instruct-eval:

def evaluate(model: EvalModel, data: BBHData, ntrain: int) -> dict:
    # Split off the first `ntrain` samples as in-context examples; with ntrain=0
    # data_train is empty and the evaluation is zero-shot.
    data_train = BBHData(samples=data.samples[:ntrain])
    data_test = BBHData(samples=data.samples[ntrain:])
    is_correct = []

    for i in range(len(data_test.samples)):
        # get prompt and make sure it fits
        k = int(ntrain)
        prompt_end = data_test.samples[i].as_prompt(include_answer=False)
        train_prompt = gen_prompt(data_train, k)
        prompt = train_prompt + prompt_end

        while not model.check_valid_length(prompt) and k > 0:
            k -= 1
            train_prompt = gen_prompt(data_train, k)
            prompt = train_prompt + prompt_end

        label = data_test.samples[i].target
        pred = model.run(prompt)
        # Prefix match: a prediction counts as correct if it merely starts with the label.
        is_correct.append(pred.strip().startswith(label))
        if i == 0:
            print(dict(prompt=prompt, label=label, pred=pred))

    return dict(score=sum(is_correct) / len(is_correct))

sunyuhan19981208 commented 1 year ago

@SivilTaram I think the main cause is that they use startswith while you use ==.

SivilTaram commented 1 year ago

@sunyuhan19981208 Okay, that makes sense. Actually, I do not think startswith is better than == when the model gives a wrong explanation: it may predict the correct option yet still not produce the correct content. For me, as long as the comparison is fair, the metric should reflect the ranking / trend across models. Since our paper uses the same exact match metric for all models, I think using exact match is fine.
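
To make the difference concrete, here is a minimal illustration using one of the hyperbaton predictions from your log (not the exact code of either repo):

label = "(A)"
pred = "(A) old, grey, sweater"

# instruct-eval style: prefix match counts the extra text after the option as correct.
print(pred.strip().startswith(label))  # True

# Exact match (as used in our paper): the prediction must equal the label exactly.
print(pred.strip() == label)           # False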

sunyuhan19981208 commented 1 year ago

@SivilTaram Thanks, I will switch to exact match in a new commit to reproduce your experiment. I will also run the evaluation with startswith and report those results once the experiment finishes.

SivilTaram commented 1 year ago

Closing the issue since the problem is solved. Feel free to reopen it if you want to report new results.

SivilTaram commented 1 year ago

@sunyuhan19981208 How did the performance turn out? Have you tried using startswith to evaluate LoraHub?