stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Hellaswag and Openbookqa Accuracy EM==1? #1497

Open pldlgb opened 1 year ago

pldlgb commented 1 year ago

{description:"commonsense:model=huggingface/gpt2,dataset=hellaswag,method=multiple_choice_separate_original,data_augmentation=canonical", priority: 1} Thank you for your great work. According to the instructions provided in run_specs.conf, I was able to achieve 100% accuracy when reproducing the two core scenarios of Hellaswag and OpenBookQA. I think there may be some problem somewhere, as I checked the results generated by the model and found that it always generates content for option A. I don't know what went wrong. Could you please give me some assistance?

dongZheX commented 1 year ago

Me too. Have you fixed it?

lumosity4tpj commented 1 year ago

The problem is that the multiple_choice_separate method needs log probabilities to compute exact match. The current client does not return logprobs, so the logprob used by the metric is set incorrectly.

dongZheX commented 1 year ago

This may be due to an issue with the backend API. When using multiple_choice_separate, the echo_prompt parameter in the request should be set to True, and the API should return the logprob of each token in the prompt. The prompt consists of the context plus the answer; in the metric stage, the logprob of the answer portion is extracted and compared against the logprobs of the other options to obtain the final selection.

In OpenAI's API, returning the logprobs of the prompt is supported by setting echo=True (together with logprobs), so fixing this error requires adding the same capability to the backend, here the Hugging Face client.
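
To make the mechanism concrete, here is a minimal sketch of the idea (not HELM's actual implementation; the model name, context, and options are placeholders): each option is scored by the total logprob of its answer tokens given the context, computed from the prompt logits, and the highest-scoring option is chosen.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(context: str, answer: str) -> float:
    # Assumes the tokenization of `context` is a prefix of the tokenization of
    # `context + answer` (true for GPT-2 BPE when the answer starts with a space).
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab_size]
    logprobs = torch.log_softmax(logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    total = 0.0
    for i in range(context_ids.shape[1], full_ids.shape[1]):
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

context = "The man opened his umbrella because"
options = [" it started to rain.", " he was hungry.", " the store was closed."]
scores = [answer_logprob(context, option) for option in options]
print("predicted option:", options[scores.index(max(scores))])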

lumosity4tpj commented 1 year ago

My approach is to change model.generate() to model.forward() to get all the logprobs, and to change how the logprob is calculated accordingly. After this revision, I can get results close to those reported in the paper for LLaMA 7B-65B.

dongZheX commented 1 year ago

That's right.

yifanmai commented 1 year ago

That's great! @lumosity4tpj if you have a fork or PR that uses model.forward() that I could look at, I would love to merge it upstream so that we can get the logprobs. cc @julian-q

lumosity4tpj commented 1 year ago

I will try to open a PR with my code later, although I don't think it is elegant enough.

pldlgb commented 1 year ago

Any update?

yifanmai commented 1 year ago

@lumosity4tpj Any progress on this?

I think it would be a good idea in the meantime to register all Hugging Face models with LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG rather than FULL_FUNCTIONALITY_TEXT_MODEL_TAG here when they are added via the --enable-huggingface-models flag. This would cause helm-run to skip the problematic scenarios (HellaSwag and OpenBookQA) for now. Do you have any thoughts on this?

lumosity4tpj commented 1 year ago

@yifanmai @pldlgb I changed huggingface_client.py#L80-L83

# Prompt-scoring path: when max_new_tokens == 0, run a plain forward pass over
# the prompt and use its logits instead of calling generate().
if relevant_raw_request["max_new_tokens"] == 0:
    output = self.model(encoded_input["input_ids"])
    sequences = encoded_input["input_ids"]
    scores = output.logits
else:
    # Generation path: unchanged from the original client code.
    output = self.model.generate(**encoded_input, **relevant_raw_request)
    sequences = output.sequences
    scores = output.scores

and huggingface_client.py#L86-L107

all_logprobs_of_chosen_tokens = []
all_top_logprobs_dicts = []
# Forward-pass case (echo_prompt with max_new_tokens == 0): "scores" holds the
# logits with shape [batch, seq_len, vocab]; the logits at position i predict
# the token at position i + 1.
if relevant_raw_request["max_new_tokens"] == 0 and raw_request["echo_prompt"]:
    for completion_id in range(raw_request["num_return_sequences"]):
        logprobs_of_chosen_tokens = []
        top_logprobs_dicts = []
        for i in range(len(sequences[completion_id]) - 1):
            logprobs = torch.nn.functional.log_softmax(scores[completion_id][i], dim=0)
            topk_logprobs = torch.topk(logprobs, k=top_k_per_token)
            top_logprobs_dicts.append({self.tokenizer.convert_ids_to_tokens(k.item()): v.item()
                                    for (k, v) in zip(topk_logprobs.indices, topk_logprobs.values)})
            logprobs_of_chosen_tokens.append(logprobs[sequences[completion_id][i + 1]].item())
        all_logprobs_of_chosen_tokens.append(logprobs_of_chosen_tokens)
        all_top_logprobs_dicts.append(top_logprobs_dicts)
else:
    # Generation case: generate() returns "scores" as a tuple with one
    # [batch, vocab] tensor per generated token, hence scores[i][completion_id].
    for completion_id in range(raw_request["num_return_sequences"]):
        logprobs_of_chosen_tokens = []
        top_logprobs_dicts = []
        for i in range(len(sequences[completion_id]) - len(encoded_input.input_ids[0])):
            logprobs = torch.nn.functional.log_softmax(scores[i][completion_id], dim=0)
            # Get top tokens in terms of log probability.
            topk_logprobs = torch.topk(logprobs, k=top_k_per_token)
            top_logprobs_dicts.append({self.tokenizer.convert_ids_to_tokens(k.item()): v.item() 
                                            for (k, v) in zip(topk_logprobs.indices, topk_logprobs.values)})
            j = i + len(encoded_input.input_ids[0])
            logprobs_of_chosen_tokens.append(logprobs[sequences[completion_id][j]].item())
        all_logprobs_of_chosen_tokens.append(logprobs_of_chosen_tokens)
        all_top_logprobs_dicts.append(top_logprobs_dicts)

Compared with the original code, I added a branch; for the main reason, see: https://github.com/stanford-crfm/helm/issues/1497#issuecomment-1547661957

It's best to check the code because my code is based on an older version.

yifanmai commented 1 year ago

Awesome, thanks! Feel free to open a PR for this.

I haven't tried this out yet, but I'm curious what the performance impact of this change would be, if any.

enor2017 commented 10 months ago

Thank you for sharing. Just one minor thing to add: we may need to wrap the forward pass in with torch.no_grad() to reduce GPU memory usage when switching from model.generate to model.forward. Without it, I immediately got OOM on some datasets.
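
Concretely, the first branch of the snippet above might become something like this (a sketch of the suggested change, not a tested patch):

if relevant_raw_request["max_new_tokens"] == 0:
    # Scoring only needs a forward pass and no gradients, so wrap it in
    # torch.no_grad() to avoid keeping activations in GPU memory.
    with torch.no_grad():
        output = self.model(encoded_input["input_ids"])
    sequences = encoded_input["input_ids"]
    scores = output.logits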

yifanmai commented 9 months ago

Here's another user's branch commit that does something similar to get logprobs.

sermolin commented 9 months ago

I tried running HellaSwag against Amazon Bedrock Titan. I consistently get EM=0.25. This is the HELM result output:

Instance id44874 [split: valid]
Input: Personal Care and Style: [header] How to dye your hair with semi permanent hair dye [title] Find the color you want. [step] There are many popular brands and hundreds of different colors to choose from. Semi-permanent dyes can be found in a variety of places, ranging from grocery stores to specialized fashion shops, with the biggest selection at beauty supply stores.
References: It is important to select the color that represents your hair type when you register your hair color. [substeps] Traditional semi-permanent dyes will generally not be available for hair color, like blow-dryers, curling irons, and appliances. If you're not planning on dying your hair, there are other coloration measures you can take to dye your hair. [step] Photoshop hd darkers work well, but don't lack the style that can be coupled with it. Pick the color that's your favorite, matches your wardrobe best, and/or is most flattering for your eye color and skin tone. Semi-permanent dyes work on all hair colors, but show up brightest on light hair. [correct] However, you can also take your color, added color, and texture into account when deciding what to dye, and what you will use it for. [substeps] Consider adding your hair dye to your hair if you have it long or curly.
[ Exact match: 0.25 | # train: 0 | truncated: 0 | # prompt tokens: 128 | # output tokens: 0 | # trials: 1 ]
Prediction{trial 0}: The color of your hair will de ...(74 characters)... , you can't have a light pink.

This is raw input/output to/from Titan:

TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': 'Personal Care and Style: [header] How to dye your hair with semi permanent hair dye [title] Find the color you want. [step] There are many popular brands and hundreds of different colors to choose from. Semi-permanent dyes can be found in a variety of places, ranging from grocery stores to specialized fashion shops, with the biggest selection at beauty supply stores. It is important to select the color that represents your hair type when you register your hair color. [substeps] Traditional semi-permanent dyes will generally not be available for hair color, like blow-dryers, curling irons, and appliances.', 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}}
{'statusCode': 200, 'body': '{"inputTextTokenCount": 127, "results": [{"tokenCount": 33, "outputText": "\nThe color of your hair will determine the best color for you. For example, if you have a dark brown hair, you can\'t have a light pink.", "completionReason": "FINISH"}]}'}

Question: what is the right way to query an LLM to obtain correct HellaSwag EM measurements?

sermolin commented 9 months ago

OpenBookQA questions are not formatted properly in the LLM prompt for Amazon Bedrock Titan:

{description: "commonsense:model=http/gpt2,dataset=openbookqa,method=multiple_choice_separate_calibrated,data_augmentation=canonical", priority: 2}

Calling APIGW | 6/24 [00:19<00:56, 3.13s/it]
TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': 'What is a more comfortable color to have for your automobile upholstery if living in a desert? navy', 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}}
{'statusCode': 200, 'body': '{"inputTextTokenCount": 19, "results": [{"tokenCount": 1, "outputText": " blue", "completionReason": "FINISH"}]}'}

Calling APIGW | 7/24 [00:21<00:44, 2.59s/it]
TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': 'Answer: navy', 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}}
{'statusCode': 200, 'body': '{"inputTextTokenCount": 3, "results": [{"tokenCount": 10, "outputText": "\n\nThe correct answer is option (a).", "completionReason": "FINISH"}]}'}

Calling APIGW | 8/24 [00:23<00:40, 2.53s/it]
TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': "what's a more comfortable color to have for your automobile upholstery if living in a desert? ecru", 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}}
{'statusCode': 200, 'body': '{"inputTextTokenCount": 29, "results": [{"tokenCount": 2, "outputText": " or sand", "completionReason": "FINISH"}]}'}
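
For what it's worth, my understanding of what multiple_choice_separate_calibrated is doing with those two prompts (the exact HELM adapter details may differ): each option is scored by the logprob of the option text given the question, minus its logprob given the bare calibration prefix "Answer:". Both terms require prompt logprobs, which the Titan responses above do not appear to include. A minimal sketch, assuming a scoring helper that behaves like the answer_logprob function sketched earlier in this thread:

# score(option) = logprob(option | question) - logprob(option | "Answer:")
# `logprob_fn` is a hypothetical helper with the signature of answer_logprob above.
def calibrated_score(logprob_fn, question: str, option: str) -> float:
    conditional = logprob_fn(question, " " + option)
    calibration = logprob_fn("Answer:", " " + option)
    return conditional - calibration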