Open pldlgb opened 1 year ago
me too, have you fixed it?
The problem is that the multiple_choice_separate method needs logprobs to compute EM. The current method does not return logprobs, so the logprob ends up set incorrectly.
This may be due to a limitation in the API. When using multiple_choice_separate, the echo_prompt parameter in the request should be set to True, and the API should then return the logprob of each token in the prompt. The prompt is the context plus a candidate answer; in the metric stage, the logprobs of the answer portion are extracted and compared across the options to obtain the final selection.
In OpenAI's completions API, returning the logprobs of the prompt is supported by setting echo=True, so fixing this error requires improving the client.
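To make the selection logic concrete, here is a minimal, hypothetical sketch in plain Python (function names invented for illustration) of how the answer-portion logprobs could be compared across options once the API echoes per-token prompt logprobs:

```python
def score_option(prompt_token_logprobs, num_answer_tokens):
    """Sum the logprobs of the answer tokens.

    prompt_token_logprobs: per-token logprobs returned with echo_prompt=True,
    covering context + answer; the answer occupies the final
    num_answer_tokens positions.
    """
    return sum(prompt_token_logprobs[-num_answer_tokens:])

def pick_option(options):
    """options: list of (per-token logprobs, answer token count) per candidate.
    Returns the index of the candidate with the highest answer logprob."""
    scores = [score_option(lp, n) for lp, n in options]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: the second option's answer tokens have the higher logprob.
options = [
    ([-1.0, -2.0, -5.0, -6.0], 2),  # answer logprob: -11.0
    ([-1.0, -2.0, -1.5, -0.5], 2),  # answer logprob: -2.0
]
assert pick_option(options) == 1
```

The calibrated variant additionally normalizes each score against the answer's logprob under a neutral context, but the argmax-over-summed-logprobs core is the same.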
My approach is to change model.generate() to model.forward() to get all the logprobs, and to change the logprob calculation accordingly. After this revision, I get results close to those reported in the paper for LLaMA 7B-65B.
That's right.
That's great! @lumosity4tpj if you have a fork or PR I could look at that uses model.forward(), I would love to merge that upstream so that we can get the logprobs. cc @julian-q
I will try to open a PR with my code later, although I don't think it is elegant enough.
Any update?
@lumosity4tpj Any progress on this?
I think that in the meantime it would be a good idea to register all Hugging Face models as LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG rather than FULL_FUNCTIONALITY_TEXT_MODEL_TAG when added via the --enable-huggingface-models flag. This will cause helm-run to skip the problematic scenarios (HellaSwag and OpenBookQA) for now. Do you have any thoughts on this?
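A hypothetical, self-contained sketch of that proposal (the tag constants and the registration shape below are stand-ins for HELM's actual model registry, which should be checked in the current source):

```python
# Stand-in tag constants; HELM defines the real ones in its model registry.
FULL_FUNCTIONALITY_TEXT_MODEL_TAG = "FULL_FUNCTIONALITY_TEXT_MODEL_TAG"
LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG = "LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG"

def register_huggingface_model(name, supports_echo_logprobs=False):
    """Register a model, downgrading its tag until echoed prompt logprobs
    (needed by multiple_choice_separate scenarios) are supported."""
    tag = (FULL_FUNCTIONALITY_TEXT_MODEL_TAG if supports_echo_logprobs
           else LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG)
    return {"name": name, "tags": [tag]}

model = register_huggingface_model("huggingface/gpt2")
assert model["tags"] == [LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG]
```

The point is only that the tag choice hangs off whether the client can return prompt logprobs; once model.forward() support lands, the default flips back.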
@yifanmai @pldlgb I changed huggingface_client.py#L80-L83
if relevant_raw_request["max_new_tokens"] == 0:
    # Forward pass only: score the prompt tokens without generating.
    output = self.model(encoded_input["input_ids"])
    sequences = encoded_input["input_ids"]
    scores = output.logits
else:
    output = self.model.generate(**encoded_input, **relevant_raw_request)
    sequences = output.sequences
    scores = output.scores
and huggingface_client.py#L86-L107
all_logprobs_of_chosen_tokens = []
all_top_logprobs_dicts = []
if relevant_raw_request["max_new_tokens"] == 0 and raw_request["echo_prompt"]:
    # Echoed prompt: scores are forward-pass logits, shape (batch, seq_len, vocab).
    for completion_id in range(raw_request["num_return_sequences"]):
        logprobs_of_chosen_tokens = []
        top_logprobs_dicts = []
        for i in range(len(sequences[completion_id]) - 1):
            logprobs = torch.nn.functional.log_softmax(scores[completion_id][i], dim=0)
            # Get top tokens in terms of log probability.
            topk_logprobs = torch.topk(logprobs, k=top_k_per_token)
            top_logprobs_dicts.append(
                {self.tokenizer.convert_ids_to_tokens(k.item()): v.item()
                 for (k, v) in zip(topk_logprobs.indices, topk_logprobs.values)}
            )
            # Logits at position i score the actual prompt token at position i + 1.
            logprobs_of_chosen_tokens.append(logprobs[sequences[completion_id][i + 1]].item())
        all_logprobs_of_chosen_tokens.append(logprobs_of_chosen_tokens)
        all_top_logprobs_dicts.append(top_logprobs_dicts)
else:
    # Generation: scores is a tuple with one (batch, vocab) logits tensor per step.
    for completion_id in range(raw_request["num_return_sequences"]):
        logprobs_of_chosen_tokens = []
        top_logprobs_dicts = []
        for i in range(len(sequences[completion_id]) - len(encoded_input.input_ids[0])):
            logprobs = torch.nn.functional.log_softmax(scores[i][completion_id], dim=0)
            # Get top tokens in terms of log probability.
            topk_logprobs = torch.topk(logprobs, k=top_k_per_token)
            top_logprobs_dicts.append(
                {self.tokenizer.convert_ids_to_tokens(k.item()): v.item()
                 for (k, v) in zip(topk_logprobs.indices, topk_logprobs.values)}
            )
            j = i + len(encoded_input.input_ids[0])
            logprobs_of_chosen_tokens.append(logprobs[sequences[completion_id][j]].item())
        all_logprobs_of_chosen_tokens.append(logprobs_of_chosen_tokens)
        all_top_logprobs_dicts.append(top_logprobs_dicts)
Compared with the original code, I added a branch; the main reasoning is here: https://github.com/stanford-crfm/helm/issues/1497#issuecomment-1547661957. It's best to double-check the code, because mine is based on an older version.
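For reference, here is a minimal, self-contained sketch of the shift-by-one alignment that the echo branch relies on, using dummy logits instead of a real model (requires torch; the helper name is invented for illustration):

```python
import torch

def prompt_token_logprobs(logits, input_ids):
    """logits: (seq_len, vocab) from a single forward pass over the prompt.
    Returns the logprob of each prompt token given the tokens before it:
    the logits at position i score the token at position i + 1, so the
    first token gets no logprob."""
    logprobs = torch.nn.functional.log_softmax(logits, dim=-1)
    return [logprobs[i, input_ids[i + 1]].item()
            for i in range(len(input_ids) - 1)]

# Dummy data: a 5-token prompt over a vocabulary of 10.
torch.manual_seed(0)
logits = torch.randn(5, 10)
input_ids = torch.tensor([3, 1, 4, 1, 5])
lps = prompt_token_logprobs(logits, input_ids)
assert len(lps) == 4              # one logprob per token after the first
assert all(lp < 0 for lp in lps)  # log-probabilities are negative
```

Summing these values over the answer span gives exactly the quantity the multiple-choice metric compares across options.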
Awesome, thanks! Feel free to open a PR for this.
I haven't tried this out yet, but I'm curious what the performance impact of this change is, if any.
Thank you for sharing. Just one minor thing to add: we may need with torch.no_grad() to reduce GPU memory usage when switching from model.generate to model.forward. Without it, I immediately got OOM on some datasets.
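A minimal illustration of that suggestion, with a stand-in nn.Module instead of a real language model (requires torch): generate() already runs in inference mode, but a bare forward() records an autograd graph and keeps intermediate activations alive, which is what drives the extra memory use.

```python
import torch

# Stand-in for the language model; any nn.Module behaves the same way here.
model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

with torch.no_grad():
    scores = model(x)  # no autograd graph is recorded for this call

assert not scores.requires_grad  # nothing holds activations for backward
```

Wrapping only the scoring forward pass this way leaves training code paths, if any, unaffected.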
Here's another user's branch commit that does something similar to get logprobs.
This is raw input/output to/from Titan: TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': 'Personal Care and Style: [header] How to dye your hair with semi permanent hair dye [title] Find the color you want. [step] There are many popular brands and hundreds of different colors to choose from. Semi-permanent dyes can be found in a variety of places, ranging from grocery stores to specialized fashion shops, with the biggest selection at beauty supply stores. It is important to select the color that represents your hair type when you register your hair color. [substeps] Traditional semi-permanent dyes will generally not be available for hair color, like blow-dryers, curling irons, and appliances.', 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}} {'statusCode': 200, 'body': '{"inputTextTokenCount": 127, "results": [{"tokenCount": 33, "outputText": "\nThe color of your hair will determine the best color for you. For example, if you have a dark brown hair, you can\'t have a light pink.", "completionReason": "FINISH"}]}'}
Question: what is the right way to interrogate an LLM to obtain correct HellaSwag EM measurements?
OpenBookQA questions don't format properly to LLM prompt of Amazon Bedrock-Titan: {description: "commonsense:model=http/gpt2,dataset=openbookqa,method=multiple_choice_separate_calibrated,data_augmentation=canonical", priority: 2}
Calling APIGW███████████████████████▌ | 6/24 [00:19<00:56, 3.13s/it] TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': 'What is a more comfortable color to have for your automobile upholstery if living in a desert? navy', 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}} {'statusCode': 200, 'body': '{"inputTextTokenCount": 19, "results": [{"tokenCount": 1, "outputText": " blue", "completionReason": "FINISH"}]}'}
Calling APIGW████████████████████████████▊ | 7/24 [00:21<00:44, 2.59s/it] TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': 'Answer: navy', 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}} {'statusCode': 200, 'body': '{"inputTextTokenCount": 3, "results": [{"tokenCount": 10, "outputText": "\n\nThe correct answer is option (a).", "completionReason": "FINISH"}]}'}
Calling APIGW██████████████████████████████████ | 8/24 [00:23<00:40, 2.53s/it] TITAN PAYLOAD: {'model': 'amazon.titan-tg1-large', 'inputText': "what's a more comfortable color to have for your automobile upholstery if living in a desert? ecru", 'textGenerationConfig': {'maxTokenCount': 512, 'temperature': 0.0, 'topP': 0.9}} {'statusCode': 200, 'body': '{"inputTextTokenCount": 29, "results": [{"tokenCount": 2, "outputText": " or sand", "completionReason": "FINISH"}]}'}
{description:"commonsense:model=huggingface/gpt2,dataset=hellaswag,method=multiple_choice_separate_original,data_augmentation=canonical", priority: 1}
Thank you for your great work. Following the instructions provided in run_specs.conf, I reproduced the two core scenarios HellaSwag and OpenBookQA and got 100% accuracy. There must be a problem somewhere: when I checked the results generated by the model, I found that it always produces content for option A. I don't know what went wrong. Could you please give me some assistance?