junyangwang0410 / AMBER

An LLM-free Multi-dimensional Benchmark for Multi-modal Hallucination Evaluation
Apache License 2.0

how to process model's response to "YES"/"No" #2

Open workmistm opened 1 week ago

workmistm commented 1 week ago

For the discriminative task (id >= 1005), the expected response format is "Yes"/"No", but my model's responses are not in this format. How should I process the data into this format?

junyangwang0410 commented 1 week ago

The commonly used method is to restrict the model's output vocabulary: for example, compare the logits that the model outputs for the words "Yes" and "No" and select the word with the higher probability as the answer. For black-box models, the output format can be constrained by an additional prompt (for example, "You can only reply 'Yes' or 'No'") or by phrasing the query as a multiple-choice question.
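
As a rough illustration of the logits-comparison idea (a minimal sketch, not the exact AMBER evaluation code; the checkpoint name is a placeholder, and a multimodal model would additionally need its image input passed through its own interface):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; substitute your own model.
model_name = "your-model-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).eval()

def yes_or_no(prompt: str) -> str:
    """Answer a discriminative query by comparing the logits of "Yes" and "No"."""
    # Ids of the first sub-token of each candidate answer.
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Logits over the whole vocabulary at the next-token position.
        logits = model(**inputs).logits[0, -1]
    return "Yes" if logits[yes_id] > logits[no_id] else "No"
```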

workmistm commented 1 week ago

So we need to refactor the query for the discriminative task? I think this will introduce errors when comparing with other models. What is your method for comparing with these models?

junyangwang0410 commented 6 days ago

If you use the logits-comparison method, you do not need to refactor the query. This is how we evaluate open-source models in the paper. For black-box models, such as GPT-4V, we experimentally found that adding a few output-restricting instructions to the queries did not significantly impact performance.
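
For black-box models, the restriction can be as simple as appending an instruction to each query and normalizing whatever comes back; a hypothetical sketch:

```python
def restrict_query(query: str) -> str:
    # Append an output-restricting instruction before sending the query
    # to a black-box model such as GPT-4V.
    return query + " Please answer with only 'Yes' or 'No'."

def normalize(response: str) -> str:
    # Map a possibly verbose reply (e.g. "Yes, there is a dog.") to "Yes"/"No".
    return "Yes" if response.strip().lower().startswith("yes") else "No"
```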

workmistm commented 19 hours ago

Could you please give me a code example for open-source models, so that I can experiment with it?

junyangwang0410 commented 15 hours ago

Could you provide me with the model you need to test and the corresponding code repository address?

workmistm commented 15 hours ago

Our model is based on InternVL2-8B.

junyangwang0410 commented 14 hours ago

I checked and confirmed that InternVL is built on the "transformers" library. You can use the parameter of the model.generate API that returns scores (logits) to get the probability over the whole vocabulary at the first generated token position. Then you just need to compare the probabilities of the two words "Yes" and "No". You can obtain the ids of these two words through the tokenizer.
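
A minimal sketch of that generate-based approach, assuming `model`, `tokenizer`, and `inputs` (text ids plus image features) are already prepared by your InternVL2-8B inference code:

```python
import torch

yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer("No", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,
        output_scores=True,             # return the logits of each generated step
        return_dict_in_generate=True,
    )

# Logits over the vocabulary at the first generated token position.
first_step_logits = out.scores[0][0]
answer = "Yes" if first_step_logits[yes_id] > first_step_logits[no_id] else "No"
```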