workmistm opened this issue 1 week ago
For discriminative tasks (id >= 1005), the expected response format is "YES"/"NO", but the model's responses do not follow this format. How should the data be processed into this format?
A commonly used method is to restrict the model's output vocabulary: for example, compare the logits the model assigns to the words "Yes" and "No" and select the word with the higher probability as the output. For black-box models, the output format can be constrained with an additional prompt instruction (e.g., "You may only reply 'Yes' or 'No'") or by phrasing the task as a multiple-choice question.
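For the black-box case, this can be sketched as an extra prompt constraint plus a tolerant parser that maps free-form replies onto the benchmark's "YES"/"NO" labels. This is a minimal illustration, not code from the benchmark repo; the function names and the exact prompt wording are my own assumptions.

```python
import re

def build_restricted_prompt(question: str) -> str:
    # Append an explicit output constraint for black-box models
    # (hypothetical wording; adjust to the model you are querying).
    return question + '\nAnswer with exactly one word: "Yes" or "No".'

def normalize_reply(reply: str):
    # Map a free-form reply onto the benchmark's "YES"/"NO" labels.
    # The word boundary (\b) prevents "Not sure" matching "no".
    match = re.match(r"(yes|no)\b", reply.strip().lower())
    if match:
        return match.group(1).upper()
    return None  # unparseable; caller may retry or count it as wrong
```

Replies that still ignore the constraint return `None`, so the caller can decide whether to retry the query or score the sample as incorrect.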
So we need to refactor the query for the discriminative task? I think this will introduce error when comparing with other models. What is your method for comparing against these models?
If you use the logits-comparison method, you do not need to refactor the query; this is how we evaluate open-source models in the paper. For black-box models such as GPT-4V, we experimentally found that adding a few output-restricting instructions did not significantly impact performance.
Could you please give me a code example for open-source models, so that I can experiment with it?
Could you provide me with the model you need to test and the corresponding code repository address?
Our model is based on InternVL2-8B.
I checked and confirmed that InternVL is built on the `transformers` architecture. You can use the parameters of the `model.generate` API that return per-step scores (`output_scores=True` together with `return_dict_in_generate=True`) to obtain the logits over the full vocabulary at the first generated token position. Then you only need to compare the scores of the two words "Yes" and "No"; you can obtain the token ids of these two words from the tokenizer.
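A minimal sketch of this idea follows. It is not the authors' evaluation script: the comparison helper works on any indexable array of logits, and the `model` / `tokenizer` / `inputs` objects in the commented usage are assumptions about an InternVL2-style checkpoint loaded via Hugging Face `transformers`.

```python
def yes_no_from_first_token_logits(first_token_logits, yes_ids, no_ids):
    """Compare the best logit among the 'Yes' token-id variants against
    the best among the 'No' variants and return the benchmark label.

    `first_token_logits` is indexable by token id (a list, numpy array,
    or 1-D torch tensor); `yes_ids` / `no_ids` are lists of ints."""
    yes_score = max(float(first_token_logits[i]) for i in yes_ids)
    no_score = max(float(first_token_logits[i]) for i in no_ids)
    return "YES" if yes_score >= no_score else "NO"


def collect_variant_ids(tokenizer, words=("Yes", "yes", "YES")):
    # Keep only casings that the tokenizer encodes as a single token, so
    # the first generated position fully determines the answer.
    ids = []
    for w in words:
        toks = tokenizer.encode(w, add_special_tokens=False)
        if len(toks) == 1:
            ids.append(toks[0])
    return ids


# Hypothetical usage with a transformers causal LM (names assumed):
#   out = model.generate(**inputs, max_new_tokens=1,
#                        output_scores=True, return_dict_in_generate=True)
#   first_token_logits = out.scores[0][0]   # [vocab_size], batch item 0
#   answer = yes_no_from_first_token_logits(
#       first_token_logits,
#       collect_variant_ids(tokenizer, ("Yes", "yes", "YES")),
#       collect_variant_ids(tokenizer, ("No", "no", "NO")))
```

Comparing raw logits is equivalent to comparing the softmax probabilities here, since softmax is monotonic and both ids are scored at the same position.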