vardaan123 closed this issue 2 months ago
Hi
Thanks for the awesome work! I am trying to evaluate Mistral-7B-Instruct-v0.3 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) using the same prompt as GPT-4, with temperature=0.7 and max tokens=500. However, the response error rate is quite high (>90%): the model either does not produce output in JSON format or simply repeats the input prompt. I also experimented with a lower temperature and a repetition penalty, but that didn't help. Could you clarify whether you used the same prompt when evaluating open-source models like Mistral-7B-Instruct (Table 5)?
Hi,
Thank you for your feedback and your efforts in replicating our experiments.
In our experiments, we used Together AI’s services with max tokens=512 and temperature=0.7. The prompts we used are consistent with those provided on our GitHub repository. We replicated the experiments on the Mind2Web-live test set yesterday and found that the results were closely aligned with those reported in our paper.
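For reference, here is a minimal sketch of how the model can be queried through Together AI's Python SDK with those settings. It is not our exact evaluation harness, and the model identifier string is an assumption; please check Together's model listing for the exact name.

```python
# Minimal sketch of querying the model via Together AI's Python SDK with the
# settings above (max_tokens=512, temperature=0.7). Not the exact evaluation
# code; the model identifier is an assumption.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # assumed identifier
    messages=[{"role": "user", "content": "<prompt from our GitHub repo>"}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```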
Could you please check whether there are any discrepancies in other parameters or settings? Which data are you using for evaluation? It's also possible that the model's performance varies with the specific setup or the instance accessed during your tests.
Looking forward to hearing from you to resolve this issue!
Thanks for the info! I am running inference with the model loaded locally on a GPU. I will recheck my generation parameters to make sure they match the Together AI configuration. I am using the Mind2Web-live test set for evaluation on a Linux instance in the US.
It is resolved now. It turns out I was using greedy decoding instead of sampling.
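For anyone who runs into the same problem, here is a minimal sketch of the change with Hugging Face transformers. The prompt is a placeholder and the parameter values simply mirror the settings discussed above; this is not the repository's actual inference code.

```python
# Minimal sketch of the fix: enable sampling instead of greedy decoding.
# Parameter values mirror the settings discussed in this thread.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "<same agent prompt used for GPT-4>"  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # the fix: greedy decoding (do_sample=False) ignores temperature
    temperature=0.7,
)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```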