iMeanAI / WebCanvas

Connect agents to live web environments for evaluation.
https://www.imean.ai/web-canvas
MIT License

Unable to replicate results for Mistral-7B-Instruct-v0.3 #20

Closed · vardaan123 closed this 2 months ago

vardaan123 commented 3 months ago

Hi

Thanks for the awesome work! I am trying to evaluate Mistral-7B-Instruct-v0.3 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) using the same prompt as for GPT-4, with temperature=0.7 and max tokens=500. However, the response error rate is quite high (>90%): the model does NOT predict in JSON format, or it repeats the input prompt. I also experimented with a lower temperature and a repetition penalty, but that didn't help. Could you clarify whether you used the same prompt when evaluating open-source models like Mistral-7B-Instruct (Table 5)?

han032206 commented 3 months ago

Hi,

Thank you for your feedback and your efforts in replicating our experiments.

In our experiments, we used Together AI's serving API with max tokens=512 and temperature=0.7. The prompts we used are the ones provided in our GitHub repository. We re-ran the experiments on the Mind2Web-live test set yesterday and found the results closely aligned with those reported in our paper.
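For reference, the settings described above can be written out as a small request-parameter fragment for an OpenAI-compatible chat-completions call to Together AI. This is only a sketch: the model identifier string mirrors the Hugging Face repo name and is an assumption, not something confirmed in this thread.

```python
# Sketch of the generation settings described above, expressed as the
# parameters for an OpenAI-compatible chat-completions request to Together AI.
# The "model" identifier is an assumption (it mirrors the HF repo name).
GEN_PARAMS = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",  # assumed identifier
    "max_tokens": 512,
    "temperature": 0.7,
}

print(GEN_PARAMS)
```

Matching these two values exactly when running the model locally should be the first thing to check.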

Could you please check whether any other parameters or settings differ? What data are you using for evaluation? It is also possible that the model's performance varies with the specific setup or the instance accessed during your tests.

Looking forward to hearing from you to resolve this issue!

vardaan123 commented 3 months ago

Thanks for the info! I am running inference by loading the model locally on a GPU. I will recheck my generation parameters to make sure they match the Together AI config. I am using the Mind2Web-live test set for evaluation on a Linux instance in the US.

vardaan123 commented 2 months ago

It is resolved now. It turns out I was using greedy decoding instead of sampling.
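For anyone hitting the same issue: with Hugging Face `model.generate()`, sampling must be explicitly enabled (`do_sample=True`); otherwise the call falls back to greedy decoding and the temperature setting has no effect. The difference can be illustrated with a minimal, self-contained sketch (not the repo's code) contrasting greedy argmax with temperature sampling over a toy logit vector.

```python
import math
import random

def next_token(logits, temperature=0.7, greedy=False, rng=None):
    """Choose a token index from raw logits: argmax if greedy,
    otherwise softmax-with-temperature sampling."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # stable softmax numerators
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

toy_logits = [2.0, 1.5, 0.5]
# Greedy decoding deterministically picks the argmax token every time:
print([next_token(toy_logits, greedy=True) for _ in range(5)])  # [0, 0, 0, 0, 0]
# Temperature sampling spreads choices across the distribution
# (typically all three tokens appear over many draws):
print(sorted({next_token(toy_logits, rng=random.Random(s)) for s in range(100)}))
```

Greedy decoding collapses every step to the single most likely token, which with instruction-tuned models can produce exactly the degenerate behavior reported above (prompt repetition, malformed JSON).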