Reproducing Rule-based baseline and not matching paper results

princeton-nlp / WebShop

[NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

https://webshop-pnlp.github.io

MIT License


Reproducing Rule-based baseline and not matching paper results #30

Open lihkinVerma opened 5 months ago

lihkinVerma commented 5 months ago

Hey Authors,

I tried replicating the results for the rule-based baseline. In the paper, the reported metrics are Score / SR = 45.8 / 19%, but when I replicate it I get the following values: Score / SR = 26.27 / 3.59%

The values of the individual reward components also do not match the paper's baseline. I obtained r_type: 0.5826, r_attr: 0.4108, r_option: 0.0, r_price: 0.0632
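For reference, here is a minimal sketch of how the two headline metrics can be computed from per-episode rewards. It assumes the paper's definitions: Score is 100 × the average reward, and SR is the share of episodes whose reward is exactly 1. The function name `summarize_rewards` and the sample rewards are hypothetical and are not taken from the WebShop code base.

```python
from typing import Sequence, Tuple


def summarize_rewards(rewards: Sequence[float]) -> Tuple[float, float]:
    """Return (score, success_rate_percent) for per-episode rewards in [0, 1].

    Assumed definitions (hypothetical helper, not part of WebShop):
      score = 100 * mean(reward)
      SR    = 100 * fraction of episodes with reward == 1
    """
    if not rewards:
        raise ValueError("need at least one episode reward")
    n = len(rewards)
    score = 100.0 * sum(rewards) / n
    # An episode counts as a success only when it earns the full reward.
    success_rate = 100.0 * sum(1 for r in rewards if r >= 1.0) / n
    return score, success_rate


# Made-up rewards for four episodes, purely for illustration.
example_rewards = [1.0, 0.5, 0.25, 0.0]
score, sr = summarize_rewards(example_rewards)
print(f"Score / SR = {score:.2f} / {sr:.2f}%")  # Score / SR = 43.75 / 25.00%
```

Running the same computation over the raw episode rewards from both setups would show whether the gap comes from the aggregation step or from the episodes themselves.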

Can you check for the anomaly?

ai-nikolai commented 3 months ago

Hey @lihkinVerma,

I am also interested in replicating scores.

How do you run the above? Specifically:

How do you run WebShop, i.e. which server / env do you use? (e.g. did you use ./setup.sh -d small and then ./run_dev.sh?)
How do you run the baseline?