princeton-nlp / WebShop

[NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
https://webshop-pnlp.github.io
MIT License

Score and Success Rate #17

Closed: innovator-arjun closed this issue 1 year ago

innovator-arjun commented 1 year ago

Question:

  1. I am training the RL agent using train_rl.py. How can I get the score and success rate metrics as per the paper?

Note: I am running the default configuration for reproducibility. This is the latest training log output:

| Metric | Value |
| --- | --- |
| EpisodeScore | 6.88 |
| EpisodeScore_beauty | 7.06 |
| EpisodeScore_electro... | 6.68 |
| EpisodeScore_fashion | 7.36 |
| EpisodeScore_garden | 6.12 |
| EpisodeScore_grocery | 7.39 |
| FPS | 3 |
| ItemsClicked | 78782 |
| Step | 259400 |
| action_Description | 276853 |
| action_Features | 276669 |
| action_Reviews | 237 |
| action_asin | 276410 |
| action_options | 83631 |
| action_purchase | 270265 |
| action_search | 279761 |
| advs | -0.00239 |
| cat_beauty | 60688 |
| cat_electronics | 47598 |
| cat_fashion | 55063 |
| cat_garden | 53171 |
| cat_grocery | 53880 |
| gradnorm_clipped | 670 |
| gradnorm_unclipped | 1.61e+03 |
| loss | 3.1 |
| loss_en | -0.0982 |
| loss_il | 0 |
| loss_pg | 0.0476 |
| loss_td | 3.15 |
| r_att | 0.781 |
| r_exact | 0.0112 |
| r_harsh | 0.326 |
| r_norm | 0.196 |
| r_option | 0.308 |
| r_price | 0.966 |
| r_type | 0.953 |
| r_visit | 0 |
| rank_item | 4.21 |
| returns | 4.99 |

  2. Regarding the run time, these are the SLURM settings I use to run the code:

    #SBATCH --cpus-per-task=16
    #SBATCH --gres=gpu:v100:1 # Ask for 1 GPU #v100
    #SBATCH --mem=64G
    #SBATCH --time=80:00:00

Even after running for 80 hours, train_rl.py fails at around 250K steps, short of the default 300K. However, the paper mentions a run time of around a day. Am I missing something? Please share your insights.
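For a rough sense of scale, the figures above can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes throughput stays constant over the whole run, and `projected_hours` is a hypothetical helper, not part of the WebShop code.

```python
def projected_hours(observed_steps: int, observed_hours: float, target_steps: int) -> float:
    """Linearly extrapolate wall-clock hours, assuming constant throughput."""
    if observed_steps <= 0:
        raise ValueError("observed_steps must be positive")
    return observed_hours * target_steps / observed_steps


# Figures quoted in this comment: roughly 250K steps within the 80-hour limit.
print(projected_hours(250_000, 80.0, 300_000))  # 96.0 hours for the default 300K steps
print(projected_hours(250_000, 80.0, 100_000))  # 32.0 hours for a 100K-step run
```

At this rate the default 300K steps would not fit in an 80-hour allocation, so the step count matters when comparing against the run time quoted in the paper.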

Thank you for your time

ysymyth commented 1 year ago

Hi,

  1. EpisodeScore is the train score and r_harsh is the train success rate. Once the first test run finishes, testScore is the test score and test_r_harsh is the test success rate.
  2. The paper reports results at 100K steps. Running beyond 100K steps might still bring improvements, which is why the default in the code is 300K.
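The mapping in point 1 can be applied to a pasted log dump with a short script. This is a minimal sketch that assumes the log is available as text in the `| name | value |` layout shown in the question. `parse_log_rows` and `summarize` are hypothetical helpers, not functions from the repository, and the values are returned in the logger's own units with no rescaling to the paper's tables.

```python
import re


def parse_log_rows(text: str) -> dict[str, float]:
    """Collect '| name | value |' pairs from a logger dump into a dict."""
    metrics = {}
    for name, value in re.findall(r"\|\s*([^|\s]+)\s*\|\s*([^|\s]+)\s*\|", text):
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # skip header or separator rows that are not numeric
    return metrics


def summarize(metrics: dict[str, float]) -> dict[str, float]:
    """Pick out the keys named in the reply; test keys appear only after a test run."""
    wanted = {
        "train_score": "EpisodeScore",
        "train_success_rate": "r_harsh",
        "test_score": "testScore",
        "test_success_rate": "test_r_harsh",
    }
    return {label: metrics[key] for label, key in wanted.items() if key in metrics}


sample = "| EpisodeScore | 6.88 | | r_harsh | 0.326 | | r_price | 0.966 |"
print(summarize(parse_log_rows(sample)))
```

With the sample row, only the two train entries are returned, because testScore and test_r_harsh are not present in that log.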

ysymyth commented 1 year ago

I'll close this for now, but feel free to reopen it if needed.