openai / lm-human-preferences

Code for the paper Fine-Tuning Language Models from Human Preferences
https://openai.com/blog/fine-tuning-gpt-2/
MIT License

Unable to access book and cnndm datasets #17

Open aypan17 opened 2 years ago

aypan17 commented 2 years ago

I tried accessing https://openaipublic.blob.core.windows.net/lm-human-preferences/datasets/cnndm/cache_{mode} and https://openaipublic.blob.core.windows.net/lm-human-preferences/datasets/book_passages/{mode}.jsonl, with {mode} replaced by 'train' in both, but neither of these links works.

Is the dataset still available?
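
In case it helps, here is roughly how I checked the two URLs (a minimal sketch using anonymous HTTPS requests; note that the cnndm path may be a directory-style prefix rather than a single blob, in which case a direct request to it would return 404 even if the cache exists):

# Minimal sketch: check whether the two dataset URLs above still resolve.
import requests

urls = [
    "https://openaipublic.blob.core.windows.net/lm-human-preferences/datasets/cnndm/cache_train",
    "https://openaipublic.blob.core.windows.net/lm-human-preferences/datasets/book_passages/train.jsonl",
]
for url in urls:
    # 200 means the blob exists; 404 means it is missing (or the path is only a prefix).
    print(requests.head(url).status_code, url)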

WuTheFWasThat commented 2 years ago

oh wow, I think when we migrated from google cloud to azure, we must've missed the books dataset. I'm not sure if it's recoverable anymore, sorry :(

it should be fine to use some other dataset of book or short story snippets though
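
if anyone does go that route, below is a minimal sketch of writing a replacement train.jsonl. The one-JSON-value-per-line schema is an assumption on my part, so check the books dataset loader in this repo (lm_human_preferences/datasets/books.py) for what it actually expects, and point the loader at your file (or wherever you host it) instead of the missing Azure blob.

# Sketch: build a replacement for datasets/book_passages/{mode}.jsonl from any
# collection of book or short-story snippets.
# Assumption: each line is one JSON-encoded text passage; verify against
# lm_human_preferences/datasets/books.py before relying on this.
import json

def write_passages_jsonl(passages, path):
    # One JSON value per line, UTF-8, so each line round-trips through json.loads(line).
    with open(path, "w", encoding="utf-8") as f:
        for passage in passages:
            f.write(json.dumps(passage) + "\n")

# Placeholder data; substitute your own snippets.
write_passages_jsonl(
    ["Once upon a time ...", "It was a dark and stormy night ..."],
    "train.jsonl",
)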

I still see the cnndm cache though: az storage blob list -c lm-human-preferences --account-name openaipublic shows tons of stuff. (I like to use the tool https://github.com/hauntsaninja/boostedblob instead; the Azure CLI can be unwieldy.)
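
for anyone who would rather not install a CLI at all, the same listing can be done over the public REST endpoint; a minimal sketch, assuming the container really does allow anonymous listing (which the command above suggests):

# Minimal sketch: list blobs under a prefix via the Azure "List Blobs" REST endpoint,
# without the Azure CLI. Assumes anonymous listing is allowed on the container.
# Pagination via NextMarker is not handled here.
import xml.etree.ElementTree as ET
import requests

CONTAINER_URL = "https://openaipublic.blob.core.windows.net/lm-human-preferences"

def list_blobs(prefix, max_results=20):
    resp = requests.get(
        CONTAINER_URL,
        params={"restype": "container", "comp": "list",
                "prefix": prefix, "maxresults": str(max_results)},
    )
    resp.raise_for_status()
    # The response is XML; blob names are the <Name> elements.
    return [el.text for el in ET.fromstring(resp.content).iter("Name")]

for name in list_blobs("datasets/cnndm/cache_train/"):
    print(name)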

saschaschramm commented 1 year ago

It's still possible to download the files.

Steps on Mac:

  1. Install the Azure CLI: brew update && brew install azure-cli
  2. List all files: az storage blob list -c lm-human-preferences --account-name openaipublic
  3. Download a file: az storage blob download --container-name lm-human-preferences --name "datasets/cnndm/cache_train/cnn/stories/bcddbf45babf27d12d146eb9c0163f70e2572b91.story" --file my_local_file.txt --account-name openaipublic

(downloaded file attached: my_local_file.txt)
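
If the blobs are publicly readable (the commands above suggest they are), a single file can also be fetched with plain HTTPS; a sketch using the same blob name as in step 3:

# Minimal sketch: download one blob over HTTPS instead of using the Azure CLI.
import requests

BASE = "https://openaipublic.blob.core.windows.net/lm-human-preferences"
name = "datasets/cnndm/cache_train/cnn/stories/bcddbf45babf27d12d146eb9c0163f70e2572b91.story"

with requests.get(f"{BASE}/{name}", stream=True) as resp:
    resp.raise_for_status()
    with open("my_local_file.txt", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)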

vwxyzjn commented 1 year ago

Hi all,

FWIW, I was able to get things running on my fork. You can do

git clone https://github.com/vwxyzjn/lm-human-preferences.git
cd lm-human-preferences
git checkout poetry

and follow the instructions at https://github.com/vwxyzjn/lm-human-preferences/blob/poetry/setup.sh.

The setup is hyper-specific; see setup.sh above for the exact steps.

Here are two tracked experiments: https://wandb.ai/openrlbenchmark/lm-human-preferences/runs/0o9kgqb5 and https://wandb.ai/openrlbenchmark/lm-human-preferences/runs/fmh1zze9

Once it's set up, it seems to work though. On the sentiment task, it reaches a score of ~2 after 500k episodes, which as far as I can tell reproduces the results in their paper (with 5,000 labels; OAI did not release the 20k or 60k labels).


I will see if I can run some curves and release them as part of https://github.com/openrlbenchmark to create charts such as the one below.

[example comparison chart]
WuTheFWasThat commented 1 year ago

thanks @vwxyzjn! if any of your changes should go upstream, happy to accept PRs :)

vwxyzjn commented 1 year ago

For those interested, here are some metrics for sentiment, descriptiveness, and tldr for 10 random seeds.

Wandb report is here: https://wandb.ai/costa-huang/cleanrl/reports/Regression-Report-124M--Vmlldzo0ODM3NTI5

@WuTheFWasThat For some reason, cnndm could not run, so I might make a PR a bit later. There are already plenty of goodies to learn though :) Thanks for this helpful codebase.

The chart below can be reproduced with:

pip install openrlbenchmark==0.2.1a4
python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=openrlbenchmark&wpn=lm-human-preferences&ceik=task_id&cen=task.value.policy.initial_model&metrics=ppo/objective/score&metrics=ppo/objective/kl&metrics=ppo/ppo/loss/policy&metrics=ppo/ppo/val/mean&metrics=ppo/ppo/policy/entropy&metrics=ppo/ppo/policy/approxkl&metrics=ppo/ppo/val/error&metrics=ppo/ppo/loss/total&metrics=ppo/ppo/returns/mean&metrics=train_reward/minibatch/loss&metrics=ppo/ppo/val/vpred&metrics=ppo/ppo/loss/value&metrics=ppo/ppo/val/var_explained&metrics=ppo/objective/score_total&metrics=train_reward/minibatch/error&metrics=ppo/elapsed/fps&metrics=ppo/global_step&metrics=ppo/ppo/policy/clipfrac&metrics=ppo/ppo/val/var&metrics=ppo/ppo/val/clipfrac&metrics=ppo/objective/entropy&metrics=ppo/ppo/returns/var&metrics=ppo/objective/kl_coef&metrics=ppo/elapsed/time' \
        '124M' \
    --env-ids sentiment descriptiveness tldr \
    --check-empty-runs \
    --pc.ncols 5 \
    --pc.ncols-legend 1 \
    --output-filename static/0compare \
    --scan-history --report

[generated chart: static/0compare]

liutianlin0121 commented 1 year ago

@vwxyzjn Hi Costa, thanks for sharing the awesome reproduction results! I am trying to reproduce OAI's results myself, and your re-implementation in PyTorch is a lifesaver.

"On the sentiment task, it reaches a score of ~2 after 500k episodes, which as far as I can tell reproduces the results in their paper (with 5,000 labels; OAI did not release the 20k or 60k labels)."

Regarding this, I think there could be a subtle difference between the reproduced result and the outcome presented in Figure 2 of the paper. Correct me if I am wrong: the results in the paper used mock labels, whereas the reward learning based on offline_5k.json uses real labels from humans, so the trained reward models can differ. But since a version of Figure 2 trained with human labels isn't accessible, your comparison seems like a reasonable proxy.