openai / gpt-2-output-dataset

Dataset of GPT-2 outputs for research in detection, biases, and more
MIT License

Release a dataset using nucleus (top-p) sampling #5

Closed · davisyoshida closed this issue 4 years ago

davisyoshida commented 5 years ago

This preprint (https://arxiv.org/abs/1904.09751) demonstrates that nucleus (top-p) sampling generally produces better text than either top-k or full sampling. This matches my experience with both the 117M and 345M parameter models. Releasing a dataset of outputs sampled with this method would be very helpful, especially since it seems able to generate good outputs without falling prey to the pronoun overuse and other issues mentioned in the README.
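
For context, here is a minimal sketch of nucleus (top-p) sampling as described in the preprint: keep only the smallest set of tokens whose cumulative probability exceeds `top_p`, renormalize, and sample from that set. This is an illustration only, not the sampling code used by this repo or by openai/gpt-2.

```python
import numpy as np

def nucleus_sample(logits, top_p=0.9, rng=None):
    """Sample a token id from `logits`, restricted to the smallest set of
    tokens whose cumulative probability exceeds `top_p`."""
    rng = rng or np.random.default_rng()
    # Softmax over the vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort tokens by probability, descending.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass exceeds top_p.
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return rng.choice(kept, p=kept_probs)
```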

daphnei commented 5 years ago

Pinging on this. Could you give us an update on whether there are any plans to do this?

WuTheFWasThat commented 4 years ago

This has been done now. The paths are like: gs://gpt-2/output-dataset/v1/large-762M-nucleus.train.jsonl

The top_p value was randomly sampled between 0.8 and 1.0 (the exact value is listed in each row).
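
As a reference, a minimal sketch of reading one of these files locally, assuming it has been copied down (e.g. with `gsutil cp` from the path above) and that each JSON row carries its own `top_p` value alongside the generated text; the field names here are assumptions, not confirmed by this thread:

```python
import json

# Hypothetical local copy, e.g. after:
#   gsutil cp gs://gpt-2/output-dataset/v1/large-762M-nucleus.train.jsonl .
PATH = "large-762M-nucleus.train.jsonl"

with open(PATH, encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        # "top_p" and "text" are assumed field names; each row is expected
        # to list the exact top_p used for that sample.
        print(row.get("top_p"), (row.get("text") or "")[:80])
```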