Closed davisyoshida closed 4 years ago
Ping on this. Could you give us an update if there are any plans to do this?
This has been done now. The paths are like: gs://gpt-2/output-dataset/v1/large-762M-nucleus.train.jsonl
The top_p value for each sample was drawn uniformly from the range 0.8 to 1.0 (the exact value used is listed in each row).
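Since the files are JSON Lines, each row can be parsed independently. Here is a minimal sketch of reading such a file and checking the per-row top_p value; the field names `"text"` and `"top_p"` and the inline sample rows are assumptions for illustration, not confirmed from the actual dataset schema:

```python
import io
import json

# Hypothetical rows mimicking the JSONL layout of files like
# gs://gpt-2/output-dataset/v1/large-762M-nucleus.train.jsonl
# (field names are assumed for this sketch).
sample_file = io.StringIO(
    '{"text": "first generated sample", "top_p": 0.93}\n'
    '{"text": "second generated sample", "top_p": 0.85}\n'
)

rows = [json.loads(line) for line in sample_file]
for row in rows:
    # top_p was sampled between 0.8 and 1.0, so every row should fall in that range
    assert 0.8 <= row["top_p"] <= 1.0
print(len(rows))
```

With a real file you would replace the `StringIO` with `open(path)` and iterate the same way.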
This preprint (https://arxiv.org/abs/1904.09751) demonstrates that nucleus sampling generally produces better text than either top-k or full sampling. That matches my experience with both the 117M and 345M parameter models. Releasing a dataset of outputs sampled with this method would be very helpful, especially since it seems able to generate good outputs without falling prey to the pronoun overuse etc. mentioned in the README.
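For reference, the sampling method requested here (nucleus / top-p sampling from Holtzman et al.) keeps only the smallest set of tokens whose cumulative probability exceeds top_p, renormalizes, and samples from that set. A minimal NumPy sketch, not the code OpenAI used to generate the dataset:

```python
import numpy as np

def nucleus_sample(logits, top_p, rng):
    """Sample a token id, restricting to the smallest set of tokens
    whose cumulative probability exceeds top_p (nucleus sampling)."""
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # most to least likely
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, top_p) + 1   # smallest nucleus covering top_p mass
    nucleus = order[:cutoff]
    # Renormalize within the nucleus and sample.
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))

rng = np.random.default_rng(0)
logits = np.array([3.0, 2.0, 0.5, -1.0])
token = nucleus_sample(logits, top_p=0.9, rng=rng)
```

With top_p=0.9 on these logits, the two lowest-probability tokens fall outside the nucleus, so only token ids 0 and 1 can ever be drawn; top-k with k=2 would happen to behave the same here, but the nucleus size adapts to how peaked the distribution is.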