openai / gpt-2-output-dataset

Dataset of GPT-2 outputs for research in detection, biases, and more
MIT License

WebText Dataset format #15

Closed. loretoparisi closed this issue 4 years ago

loretoparisi commented 4 years ago

What is the meaning of length and ended in the dataset lines? For example:

{"id": 1, "ended": true, "length": 66, "text": "LeSean McCoy going through warmups with first team offense. To my eye, does not look close to 100 percent when cutting and exploding.\n\nABOUT COOKIES\n\nTo help make this website better, to improve and personalize your experience and for advertising purposes, are you happy to accept cookies and other technologies?"}

I can also see that some texts contain newlines followed by indexes, as in:

{"id": 0, "ended": true, "length": 138, "text": "These girlfriends deserves a special mention for going that extra mile, hopefully doesn't set too many guys off on the path towards outrageous demands.\n\n1. She knows the severity of man-flu\n\n2. All fun and games is all good\n\n3. A voucher that says 'I love you'\n\n4. When arguments don't drag on forever.\n\n5. Providing everything he needs.\n\n6. Very understanding\n\n7. As awesome a gesture as this is, we are worried about this man's cooking skills.\n\n8. Nice cake\n\n8. Fair bargaining\n\n9. Excellent gift choice\n\n10. Very thoughtful"}

So the text contains sequences such as \n\n3. through \n\n8. What do these mean? Is it just a scraped questionnaire-style document?

I can see that the detector does not use this information anyway: https://github.com/openai/gpt-2-output-dataset/blob/master/detector/dataset.py#L17

Thank you.

WuTheFWasThat commented 4 years ago

length is the length in BPE tokens (see the GPT-2 paper for information on the tokenization scheme). ended indicates whether the sample contained an endoftext token (and was truncated at it).
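
For anyone reading these files, here is a minimal sketch of how the two fields could be inspected using only the Python standard library. It assumes each line is one JSON object with the id, ended, length, and text keys shown in the question. The helper names (parse_records, summarize) and the third sample record with "ended": false are hypothetical and exist only for illustration; the text values are shortened placeholders, not real dataset content.

```python
import json

# Sample lines in the format quoted in the question. The text values are
# shortened placeholders, and record 2 is a made-up example of a sample
# that did not end with an endoftext token.
SAMPLE_LINES = [
    '{"id": 0, "ended": true, "length": 138, "text": "These girlfriends ..."}',
    '{"id": 1, "ended": true, "length": 66, "text": "LeSean McCoy ..."}',
    '{"id": 2, "ended": false, "length": 1024, "text": "..."}',
]


def parse_records(lines):
    """Parse JSON Lines input, skipping blank lines (hypothetical helper)."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records):
    """Count how many samples ended with endoftext and collect length stats."""
    lengths = [r["length"] for r in records]  # lengths are in BPE tokens
    ended = sum(1 for r in records if r["ended"])
    return {
        "count": len(records),
        "ended": ended,
        "not_ended": len(records) - ended,
        "max_length": max(lengths) if lengths else 0,
    }


if __name__ == "__main__":
    print(summarize(parse_records(SAMPLE_LINES)))
```

To run this against a real file, the list could be replaced by an open file handle, since iterating over a file yields one line at a time.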