openai / gpt-2-output-dataset

Dataset of GPT-2 outputs for research in detection, biases, and more
MIT License

Questions about the meaning of data set attribute representation #30

Open zh57398 opened 3 years ago

zh57398 commented 3 years ago

Regarding your dataset: does the "length" attribute represent the length of the "text" attribute, or something else? I don't think it is the length of "text". For example, in the file "medium-345M-k40.train.jsonl" a record has "length" = 1024, but the length of its "text" that I calculated is 4750. So I would like to know what the "length" attribute means. I look forward to your reply. Thank you very much.

DaveXanatos commented 3 years ago

If you're referring to the length parameter as per this:

def interact_model(
    model_name='345M',  # 345M/774M only on a Pi 4B 8GB (memory allocation issue); 1558M is too big for the Pi 4B 8GB
    seed=None,
    nsamples=1,
    batch_size=1,
    length=140,
    temperature=1.2,
    top_k=48,
    top_p=0.7,
    models_dir='models',
):

Then length is the maximum number of tokens the output will contain. I keep mine short and sweet at a max length of 140 because I use GPT-2 to generate conversational responses for my robots. But if you want it to write an article, it certainly can...

zh57398 commented 3 years ago

First of all, thank you very much for your reply, but I still don't understand. I can see that 1024 would be the maximum length, and I take the "text" attribute to be the text generated by GPT-2. Is that understanding correct? If it is, I would expect the "length" attribute to equal the length of "text". In the dataset you provided, however, the length of "text" that I calculated does not match the value of the "length" attribute, so I would like to know what "length" stands for. Looking forward to your help and reply.
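
For anyone who wants to reproduce the comparison described above, here is a minimal sketch that prints the stored "length" next to the character count and the whitespace word count of "text" for each record. The helper name `compare_lengths` and the two sample records (including their "length" values) are made up for illustration; only the "text" and "length" field names come from this thread. The figures quoted above (1024 against 4750 characters) would be consistent with "length" counting tokens rather than characters, since 1024 is GPT-2's context size in tokens, but nothing in this thread confirms that.

```python
import json


def compare_lengths(jsonl_lines):
    """Yield (stored_length, char_count, word_count) for each JSONL record.

    `jsonl_lines` is any iterable of strings, e.g. an open file object.
    Blank lines are skipped.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record["text"]
        # len(text) counts characters; text.split() counts whitespace-separated words.
        yield record["length"], len(text), len(text.split())


# Hypothetical sample standing in for a real .jsonl file.
# The "length" values here are invented for the example, not real data.
sample = [
    json.dumps({"id": 0, "text": "Hello world, this is a test.", "length": 8}),
    json.dumps({"id": 1, "text": "Short.", "length": 2}),
]

for stored, chars, words in compare_lengths(sample):
    print(f"length={stored}  chars={chars}  words={words}")
```

To run it against a real file, pass the open file instead of `sample`, for example `with open(path) as f: rows = list(compare_lengths(f))`, where `path` points at one of the dataset's .jsonl files. If neither the character count nor the word count matches "length", counting tokens with a GPT-2 tokenizer would be the next thing to check.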