microsoft / DialoGPT

Large-scale pretraining for dialogue
MIT License

Understanding the train.tsv file #28

Closed wise-east closed 4 years ago

wise-east commented 4 years ago

After running python demo.py --data small and looking at the resulting train.tsv file, I want to make sure I have the correct understanding of the format and what the float values indicate.

For example, the first two examples look like:

t3_17830,t1_c24,t1_c40  0.0 On the bright side , despite kidnapping and cruelly abandoning him , it doesn't sound like he was tortured ...  1.0 We didn't torture somebody ! USA
t3_17844,t1_c88,t1_c95  1.0 will comments dissapear if ranked low enough ? I can just see the pages with 5000 comments now ..   1.0 not yet , but we'll play around with it

From the paper, I see that there was some heavy preprocessing and filtering done, such as removing offensive and bland training instances. Are the sequences prepended with 0.0 the filtered instances that will not be used to update the weights during training? Based on my understanding of the code, the weight 0.0 ensures this by setting the language modeling labels to -1:

https://github.com/microsoft/DialoGPT/blob/18d91ce5a4e1c32e2b097829c5c3de5135879420/prepro.py#L108-L110
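For context, a minimal sketch of what those linked lines do (hypothetical helper name and an assumed ignore value of -1, matching the snippet above): a 0.0 weight masks every label position, so the cross-entropy loss skips that turn entirely.

```python
IGNORE_INDEX = -1  # label value the loss function is configured to skip


def mask_labels(token_ids, weight):
    """Return LM labels for one turn; a 0.0 weight masks all positions
    so the turn contributes nothing to the gradient."""
    if weight == 0.0:
        return [IGNORE_INDEX] * len(token_ids)
    return list(token_ids)


# a 0.0-weighted turn yields all -1 labels
print(mask_labels([50256, 318, 257], 0.0))  # [-1, -1, -1]
print(mask_labels([50256, 318, 257], 1.0))  # [50256, 318, 257]
```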

What I'm confused about is that I can't seem to find where the training process ignores the prepended identifiers on each line (e.g. t3_17830,t1_c24,t1_c40). How does this part of the training data get ignored?

intersun commented 4 years ago

Thanks for pointing out the bug. The identifier on each line, i.e., t3_17830,t1_c24,t1_c40 etc., is supposed to be removed before running prepro.py.
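One way to strip that leading column (a hypothetical helper, not part of the repo): drop everything before the first tab, since the identifiers and the dialogue turns are tab-separated.

```python
def strip_identifier(line):
    """Remove the leading comma-joined Reddit IDs (e.g. 't3_...,t1_...')
    before the first tab, keeping the weighted turns."""
    return line.split("\t", 1)[1]


line = "t3_17830,t1_c24,t1_c40\t0.0 On the bright side ...\t1.0 We didn't torture somebody ! USA"
print(strip_identifier(line))  # starts with "0.0 On the bright side ..."
```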

Please leave a comment if you have more questions.

ferdinando17 commented 4 years ago

Hello, thanks for releasing the work.

What is the meaning of 0.0 and 1.0? Can you confirm that it is what wise-east says?

wise-east commented 4 years ago

@ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.

Aman-4-Real commented 2 years ago

> @ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.

I had a look at the sessions containing 0.0 utterances, and I found that most of them contain violence, pornography, inappropriate expressions, etc. I agree with you, so I just kept all the sessions without any 0.0 utterances, which resulted in around 100 million sessions (out of the 146,846,215 mentioned in the README).
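A minimal sketch of that filtering step (hypothetical helper, assuming each tab-separated turn starts with its weight as described above): keep a session only if none of its turns carry weight 0.0.

```python
def keep_session(line):
    """True if no turn in the tab-separated session is weighted 0.0.
    Assumes identifiers were already stripped, so every field is
    '<weight> <text>'."""
    turns = line.rstrip("\n").split("\t")
    return all(not turn.startswith("0.0 ") for turn in turns)


print(keep_session("1.0 will comments disappear ?\t1.0 not yet"))  # True
print(keep_session("0.0 On the bright side ...\t1.0 USA"))         # False
```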