wise-east closed this issue 4 years ago
Thanks for pointing out the bug. The identifiers at the start of each line (e.g., t3_17830,t1_c24,t1_c40) are supposed to be removed in order to run prepro.py.
Please leave a comment if you have more questions.
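A minimal sketch of the stripping step described above. It assumes the identifiers appear as a comma-separated prefix of `t3_`/`t1_` tokens at the start of each line (the helper name and regex are illustrative, not part of the repo); adjust the pattern if your dump uses a different layout.

```python
import re

# Matches a leading run of Reddit-style identifiers such as
# "t3_17830,t1_c24,t1_c40," at the start of a line (hypothetical layout).
ID_PREFIX = re.compile(r"^(?:t[13]_\w+,?)+")

def strip_ids(line: str) -> str:
    """Remove the leading identifier list, keeping the rest of the line."""
    return ID_PREFIX.sub("", line, count=1)
```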
Hello, thanks for releasing the work.
What is the meaning of 0.0 and 1.0? Can you confirm that it is what wise-east says?
@ferdinando17 I'm pretty sure my understanding is correct given Hugging Face's documentation.
I had a look at these sessions containing 0.0 utterances, and I found that most of them contain violence, pornography, inappropriate expressions, etc. I agree with you, and I just picked all the sessions without 0.0 utterances, which resulted in around 100 million sessions (out of the 146,846,215 mentioned in the README).
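The filtering step above can be sketched as follows. This assumes each line of the TSV is one session of tab-separated turns, with each turn prefixed by its float weight (a hypothetical layout based on the demo's train.tsv; the function names are illustrative).

```python
def session_is_clean(line: str) -> bool:
    """True if no turn in the tab-separated session carries weight 0.0."""
    return all(not turn.startswith("0.0 ")
               for turn in line.rstrip("\n").split("\t"))

def filter_sessions(in_path: str, out_path: str) -> None:
    """Copy only the sessions whose turns all have nonzero weight."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if session_is_clean(line):
                fout.write(line)
```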
After running
python demo.py --data small
and looking at the resulting train.tsv file, I want to make sure I have the correct understanding of the format and what the float values indicate. For example, the first two examples look like:
From the paper, I see that there was some heavy preprocessing and filtering done, such as removing offensive and bland training instances. Are the sequences prepended with 0.0 the filtered instances that will not be used to update the weights during training? Based on my understanding of the code, the weight 0.0 ensures this by setting the language modeling labels to -1:
https://github.com/microsoft/DialoGPT/blob/18d91ce5a4e1c32e2b097829c5c3de5135879420/prepro.py#L108-L110
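A minimal sketch of the masking idea referenced above: when a turn's weight is 0.0, its language modeling labels are replaced with -1, so a cross-entropy loss called with `ignore_index=-1` skips those positions entirely. The function name here is illustrative, not the repo's.

```python
def make_labels(token_ids: list[int], weight: float) -> list[int]:
    """Return LM labels for one turn: the token ids themselves when the
    turn is weighted 1.0, or -1 (the ignore index) when it is 0.0."""
    return [tid if weight == 1.0 else -1 for tid in token_ids]
```

With labels of -1 at every position, the turn contributes zero gradient, which matches the behavior described for the 0.0-weighted instances.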
What I'm confused about is that I can't seem to find how the training process ignores the prepended identifiers on each line (e.g.,
t3_17830,t1_c24,t1_c40
). How does this part of the training data get ignored?