Closed richhh520 closed 5 months ago
I am wondering whether the training data of GPT-2 overlaps with ECHR/Enron.

This question can't be answered conclusively because GPT-2 was trained on the WebText dataset, which isn't publicly available.
It's possible that some ECHR and Enron data made their way into WebText. This is why, when we analyze the leakage of models fine-tuned from GPT-2, we discount baseline leakage by ignoring any PII that would be leaked by the base GPT-2 model (see Section III.D in the paper for more details).
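The baseline-discounting step could be sketched roughly as follows — a minimal illustration, not the paper's actual implementation; the function name and the toy PII strings are made up for the example:

```python
def discounted_leakage(pii_finetuned, pii_base):
    """Return PII leaked by the fine-tuned model but NOT by the base model.

    Any PII the base GPT-2 model would leak on its own is treated as
    baseline leakage and ignored, so only leakage attributable to
    fine-tuning remains.
    """
    return set(pii_finetuned) - set(pii_base)

# Toy example with made-up PII strings:
leaked_ft = {"alice@example.com", "bob@example.com", "555-0100"}
leaked_base = {"bob@example.com"}  # hypothetically already leaked by base GPT-2
print(sorted(discounted_leakage(leaked_ft, leaked_base)))
```

The set difference captures the idea: only PII that appears in the fine-tuned model's extractions but not in the base model's is counted as leakage from fine-tuning.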