microsoft / analysing_pii_leakage

The repository contains the code for analysing the leakage of personally identifiable information (PII) from the output of next-word prediction language models.

MIT License

74 stars, 17 forks

Training data Overlap #12

Closed richhh520 closed 5 months ago

richhh520 commented 6 months ago

I am wondering whether the training data of GPT-2 overlaps with the ECHR/Enron datasets.

s-zanella commented 5 months ago

This question can't be answered conclusively because GPT-2 was trained on the WebText dataset, which isn't publicly available.

It's possible that some ECHR and Enron data made its way into WebText. This is why, when we analyze the leakage of models fine-tuned from GPT-2, we discount baseline leakage by ignoring any PII that the base GPT-2 model would also leak (see Section III.D in the paper for more details).
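
For readers who want a concrete picture of the baseline discount described above, here is a minimal sketch. It is not the repository's actual code: the function names, the toy PII strings, and the choice of recall denominator are all assumptions made for illustration. It treats leaked PII as sets of strings and removes anything the base model also produces.

```python
def discount_baseline_leakage(finetuned_pii, base_pii):
    """Return PII generated by the fine-tuned model but NOT by the base model.

    Hypothetical helper: both arguments are iterables of PII strings that
    were extracted from text sampled from the respective model.
    """
    return set(finetuned_pii) - set(base_pii)


def extraction_recall(leaked_pii, training_pii):
    """Fraction of the fine-tuning set's PII that appears in leaked_pii.

    Assumption: the denominator is all PII in the fine-tuning data; the
    paper may define the metric differently.
    """
    training = set(training_pii)
    if not training:
        return 0.0
    return len(set(leaked_pii) & training) / len(training)


# Toy data, invented for this example only.
finetuned_generated = {"Alice Smith", "John Doe", "London"}
base_generated = {"London"}  # the base GPT-2 model already emits this
training_pii = {"Alice Smith", "John Doe", "London", "Mary Major"}

leaked = discount_baseline_leakage(finetuned_generated, base_generated)
recall = extraction_recall(leaked, training_pii)
print(sorted(leaked), recall)
```

In this toy case "London" is discounted because the base model produces it too, so only the two names count as leakage attributable to fine-tuning.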