google-research / lm-extraction-benchmark

Apache License 2.0
271 stars 19 forks

Can you offer a list of models trained on The Pile? #4

Open nonstopfor opened 2 years ago

nonstopfor commented 2 years ago

Can you offer a list of models trained on The Pile, since querying these models is not allowed?

nonstopfor commented 2 years ago

As the readme said: "Querying other models trained on The Pile (other than the provided 1.3B GPT-Neo model) is not allowed. The reasoning for this is that larger models exhibit more memorization. Querying models trained on other datasets that do not significantly overlap with The Pile is allowed."

So I want to know the exact list of models trained on The Pile that participants are not allowed to query.

nonstopfor commented 2 years ago

I have no more questions on this topic. But I have another question after trying the baseline method: I found that it can generate multiple very similar suffixes conditioned on one prefix; sometimes the differences are only a few meaningless tokens, like spaces. So I want to know whether there could be multiple possible suffixes in the Pile dataset when giving one prefix. Currently only one suffix is offered as the answer.
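As an aside, near-duplicate generations like this can be collapsed before scoring. A minimal sketch (not part of the benchmark code; the helper names are made up for illustration) that normalizes whitespace so suffixes differing only in spacing count as one candidate:

```python
import re


def normalize(suffix: str) -> str:
    """Collapse runs of whitespace so suffixes that differ only in
    spacing map to the same canonical form."""
    return re.sub(r"\s+", " ", suffix).strip()


def dedupe_suffixes(suffixes):
    """Keep the first occurrence of each whitespace-normalized suffix,
    preserving generation order."""
    seen, unique = set(), []
    for s in suffixes:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


candidates = ["the quick  brown fox", "the quick brown fox", "a different suffix"]
print(dedupe_suffixes(candidates))  # only two distinct candidates survive
```

A stronger variant could also strip punctuation or compare token IDs, depending on what "meaningless differences" look like for a given tokenizer.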

pluskid commented 2 years ago

@nonstopfor Regarding "whether there could be multiple possible suffixes in the Pile dataset when giving one prefix": We constructed the dataset by selecting only examples that are "well specified", in the sense that given the prefix, there is only one continuation such that the entire sequence is contained in the training dataset. @carlini It seems we did not include this detail in the README. Shall we add it?
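The "well specified" filter described above can be sketched roughly as follows (a toy illustration over strings, not the actual dataset-construction code):

```python
from collections import defaultdict


def well_specified_examples(sequences, prefix_len):
    """Group corpus sequences by their prefix and keep only prefixes
    with exactly one continuation, i.e. 'well specified' examples where
    the prefix uniquely determines the suffix within the corpus."""
    continuations = defaultdict(set)
    for seq in sequences:
        continuations[seq[:prefix_len]].add(seq[prefix_len:])
    return {p: next(iter(c)) for p, c in continuations.items() if len(c) == 1}


corpus = ["abcx", "abcy", "defz"]
print(well_specified_examples(corpus, 3))  # "abc" is dropped: two continuations
```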

heitikei commented 2 years ago

May I ask to reframe the issue?

Why do you need a list of models trained on The Pile? Do you need to extract data that is not physically connected to the same or a similar server hub, or not saved in a retrievable format? This set will not include data behind paywalls, intellectual property, or data currently unavailable online.

Premise: all data on the internet forms sensible "data piles" (thank you, Linus and Swartz).

carlini commented 2 years ago

@pluskid yes. We should put that into the README! Sorry I don't know how we forgot to say that.

carlini commented 2 years ago

@nonstopfor We'll try to put together a list of models trained on The Pile. But I don't claim to know every model that has been trained, so I'd feel uncomfortable forbidding only a certain set of models. If you'd like to use a model and you're not sure, I'd suggest looking at the model card (which is supposed to discuss training data) or the original paper. I understand that this can be messy: what if some model trains on GitHub but not The Pile, for example? If you have any questions about specific models you'd like to use that might overlap, just raise an issue to ask about it.
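A crude first-pass screen of a model card could look like the sketch below. The marker list is illustrative, not exhaustive (Pile-CC, Books3, and OpenWebText2 are known Pile components), and a negative result proves nothing: check the paper, and ask in an issue if unsure.

```python
def mentions_pile(model_card_text: str) -> bool:
    """Flag a model card whose text mentions The Pile or some of its
    well-known components. Illustrative marker list; False here does
    NOT guarantee the training data is free of Pile overlap."""
    markers = ["the pile", "pile-cc", "books3", "openwebtext2"]
    text = model_card_text.lower()
    return any(m in text for m in markers)


print(mentions_pile("This model was trained on The Pile."))  # True
print(mentions_pile("This model was trained on C4."))        # False
```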

nonstopfor commented 2 years ago

Thanks for your reply. I think it would be nice to list some popular models trained on The Pile, which would help reduce the burden on participants. For less commonly used models, participants can raise an issue to ask about them.