openai / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
https://openai.com/blog/better-language-models/

Release The Full Model! #16

Open superjayman opened 5 years ago

superjayman commented 5 years ago

I understand your concerns, but I still think it's better to release the full model now and let people poke at its abilities and discover potential issues quicker.

WuTheFWasThat commented 5 years ago

Thanks for raising the issue! People have expressed similar sentiment internally and we take that argument seriously. Would love to see people start investigations with the small model and we will be re-evaluating our release of the larger models in the future.

WuTheFWasThat commented 5 years ago

Actually, it seems more correct to leave this issue open :)

yzho0907 commented 5 years ago

Please release models that support more languages.

gabefair commented 5 years ago

Better safe than sorry. If the experts want caution, the least we can do is respect their judgement.

Franck-Dernoncourt commented 5 years ago

https://blog.openai.com/better-language-models/:

> We will further publicly discuss this [model release] strategy in six months. If you’d like to discuss large language models and their implications, please email us at: languagequestions@openai.com.

roschler commented 5 years ago

Will you be releasing the English speaking unicorns to the public?

WuTheFWasThat commented 5 years ago

We don't know enough about unicorns to say they aren't dangerous. We will release a unicorn fetus for the scientific community to study for now, and re-evaluate later.

superjayman commented 5 years ago

It's a pity. Let me remind you of the name 'OpenAI' - well, not so open, is it?

yzho0907 commented 5 years ago

@superjayman Agreed. Openness and sharing are the core of open innovation; releasing everything does no harm and only speeds up improvement.

bnealey commented 5 years ago

Thanks for exercising caution and pointing out that you did. Seems cool.

Curious about focus. Can haz enlightenburger?

marca-development commented 5 years ago

Can you train this on a list of translated sentences (from English to Japanese, for example) and use it as an AI language translator?

Tophness commented 5 years ago

Isn't this exactly what OpenAI was not supposed to be about? Being closed source and subject to the whims of PR teams and the private incentives of a small number of people?

I've made my own natural language generator that only says things that make sense (and asks itself questions based on its own answers), and it does the same thing. I also got it to believe in a god and break down from anxiety over "should" questions in a really insightful way that would be helpful to a lot of people. It turns out that if you don't internally ask or answer "should" questions at all, it's really hard to get into a social anxiety loop, and you can watch it all break down like: "what if they think [this] > what should I think about the human thinking [this] about me? > idk > I haven't talked to the human because I was thinking about this > what if they think [this] now, is this good or bad? Should I care?" I wouldn't have known that if I had stopped working on it like you guys did.

It's like you're trying to answer the trolley problem as if it were some kind of moral dilemma. Almost none of them are. It's an engineering problem. The velocity and mass of the trolley are not unknown, and there are nine different ways you can stop the train using physics; but if you stand there thinking about whether you should pull the lever or not, you're fucked either way.

Even for fake news, this would be a good tool. If you're going to believe something just because it's possible to say it in the English language, you're a fucking idiot. Check the source. Check peer reviews. If anything, a random blog post generating fake news like this will point out how stupid people are for believing it - and it's much easier to do that here than with mainstream media.

fallenartist commented 5 years ago

Maybe, as in POI, they are still teaching their child to be kind?

chenyangh commented 5 years ago

I respect the decision; "with great power comes great responsibility". But I suggest releasing the 345M model. The reasons are twofold: it is much better than the 117M model but not nearly as good as the 1.5B model, and it has a similar number of parameters to BERT-large-uncased, which makes it a good candidate for comparison.

yzho0907 commented 5 years ago

Maybe the reason this repo exists is the same reason the team should release everything and be 'open', but they are the ones who make the decision anyway. I just hope it turns out well for both the team and us.

max-frai commented 5 years ago

A lot of things can be misused in the wrong hands. But just imagine the positive impact of your technology. What you fear is inevitable anyway.

dackdel commented 5 years ago

https://news.ycombinator.com/item?id=19168712 - please read the horrendous comments. Well, don't read all of them, it gets depressing. But Jesus Christ, you are OPEN ai. Do we really have to spell that out for you? O P E N

sciencemanx commented 5 years ago

Help I need this to help write my 9th grade essays

jensstark commented 5 years ago

Okay. A nonprofit writes an interesting product, which a for-profit could probably recreate and patent. Or am I wrong there?

Why release a teaser only? A shrunk, non-trainable thing, just there to show off?

I admit it: I am suitably impressed, but also seriously annoyed.

"Open" in name only.

lahwran commented 5 years ago

It's worth remembering that OpenAI has in fact been pretty good about releasing the code to their stuff. They've been much more open than DeepMind, which I think was the concern that led to their creation. This seems comparable to responsible disclosure in software security: when an open source group finds a bug in widely-deployed, un-updateable software (e.g. something used in routers) that could be used for large-scale spamming, they'll start work on ways to mitigate it before announcing what the vulnerability is. If someone who works for a FOSS company were to find a really efficient design for building software to do denial-of-service attacks, it'd be a similar story - look for DoS mitigations first.

I'd say it's comparable to the latter situation: OpenAI is worried that they've built a generally useful tool that could make a category of DoS attack much, much worse, and they don't currently see anything preventing that from happening.

I've been thinking that it might be good to get something like GPT-2 1.5B into the hands of Google, Facebook, and a few other major forum operators, maybe Reddit, under a contract to use it for improving moderation. (Edit, to clarify: just giving them early access so they can use it to build safeguards against things like it.) It seems like GPT-2 is good enough to take a serious crack at implementing xkcd's suggestion from nearly ten years ago: who cares if it's a human or a machine? The real question is whether it's malicious content. That proposal as it stands wouldn't help much with fake news, because people lying is a different problem from people doing a denial-of-service via vitriol, but it would make a big impact on a major source of the problem. Or perhaps the AI teams with enough resources could get together and talk about how to use this level of NLP performance to build other types of linguistic DoS mitigations.

I am, for my own curiosity, quite irritated that it's not being released, but I agree that the performance is reasonably worthy of the concern. I just don't see not releasing it as being that useful unless the time until someone replicates it is spent building mitigations for the world that will exist once someone else has a copy.

@WuTheFWasThat I do think y'all could probably release the training code safely, though. It seems to me that the dataset, and the something-like-$40k worth of compute baked into the trained model, are the real interesting things here.

4R7I5T commented 5 years ago

I think you guys are scared of nothing; release the whole model, please.

It's not as if 20,000 people have pulled this repo, so it's really hard to use this 'maliciously'.

Besides, there are alternatives that have produced similar (or better) results - CakeChat, for example: feed it the Reddit corpus (the same one that spooked you) and you'll get some crazy things. But just as you tell a young kid 'it's just a movie' or 'just a game', this is just a computer program. It's not some sci-fi novel come to life.

schwittlick commented 5 years ago

We want the red pill!

iurimatias commented 5 years ago

The resources needed to train the full model are beyond the average person and the small companies that could use it for potentially very interesting non-malicious applications. Meanwhile, the large organizations and state actors most likely to use it for malicious purposes can, and typically already do, have easy access to the resources needed to replicate the full model.

Therefore, by not releasing the full model you are ensuring that this sort of AI tech remains in the hands of the powerful organizations and state actors most likely to misuse it, while unintentionally misleading the general public into thinking this tech is not "really" available yet. Releasing the full model and leveling the playing field is the right thing to do here. Please release the full model.

superjayman commented 5 years ago

So how many other innovations are you guys going to keep closed? Say next week you have an even bigger breakthrough - will the full model then be superseded, seem less harmful, and finally get released? See, it does not make sense. How do you put a limit on unknown capabilities?

gabefair commented 5 years ago

Everyone here could benefit from Nick Bostrom's The Unfinished Fable of the Sparrows as presented in his 2014 book about this subject, Superintelligence: Paths, Dangers, Strategies. Dr. Bostrom is Director of the Future of Humanity Institute at the University of Oxford. https://youtu.be/7rRJ9Ep1Wzs

bhack commented 5 years ago

I think you could at least open a challenge soon, like Google's fake-audio detection challenge, and then release the full model after the community has a detection baseline.

freecode-ai commented 5 years ago

Well, here is what I predict will happen very soon, and why. The thing your software can do will be replicated and released to the whole world within months, maybe even weeks. It will grow just like deepfakes did, and college students will be using it to write their finals in the fall. The media has blasted the fact that you have a new toy and you refuse to share. Now that people know what kind of coverage they can expect for a fully released version, they will not care about consequences. They will get the publicity and the feedback they need to make it even better. From that point forward all the phone apps, DIY personal-assistant devices, and automated blog-post generators will say "powered by [insert company]". Yes, that same company name will be associated with the fake Amazon reviews, but when it comes to business and economics, bad publicity is still publicity.

The upside is that instead of being "encouraged" to address these issues, the government and other agencies will be forced to. This train is coming, and I am afraid that you putting pennies on the track is not going to stop it. Heck, my 15-year-old uses Python in ways that would never have occurred to me. Honestly, I personally couldn't pull this off without a team, but I am sure there are investors out there who see dollar signs in being first. I am sure you have gotten some very interesting e-mails reinforcing that sentiment.

If I were in your position I would reconsider my decision to release the full project, or at least set a date. People tend to be more productive when they are up against the clock. 90 days would certainly be enough time for the big companies to prepare, and more than enough time for governments to educate their patrons about the swarm of "fake news" headed their way. (I read that last sentence back and spit my drink out, hahaha.) Anyway, read my post in your meeting Monday morning and reevaluate your decision. Great job, by the way. It must be awesome to see the results first hand. //This post was written by a human.//

joemillervi commented 5 years ago

Release the kraken!

superjayman commented 5 years ago

> Well, here is what I predict will happen very soon, and why. The thing your software can do will be replicated and released to the whole world within months, maybe even weeks. […]

100 percent agreed! It will not be long before this is replicated anyway.

clintonm9 commented 5 years ago

Is it sad that I wrote a program to check each https://storage.googleapis.com/gpt-2/models/* directory from 1 to 999 for the letters M, G, & T? The only thing it found was 117M ;)
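
A probe like that takes only a few lines. Something like this minimal sketch would do it, assuming (as is true of the released 117M model, and is only my assumption for any unreleased ones) that each model directory would contain an hparams.json file:

```python
# Probe https://storage.googleapis.com/gpt-2/models/<N><unit>/ for every
# combination of N in 1..999 and unit in M/G/T, and report which exist.
import urllib.error
import urllib.request

BASE = "https://storage.googleapis.com/gpt-2/models"

def model_exists(name):
    # HEAD request: we only care whether the object exists, not its contents.
    req = urllib.request.Request(f"{BASE}/{name}/hparams.json", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10):
            return True
    except urllib.error.HTTPError:
        return False  # 404 (or 403): no such model directory

found = [f"{n}{unit}"
         for unit in ("M", "G", "T")
         for n in range(1, 1000)
         if model_exists(f"{n}{unit}")]
print(found)  # as of this thread: ['117M']
```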

rjkmelb commented 5 years ago

I've never placed opinion into a github issue before.

Please release the model and the code for the means to train it.

The open-source AI community is the only hope we have of balancing the interests of Google against those of the rest of us. Someone will take this paper and build their own replica, and OpenAI will be remembered as an organisation that completely contradicted its name.

For us to be able to move forward with open-source AI we need companies like OpenAI to remain true to their mission; without that, you might as well be Google.

freecode-ai commented 5 years ago

By the way, if you came here just because you are interested in the concept and possibly wanted to train your "own" AI bot: here is a good source of big data to pull from. It's every Reddit post ever: http://files.pushshift.io/reddit/comments/ And if you have no idea what you are doing but are interested in learning things like this, here is an older YouTube series that shows you how you could use that data set (it's not GPT-2 level, but it will get you started): https://youtu.be/dvOnYLDg8_Y Now you just need a little compute and a lot of time. Good luck.
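
If it helps, the dumps are just compressed newline-delimited JSON, one comment per line. Here's a minimal sketch for streaming one of the earlier monthly files - it assumes a bzip2-compressed dump such as RC_2015-01.bz2 (later months switched to xz and zstd, so you'd swap the decompressor accordingly):

```python
# Stream a Pushshift monthly comment dump without unpacking it to disk.
import bz2
import json

def iter_comments(path):
    """Yield one Reddit comment dict per line of the compressed dump."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: print the first 80 characters of every well-upvoted comment.
for comment in iter_comments("RC_2015-01.bz2"):
    if comment.get("score", 0) >= 100:
        print(comment["body"][:80])
```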

pushshift commented 5 years ago

I am the maintainer of Pushshift.io. The files are generally updated monthly. I also have a few other datasets (all of Stack Overflow and Hacker News are archived as well). Let me know if you have any suggestions on how to make the data more easily accessible, etc. I am currently working on gathering tweets from all verified Twitter accounts (the total dataset is 2-3 billion tweets and contains data for every verified account - around 350,000 accounts). That data is available for research purposes. Feel free to ping me at jason@pushshift.io if you need any other data that may help the project.

Good luck!

yet-another-account commented 5 years ago

I am currently using Pushshift to get a list of URLs from Reddit matching the description in the paper, to build my own version of WebText. Unfortunately, due to copyright restrictions, I don't think it would be possible to host the actual data from the pages constituting WebText. What would be the best way to get around this limitation?
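
For reference, the paper describes WebText as the text of outbound links from Reddit posts that received at least 3 karma. A rough sketch of that URL-collection step over a Pushshift submissions dump (the karma threshold is the one the paper states; the field names follow the Pushshift schema, and I'm assuming the dump has been decompressed to plain newline-delimited JSON - see the streaming reader above otherwise):

```python
# Collect candidate WebText URLs: outbound links from Reddit submissions
# with score >= 3, skipping self posts and links back into Reddit itself.
import json

def collect_webtext_urls(dump_path, min_karma=3):
    urls = set()
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            post = json.loads(line)
            url = post.get("url", "")
            if post.get("is_self") or "reddit.com" in url:
                continue
            if post.get("score", 0) >= min_karma and url.startswith("http"):
                urls.add(url)
    return urls
```

Sharing just the URL list and letting others re-scrape the pages themselves is probably the safest workaround, though I'm not a lawyer.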

qnkhuat commented 5 years ago

@eukaryote31 What do you do with the data you crawl? This repo doesn't include code for us to train the model.

brainmaniac commented 5 years ago

I respect and support your decision not to release, mostly because y'all know more than me and therefore make much better-founded decisions in this area. Hence my complete compliance with whatever decision you make.

BUT. If opening it all up would make progress faster... maybe YOLO.

yet-another-account commented 5 years ago

@qnkhuat I'm hoping that this data will be useful as a baseline for other researchers.

ZeroCool940711 commented 5 years ago

@WuTheFWasThat With all respect, I think the fear of malicious use, and the other things people say every time the topic of AI comes up, is misguided. These things should not be kept closed or limited to only some people. Think about HUMAN INTELLIGENCE: there are people in the world with a really high IQ, and some of them use their knowledge for bad purposes; the only way to stop them is to have another person (or several) with the same or higher IQ fight against them. Now apply the same thing to AI. AI should not be feared as long as there is more than one AI capable of doing the same thing. If there is only one AI that can be used for some bad purpose and you release it to only a few people, you are making it even easier to use for bad purposes; it's not a matter of IF people will use it maliciously but WHEN. If, on the other hand, you release it so everyone can use it, you are guaranteed a way to fight back in the worst-case scenario, and there will always be people who find ways to make things safer. By limiting who can use these things, you are playing a dangerous race over who can get their hands on them first. Remember: in every movie, and everywhere else, the good guys always have the losing hand, because the bad guys don't care about the rules, and it's easier to break things than to fix them.

bitnom commented 5 years ago

That this algorithm has been hobbled is outrageous. It goes against everything OpenAI was supposed to be about. I've been working with a BERT variant and I could really use GPT-2, certainly not for some far-fetched nefarious purpose. This model is good, but it's not so good that it's going to disrupt the Internet, as has been suggested. It's going to generate ridiculous information anyone can fact-check. This feels like yet another attempt at deciding what people are capable of consuming.

Let it happen. This is not the algorithm to draw a line in front of. The benefits outweigh the potential cost in this case.

jprester commented 5 years ago

As someone who agrees with the sentiments expressed in this SlateStar post, I feel the need to commend OpenAI for behaving so responsibly in this case.

While I don't think anyone believes these algorithms are as dangerous as true ASI, it is a good idea to start practicing limiting the transparency of potentially dangerous research, to prepare for a future where this will be a necessity.

With that said... there are a bunch of smart AI researchers in the world, and everything is progressing so fast that I am not sure how long this will remain the state of the art. Anyway... continue with the good work.

jensstark commented 5 years ago

> As someone who agrees with the sentiments expressed in this SlateStar post, I feel the need to commend OpenAI for behaving so responsibly in this case.

The problem is not: "What will others do with it?"

There is a phenomenon in technology: things come into existence when the time for them is right. We had loads of people simultaneously inventing computers, light bulbs, phones, and so on, in many countries. Who invented the phone? That depends. Edison? Reis? It all depends on which country you live in.

Once you KNOW it is possible to build a light bulb, it is much easier to build your own one - especially if you have a smaller model to play with, one that has a lot of the technical details built in. Reverse engineering saves a lot of engineering effort…

NOT releasing the full model will work - for a couple of months. Then corporates will have their own, similar models; Alphabet could throw a couple hundred developers at it, if they haven't already. The same applies to a number of other IT companies, and even government agencies. The OpenAI model will become irrelevant at that point at the latest, with the standard being a closed-source, proprietary thingie.

The small model has been published, details of the large model have been shared. Any party with enough human capital can work from there.

And yes, I wish the label "Open" were restricted to entities which are open. I worked for a company which decided to sell an "open" interface to their products. I worked on Linux drivers at the time and found their approach anything but open. Open products need no license keys; science can build on them because everything is easily available and documented. That company's approach to "Open" was shameful - and they even ended up with closed-source drivers, and a security feature which a university reverse engineered because they needed an open (!) solution for it.

What OpenAI does right now is bragging. "Look at the cool, shiny toy! You cannot have it, because we decided that you should not be able to do so."

Does it help research? Or anything but the developers' egos?

I might be biased. I do not expect people to share my opinion, and I do not want to become a second RMS - though my heroes are the likes of RMS and Linus Torvalds.

But I hate "Open" being reduced to a marketing term instead of a philosophy.

jensstark commented 5 years ago

Let me refine my position a little. The big model contains both the software bit for training and running and the results of training.

Releasing the pre-trained big model would be fun, but even if CS can be fun, that would not be the main reason to release everything.

I could live well with the software part being published - it would keep people from doing bad stuff unless they were willing to put loads of effort in. Even the small model, complete with the ability to train it, would be worth more than what has been released, even if there was no pre-trained system involved.

jamfor352 commented 5 years ago

Honestly, the biggest issue is that now we know it is possible, large private companies will be throwing as much weight as they can behind this to create their own proprietary versions. Once one is successful, it'll get out there either way. So something like this is bound to exist - the question boils down to "will it be open source?" OpenAI can ensure that it is, by releasing it.

It really is that simple.

yunjiangster commented 5 years ago

Releasing the second-largest model makes a lot more sense, since it's presumably already on par with BERT, so people would have more incentive to use it in real work.

bitnom commented 5 years ago

> Let me refine my position a little. The big model contains both the software bit for training and running and the results of training.
>
> Releasing the pre-trained big model would be fun, but even if CS can be fun, that would not be the main reason to release everything.
>
> I could live well with the software part being published - it would keep people from doing bad stuff unless they were willing to put loads of effort in. Even the small model, complete with the ability to train it, would be worth more than what has been released, even if there was no pre-trained system involved.

It's not so much that it would be fun; it's that the full model is really useful as trained, and training anything close to it would be very, very expensive. Thank goodness the BERT model came with its full pretrained weights, or it would have cost something like $40,000, if I'm remembering that correctly.

Serkan-devel commented 5 years ago

(((Open)))AI™

KoolenDasheppi commented 5 years ago

OpenAI, the most open AI company to ever be created. /s

yunjiangster commented 5 years ago

Another short-term suggestion: set up a rate-limited public demo server to reproduce the unicorn example.

bitnom commented 5 years ago

It's also worth noting that text generated by the full model is detected as fake by Fakebox, so this isn't even a case of "let's let the detection technology catch up". I'm really starting to think this withholding was to garner headlines. Who thinks the world isn't ready for unbelievable/nonsensical AI articles? We already had those. The only difference now is that the output reads as more coherent; that doesn't make it any more believable. I fail to see the danger.

Serkan-devel commented 5 years ago

Is there any other AI research organization that is more open-source than ((open))™?

Maybe it's easier to develop from there instead.