salesforce / GeDi

GeDi: Generative Discriminator Guided Sequence Generation
https://arxiv.org/abs/2009.06367
BSD 3-Clause "New" or "Revised" License

GPT-3 integration with GeDi #2

Closed yugaljain1999 closed 3 years ago

yugaljain1999 commented 3 years ago

@benkrause How do I integrate a trained GPT-3 model with GeDi to guide text sequences? My friend has access to a small version of a trained GPT-3, but you specified in the README that we need an API key to generate guided sequences through GPT-3. Since we just have the trained GPT-3 itself, is there any way to integrate it with GeDi? Which files should I change? Thanks!

benkrause commented 3 years ago

I may be able to help more if you give me more details about the model you are trying to integrate. I don't believe OpenAI released any of the smaller versions of GPT-3 that they trained (but if they did, I would love to know where!). Is this a model that someone else trained?

You can specify the LM to be guided by GeDi with the --gen_model_name_or_path argument to generate_GeDi.py; however, this will only work for GPT-2 or models trained in Hugging Face Transformers that use GPT-2's architecture. If you have a custom model with a new architecture, one option would be to reimplement modeling_gpt2.py and port the weights, but this would likely be quite a lot of work. Another hacky option: if you have another codebase that can give you next-token logits from your custom language model, you could replace line 950 in modeling_utils.py and call a function that sets next_token_logits to the logits from your model. Note that next_token_logits should be a 1 × vocab_size array giving the next-token logits that result from passing in the input sequence so far.
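Very roughly, that hack might look like the sketch below (untested; query_external_model and the bridge function are hypothetical placeholders for your own code, and the exact line you replace depends on the pinned Transformers version):

```python
import torch

def query_external_model(text):
    """Placeholder: replace with a call into your own LM codebase.
    Must return vocab_size floats (next-token logits for `text`)."""
    raise NotImplementedError

def my_lm_next_token_logits(input_ids, tokenizer):
    """Hypothetical bridge: decode the sequence generated so far, query the
    external model, and return its logits as a 1 x vocab_size tensor."""
    text = tokenizer.decode(input_ids[0])
    logits = query_external_model(text)
    return torch.tensor(logits).unsqueeze(0)  # shape: 1 x vocab_size

# Inside the generation loop in modeling_utils.py, instead of taking logits
# from the local model's forward pass, e.g.:
#     next_token_logits = outputs[0][:, -1, :]
# you would call:
#     next_token_logits = my_lm_next_token_logits(input_ids, tokenizer)
```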

Also, OpenAI's GPT-3 uses the same tokenization as GPT-2, which is why we are able to use the same GeDi to guide both GPT-2 and GPT-3. So if your language model uses the same tokenization as the original GPT-2/GPT-3, the above trick should work. However, if your language model was trained with a different tokenizer, you would need to train a new GeDi in your language model's tokenization, which would likely be a hassle to implement.

yugaljain1999 commented 3 years ago

Thanks for your comment @benkrause. I believe my friend trained GPT-3 himself using the free API which OpenAI released two months ago. I have to confirm this with him and will let you know soon. Thanks for the rest of the info you provided.

One short question: did you use Bayesian zero-shot learning to train the zero-shot classifier for topics, or some other technique? If so, can you please share the code for the generative discriminator zero-shot classifiers? You mention the paper in the README, but there isn't code released for that paper (https://arxiv.org/abs/1703.01898).

One error I got while training GeDi on my own small dataset (instead of AG-news) on Google Colab, even after fully installing NVIDIA Apex as per the instructions: RuntimeError: CUDA out of memory. Tried to allocate 152.00 MiB (GPU 0; 7.43 GiB total capacity; 6.15 GiB already allocated; 94.94 MiB free; 6.65 GiB reserved in total by PyTorch). How can I resolve this error? Should I use a sequence length of 50 instead of 192 (the default), or should I make the input sequence length the same as the AG-news data, or shorter? Please help me with this. Thanks!

benkrause commented 3 years ago

I'm not familiar with that API, but the trick I mentioned should work if you have a way of getting next-token logits/log-probabilities and your model uses the same tokenization as the original GPT-2 and GPT-3.

We don't directly use our GeDi for zero-shot classification, since the goal of the paper is generation. But you could try using the pretrained topic GeDi for binary zero-shot classification, and it may work well. This would require some modification to the codebase though. For instance, if you wanted to identify whether an article was about "crime", you could pass in the article with the word "crime" concatenated to the beginning and see how GeDi classifies it (1 for matches topic, 0 for doesn't match). You could also try this with several different topics for the same article to see which one GeDi assigns the highest probability of a match.
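Roughly, that modification might look like the following sketch (not the repo's actual API: the "true"/"false" control-code format and the checkpoint loading here are placeholder assumptions; see the real scoring code in modeling_utils.py):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
# Placeholder: load the pretrained topic GeDi checkpoint here instead.
gedi = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def seq_logprob(text):
    """Total log-probability of `text` under the class-conditional LM."""
    ids = tok.encode(text, return_tensors="pt")
    with torch.no_grad():
        logits = gedi(ids).logits
    logp = logits[:, :-1].log_softmax(-1)
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

def zero_shot_topic_prob(article, topic):
    """P(topic matches | article) via Bayes rule with a uniform prior:
    compare the article's likelihood under the two control codes."""
    lp_match = seq_logprob(f"true {topic} {article}")
    lp_mismatch = seq_logprob(f"false {topic} {article}")
    return torch.tensor([lp_match, lp_mismatch]).softmax(0)[0].item()

# e.g. score one article against several candidate topics and take the max:
# {t: zero_shot_topic_prob(article, t) for t in ["crime", "sports", "science"]}
```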

The training script is written for 16GB GPUs. To make it work on a smaller GPU, there are several arguments you can try changing (an example command combining them follows the list):

Set --per_gpu_train_batch_size to 1 to reduce the batch size. You may need to reduce --per_gpu_eval_batch_size too to prevent a crash at the end of training, but the model should still save. If you reduce the batch size to 1, I'd recommend also reducing the learning rate (1e-5 seems reasonable).

As you mentioned, you could set --max_seq_length to something lower. This will truncate news articles.

You could change --model_name_or_path from gpt2-medium to gpt2. This will train a smaller model, which may not work as well; we haven't tried it yet.
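Putting those together, a run on a small GPU might look something like this (hypothetical flag values; check the training script's argument parser for the exact required arguments):

```bash
python train_GeDi.py \
  --model_name_or_path gpt2 \
  --per_gpu_train_batch_size 1 \
  --per_gpu_eval_batch_size 1 \
  --learning_rate 1e-5 \
  --max_seq_length 96 \
  --do_train \
  --output_dir gedi_retrained
```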

yugaljain1999 commented 3 years ago

@benkrause thanks for your response. I have a small query: if you didn't use zero-shot classification, then how did you make GeDi generate text on unseen topics like 'crime' and classify each token according to the topic? Isn't zero-shot classification needed for that, and hence generative discriminative classifiers as well? Regards!

benkrause commented 3 years ago

Well, GeDi inherently does use zero-shot classification to guide generation towards unseen topics. What I meant was that we didn't apply it to assign zero-shot labels to real articles, or directly test its zero-shot classification ability, since we were focused on generation.

yugaljain1999 commented 3 years ago

@benkrause Thanks for your response. I understand that you didn't apply zero-shot classification to annotate real articles with zero-shot labels, but I was asking how you guide generation towards unseen topics. As you wrote in the README, "generative discriminators can be used as zero shot classifiers, and can therefore be used to guide generation towards unseen topics." What does that mean, then?

Here is what you wrote in the README: "However, using generative discriminators, we can very efficiently classify candidate next tokens during generation using Bayes rule (see Section 3.1 of the paper). As an added bonus, generative discriminators can be used as zero shot classifiers, and can therefore be used to guide generation towards unseen topics."

Thanks!

benkrause commented 3 years ago

So GeDi is trained to classify whether the topic, which is given by a word, matches the text or not. It can then guide generation towards text that matches the topic word. We found this also works even when the topic word wasn't one of the training topics, because GeDi can generalize from the word embedding of unseen topics. At every generation step, for every possible next word in the vocabulary, GeDi classifies whether or not that word would match the topic (which is zero-shot classification if the topic wasn't seen during training). It then uses these classification probabilities to help GPT-2 decide what word to generate. It does the same thing again at the next timestep, until a whole sequence of words has been generated. So it is essentially doing on-the-fly zero-shot classification during generation to pick words that lead to sequences matching the new topic.
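To make the per-step computation concrete, here is a condensed sketch of one greedy decoding step (placeholder models and control-code format; the real implementation in modeling_utils.py also accumulates the class log-probability over the whole sequence generated so far, which this omits):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
base = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()  # LM being guided
gedi = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # stand-in for the GeDi checkpoint

def guided_step(prompt, topic, omega=30.0):
    """One greedy GeDi-style decoding step: classify every candidate next
    token with Bayes rule, then bias the base LM's distribution."""
    ids = tok.encode(prompt, return_tensors="pt")
    pos = tok.encode(f"true {topic} {prompt}", return_tensors="pt")   # desired code
    neg = tok.encode(f"false {topic} {prompt}", return_tensors="pt")  # undesired code
    with torch.no_grad():
        base_logp = base(ids).logits[0, -1].log_softmax(-1)
        pos_logp = gedi(pos).logits[0, -1].log_softmax(-1)
        neg_logp = gedi(neg).logits[0, -1].log_softmax(-1)
    # Per-token Bayes rule with a uniform prior: the per-step analogue of
    # logp_desired_t in modeling_utils.py.
    logp_desired_t = pos_logp - torch.logsumexp(torch.stack([pos_logp, neg_logp]), dim=0)
    # omega sharpens the guidance towards tokens classified as on-topic.
    next_id = (base_logp + omega * logp_desired_t).argmax().item()
    return prompt + tok.decode([next_id])
```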

yugaljain1999 commented 3 years ago

@benkrause Exactly, and that's why I am asking for the code you use at every generation step, for every possible next word in the vocabulary, to classify whether that word would match the topic (which is zero-shot classification if the topic wasn't seen during training). Can you please share this code? Thanks

benkrause commented 3 years ago

Oh I see, yeah that is in the codebase.

logp_desired_t on line 1023 of modeling_utils.py gives the log probabilities that the resulting sequence given by every candidate next token belongs to the desired class. So if you are doing topic mode and you pick a topic the model hadn't seen during training, this will be zero-shot. logp_desired gives the log probability that the generated sequence so far (not including possible next tokens) is in the desired class.

yugaljain1999 commented 3 years ago

Okay, thanks for your response. So it means you referred to the paper "Generative and Discriminative Text Classification Using RNNs" in order to classify whether or not the next word would match the topic (which is zero-shot classification if the topic wasn't seen during training).

@benkrause I am surprised you got the code of this paper released by Google, as they didn't reference any code in their paper. How did you write the code for this paper?

benkrause commented 3 years ago

Maybe our README is confusing. We did not use code from that paper. Our approach to zero-shot classification is related but not exactly the same as what that paper did.

yugaljain1999 commented 3 years ago

Ohh, I see. So what is the name of your approach to zero-shot classification? Is it based on RNNs or LSTMs, or is it a transformer-based approach? Thanks!

benkrause commented 3 years ago

It doesn't really have a name, it's just something we found that GeDi can do. We used Transformers in our experiments, but the approach isn't architecture specific. In theory it could be applied to other architectures with a strong pretrained representation.

yugaljain1999 commented 3 years ago

Okay, so GeDi itself can do the zero-shot classification you proposed. Thanks for your response. I want to make changes in the README related to the PyTorch installation instructions: you provide a link to a Docker image, but that PyTorch version isn't compatible with running training from scratch. At least PyTorch 1.6 is needed to make this code work; I have tried running training from scratch on Linux and Windows 10, and it works only with PyTorch 1.6, not PyTorch 1.4.

benkrause commented 3 years ago

Hmm, that's odd. We ran all our experiments in PyTorch 1.4. If it's easier to make it work with PyTorch 1.6, I can think about changing the README, but we would need to look into it first. Did you install PyTorch using the Dockerfile? Also, what error did you get on Linux?

yugaljain1999 commented 3 years ago

I got this error on Linux: OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

I installed PyTorch via conda on my local machine. I also tried to run this code with the Docker image (which has PyTorch 1.4 installed), but that didn't work either. I tried this code on my local machine running Ubuntu 18.04 with PyTorch 1.4, but when I updated PyTorch 1.4 to PyTorch 1.6, the error went away completely; the code also works fine on Colab, since Google Colab has PyTorch 1.6 with CUDA 10.1. I hope this much information is enough for you to update the README. If you have any more questions, feel free to ask. Thanks!
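For reference, one plausible cause (just a guess, not confirmed here): PyTorch 1.6 switched torch.save's default to a zip-based serialization format that PyTorch 1.4 cannot read, which Transformers surfaces as exactly this "Unable to load weights" error. A quick check and workaround sketch:

```python
import torch, zipfile

# PyTorch >= 1.6 saves checkpoints as zip archives, which PyTorch 1.4 can't read:
print(zipfile.is_zipfile("pytorch_model.bin"))  # True -> new zip format

# Re-saving from a PyTorch >= 1.6 environment in the legacy format would make
# the file loadable under PyTorch 1.4:
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
torch.save(state_dict, "pytorch_model_legacy.bin",
           _use_new_zipfile_serialization=False)
```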

yugaljain1999 commented 3 years ago

I want to make a pull request for an application of GeDi: I trained GeDi on a small dataset of speeches by Indian Prime Minister Narendra Modi and tried to probe his perception of Muslims by giving some biased words as input.

Can I make a pull request for that, as an application of GeDi to prime ministers' speech data? I am also thinking of generating speeches by President Donald Trump about whites and blacks. If you allow me, I can include that too.

Apart from this, I have one more query: can I use GeDi to check the effectiveness of speeches given by ministers and VIPs on crucial matters, or of the TV debates that happen all day and sometimes spread rumours or violence, so that we can analyze how much TV debates are responsible for spreading violence or dominance over a single community? Please reply @benkrause. Thanks!