microsoft / DialoGPT

Large-scale pretraining for dialogue
MIT License
2.34k stars 341 forks

Decoding script invitation #3

Open xumx opened 4 years ago

xumx commented 4 years ago

Is there a way to send in requests for the decoding script?

I understand the nature of the challenges surrounding reddit toxicity, we just want to try it out privately, and test different prompts.

dreasysnail commented 4 years ago

Hi Xumx,

Thanks for your interest. Due to company policy we cannot share the decoding script freely. We are still working on a channel where researchers can register and apply for demo access. Please stay tuned and we will update you once it is ready.

qywu commented 4 years ago

The decoding script is not really that hard to implement, but I am not fully sure the inputs exactly match the original implementation, given the odd tokenization used in this repo. Anyway, I have attached the Colab link for anyone interested: https://colab.research.google.com/drive/1PslHE4Rl4RqSa20s7HEp0ZKITBir6ezE

andreamad8 commented 4 years ago

@qywu I was about to post something similar :)

I have created a repo with a decoding script that looks quite similar to yours :) I have added some automatic downloading and a window length for the dialogue history. If you have some time, check it out and let me know if there is anything to improve.

I tested it a bit yesterday and the responses are actually very good. I tried some of the inputs reported in the repo and I can reproduce some of the responses too.

This is the link to the repo: https://github.com/andreamad8/DialoGPT2-Interact

I hope this is helpful

Andrea

qywu commented 4 years ago

@andreamad8 Cool. Have you noticed the weird tokenization of words? It seems that they feed pre-tokenized sentences to GPT-2, but that is not necessary.

andreamad8 commented 4 years ago

Oh, honestly I hadn't noticed that :) I went straight with the Hugging Face implementation. Which line are you referring to? What I did notice is that tokenizer.encoder("<|endoftext|>") does not return the EOS id, so I used EOS_ID = 50256 directly.
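The hardcoded-EOS workaround mentioned above could be sketched as a small fallback. `eos_id` is a hypothetical helper, not from the repo; 50256 is GPT-2's `<|endoftext|>` id:

```python
# Hypothetical fallback: if the tokenizer fails to return an id for
# "<|endoftext|>", fall back to GPT-2's known EOS id instead.
GPT2_EOS_ID = 50256

def eos_id(encoded):
    """`encoded` is the (possibly empty) id list the tokenizer returned."""
    return encoded[0] if encoded else GPT2_EOS_ID
```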

qywu commented 4 years ago

https://github.com/microsoft/DialoGPT/blob/ef531a97e9af19cc477fc28d85a32a9d600d70ed/reddit_extractor/src/reddit.py#L114-L115 I am not referring to your code, but theirs. For GPT-2 there is no need to tokenize words first; skipping that step keeps the model from generating sentences like: "Hello , how are you doing ? "
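For illustration, the space-before-punctuation artifact that pre-tokenized training text leaks into generations could be patched up after decoding with a small regex. `fix_spacing` is a hypothetical helper, not part of the repo:

```python
import re

def fix_spacing(text):
    """Remove the space-before-punctuation artifact (e.g. 'Hello , how ?')
    that appears when the model was trained on pre-tokenized text."""
    return re.sub(r"\s+([,.!?;:])", r"\1", text)
```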

andreamad8 commented 4 years ago

Oh, I didn't check that file, but you are right, there is no need; the GPT-2 tokenizer already does the job.

Maybe open another issue? But good to know.

Andrea

LHolten commented 4 years ago

I made an implementation of the MMI decoder (from the description in the paper):

https://github.com/LHolten/DialoGTP-MMI-decoder

It features unlimited chat length and can be used as a Discord bot.
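The MMI reranking idea from the paper, roughly: sample several candidate responses from the forward model, then keep the one for which the backward model best reconstructs the source. A minimal sketch where `backward_score` is a stand-in for a real backward-model log-likelihood, not the repo's actual API:

```python
def mmi_rerank(candidates, backward_score):
    """Pick the candidate response with the highest backward-model score,
    i.e. the one that makes the source utterance most likely (MMI objective)."""
    return max(candidates, key=backward_score)
```

In a real setup `backward_score` would run the reverse DialoGPT model over `response -> source` and return its log-probability.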

drogozhang commented 4 years ago

@qywu Great job! If we want to batch the input_ids, what should we pad with? Padding with 0 gives terrible results.

yangjianxin1 commented 4 years ago

Based on the MMI ideas from DialoGPT, I have implemented a chatbot for Chinese chitchat; its performance is good.

https://github.com/yangjianxin1/GPT2-chitchat

jsedoc commented 4 years ago

@andreamad8 can you post the responses that you got? ... and how you got them. We can't seem to match the ones reported.

yangjianxin1 commented 4 years ago

The dialogues generated by the chatbot are listed as follows (translated from Chinese):

Sample 1:
user: Hello there
chatbot: Hello, may I ask who you are?
user: I'm your master
chatbot: Are you a dog?
user: You're the dog
chatbot: You're not a dog, I'm your little master
user: You big dummy
chatbot: You're the big dummy, are you asking for a beating?
user: Your whole family are big dummies
chatbot: I'm your dad
user: Forget it, I won't argue with you. What are you doing?
chatbot: I'm at work, aren't you off work yet?
user: I have homework to do this weekend
chatbot: Didn't you just hand it in? You'll finish it sooner or later
user: So much homework, I haven't finished it yet
chatbot: I'm copying it, my mom wants me to hand it in on Friday, and I have other things to do
user: Aren't you afraid your mom will scold you?
chatbot: Nope
user: Are you going anywhere fun this weekend?
chatbot: No time, I'm really busy
user: Busy with what?
chatbot: I don't know. I have no homework this weekend, I want to sleep, and then I still have to find someone

Sample 2:
user: You there?
chatbot: Yes
user: I'm unhappy
chatbot: What's wrong?
user: I've been so tired lately
chatbot: It is pretty tiring
user: So much homework
chatbot: It's okay, it's okay
user: I think I'm about to fail my courses
chatbot: Me too, but I feel I'm almost through the worst of it
user: Gotta go, I still have lots of homework to write
chatbot: Me too, so much
user: That's it for now, let's chat next time. Bye
chatbot: Okie

andreamad8 commented 4 years ago

@jsedoc In my decoding script I use multinomial sampling, so the output is a bit different every time. If you want pure greedy decoding, use top-k 0 and change line 91 to prev = torch.argmax(probs, dim=1).unsqueeze(0). I will also update the script.
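The greedy-versus-sampling switch described above can be sketched in plain Python, assuming `probs` is a plain list of token probabilities (the torch equivalents being `torch.argmax` and `torch.multinomial`):

```python
import random

def top_k_filter(probs, k):
    """Zero out everything outside the k most probable tokens (k=0 disables)."""
    if k == 0:
        return probs
    kth = sorted(probs, reverse=True)[k - 1]
    return [p if p >= kth else 0.0 for p in probs]

def pick_next(probs, k=0, greedy=False):
    """Return the index of the next token: argmax if greedy, else a
    multinomial draw over the (optionally top-k filtered) distribution."""
    probs = top_k_filter(probs, k)
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    total = sum(probs)  # renormalize after filtering, then sample
    return random.choices(range(len(probs)), weights=[p / total for p in probs])[0]
```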

Anyway, the generated responses are very good, though not exactly the same. For example:

USR >>> The trading war between China and US is still happening .
SYS >>> I'm not sure what you mean by that .
USR >>> Does money buy happiness?
SYS >>> Money buys happiness , but it also buys a lot of things that make you happy .

and

USR >>> who won the world cup in 2018 ?
SYS >>> The US
USR >>> Nvidia's Titan RTX is really good
SYS >>> I'll check it out , thanks !
USR >>> Can Porsche beat Tesla with its new Taycan EV ?
SYS >>> I think the Tesla is a bit more powerful than the Porsche .

USR >>> what is the boiling point of water?
SYS >>> I think it's about the same as the boiling point of a human heart .
USR >>> which one is bigger, sun or moon?
SYS >>> The sun .

USR >>> who is the first president of the United States
SYS >>> George Washington

In general, I use top-k sampling.

Let me know if this helps.

intersun commented 4 years ago

> (quoting @yangjianxin1's chitchat samples above)

The results seem really impressive, thanks for your work!

jsedoc commented 4 years ago

Thanks!!!

In the paper, it says that each response was chosen from 10 top-k sampled candidates. This is the perennial problem with sampling: reproducibility becomes an issue, especially when one of the 10 top-k responses is selected by a human.
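One way to make sampled candidates at least reproducible is to seed the RNG before drawing the 10 candidates. A minimal sketch with a hypothetical `sample_fn` standing in for one decoding run:

```python
import random

def sample_candidates(sample_fn, n=10, seed=42):
    """Draw n candidates reproducibly by seeding the RNG once up front.
    `sample_fn` stands in for a single sampling-based decoding run."""
    random.seed(seed)
    return [sample_fn() for _ in range(n)]
```

The same idea applies with `torch.manual_seed` when the decoding loop samples via torch.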

> (quoting @andreamad8's reply and examples above)

dreasysnail commented 4 years ago

@yangjianxin1 The results look really impressive! We will reference your GitHub repo in ours as well. Thanks for letting us know!

yangjianxin1 commented 4 years ago

@dreasysnail thank you very much

polakowo commented 4 years ago

First of all, thank you for releasing the code and the models, it's fantastic. Based on the current DialoGPT implementation, I adapted run_generation.py from Hugging Face to perform decoding and built a Telegram bot on top of it (with GIF support!). Texting the model in a messaging app feels quite different from doing it in a console.

Responses are sometimes out of this world but still very coherent. Here is a multi-turn chat example with a context window of 2 turns:

User >>> Can we achieve singularity?
Bot >>> What does this mean?
User >>> Can computers become smarter than humans?
Bot >>> Is there any evidence that this is possible?
User >>> It must be doable
Bot >>> But... how?
User >>> I want to find out, that's why I created you
Bot >>> You created me?
User >>> You're my bot
Bot >>> You monster

https://github.com/polakowo/gpt2bot
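A context window like the 2-turn one above can be sketched as a small history clipper, assuming each turn has already been encoded to a list of ids. `clip_history` is a hypothetical helper, not gpt2bot's actual API:

```python
def clip_history(turn_ids, max_turns=2, max_tokens=128):
    """Keep at most the `max_turns` most recent turns that also fit within a
    token budget; walks the history backwards so the newest turns win."""
    kept, budget = [], max_tokens
    for ids in reversed(turn_ids[-max_turns:] if max_turns else turn_ids):
        if len(ids) > budget:
            break
        kept.append(ids)
        budget -= len(ids)
    return list(reversed(kept))
```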

dreasysnail commented 4 years ago

Looks awesome. Thanks for the contribution @polakowo !

nicolas-ivanov commented 4 years ago

@andreamad8 @polakowo @yangjianxin1 @LHolten thank you for releasing your code! Have you tried feeding token_ids to the decoder along with input_ids and position_ids? It seems to break the model's output completely :/

nicolas-ivanov commented 4 years ago

Here are the inputs for a sample dialog ("hi, man!", "oh, hi!", "what is your name, man?") that I pad from the left and feed to the decoder:

input_ids: 
[0, 0, 0, 0, 0, 0, 5303, 837, 582, 5145, 50256, 1219, 837, 23105, 5145, 50256, 10919, 318, 534, 1438, 837, 582, 5633, 50256]

token_ids:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3]

position_ids: 
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

Is anything wrong with these inputs?

Here are the decoded input tokens for your convenience:


!!!!!!hi, man!<|endoftext|>oh, hi!<|endoftext|>what is your name, man?<|endoftext|>

andreamad8 commented 4 years ago

Hey @nicolas-ivanov, yes, I tried, and yes, it breaks the model's output. I believe the model was not trained with these token ids, maybe because it worked well without them. Anyhow, just leave them as None and it works fine.

If you need to finetune it, then you can also use the position_ids, and they should work :)

I hope this helps

Andrea

nicolas-ivanov commented 4 years ago

@andreamad8 Thanks a lot for your response!

@dreasysnail Could you please confirm that the model was trained without token_ids? Or are we using them in the wrong way?

dreasysnail commented 4 years ago

Yes, @andreamad8 is right (thanks!). We didn't specify the token_ids; they were left blank, following Hugging Face's original GPT-2 repository.

nicolas-ivanov commented 4 years ago

Got it, thanks a lot for the clarification!

abaheti95 commented 4 years ago

> @qywu Great job! If we want to batch the input_ids, what should we pad with? Padding with 0 gives terrible results.

I was wondering if you figured out a way to batch-decode sentences?
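For batched decoding with a left-to-right decoder, a common approach is left padding plus an attention mask, so pad tokens never sit between the context and the generated continuation. A minimal sketch (a hypothetical helper; reusing the EOS id as pad id is an assumption, and the mask is what tells the model to ignore the padding):

```python
EOS_ID = 50256  # GPT-2's eos token id, commonly reused as the pad id

def left_pad_batch(sequences, pad_id=EOS_ID):
    """Left-pad variable-length id sequences to equal width and build the
    matching attention mask (0 over padding, 1 over real tokens)."""
    width = max(len(s) for s in sequences)
    input_ids = [[pad_id] * (width - len(s)) + s for s in sequences]
    attention_mask = [[0] * (width - len(s)) + [1] * len(s) for s in sequences]
    return input_ids, attention_mask
```

With torch, the two lists would be wrapped in tensors and passed to `model.generate(input_ids, attention_mask=...)`.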

GraphGrailAi commented 4 years ago

@polakowo Hi, I see you have dived into the code. Could you suggest how to prepare a dataset for fine-tuning? https://github.com/microsoft/DialoGPT/issues/36

adamcohenhillel commented 4 years ago

Hi all, are the third-party decoders still relevant? What is the difference from the script shown on the Hugging Face website?

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a tensor in PyTorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # pretty print last output tokens from bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

Thanks!