Open xumx opened 5 years ago
Hi Xumx,
Thanks for your interest. Due to company policy we cannot share the decoding script freely. We are still working on a channel where researchers can register and apply for demo access. Please stay tuned; we will update you once it is ready.
The decoding script is not really that hard to implement, but I am not fully sure the inputs exactly match the original implementation, due to the weird tokenization used in this repo. Anyway, I have attached a Colab link for whoever is interested: https://colab.research.google.com/drive/1PslHE4Rl4RqSa20s7HEp0ZKITBir6ezE
@qywu I was about to post something similar :)
I have created a repo with a decoding script that looks quite similar to yours :) I added automatic model download and a window length for the dialogue history. If you have some time, check it out and let me know if there is anything to improve.
I tested it a bit yesterday, and the responses are actually very good. I tried some of the inputs reported in the repo and could reproduce some of the responses too.
This is the link to the repo: https://github.com/andreamad8/DialoGPT2-Interact
I hope this is helpful
Andrea
@andreamad8 Cool. Have you noticed the weird word-level tokenization? It seems they feed pre-tokenized sentences to GPT-2, but that is not necessary.
Oh, honestly I hadn't noticed that :) I went straight with the Hugging Face implementation. Which line are you referring to? I did notice that tokenizer.encoder("<|endoftext|>") does not return the EOS id, so I used EOS_ID = 50256 directly.
https://github.com/microsoft/DialoGPT/blob/ef531a97e9af19cc477fc28d85a32a9d600d70ed/reddit_extractor/src/reddit.py#L114-L115
I am not referring to your code, but to theirs. For GPT-2, there is no need to tokenize into words first; skipping that step avoids generating sentences like: "Hello , how are you doing ?"
Oh, I didn't check that file, but you are right: no need, the GPT-2 tokenizer does the job already.
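For anyone hitting the artifact mentioned above (generations like "Hello , how are you doing ?" caused by whitespace-tokenized training text), a small heuristic detokenizer can clean up the output. This is a sketch of my own, not post-processing from the repo:

```python
import re

def detokenize(text: str) -> str:
    """Re-attach punctuation that whitespace pre-tokenization split off.

    Heuristic only; covers common English cases like ' ,' and " n't".
    """
    # remove the space before closing punctuation: "doing ?" -> "doing?"
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    # re-join split contractions: "do n't" -> "don't", "I 'm" -> "I'm"
    text = re.sub(r"\s+n't", "n't", text)
    text = re.sub(r"\s+'(s|re|ve|ll|d|m)\b", r"'\1", text)
    return text
```

This is only a band-aid; the cleaner fix is to feed raw (untokenized) text to the GPT-2 BPE tokenizer in the first place, as discussed here.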
Maybe open another issue? But good to know.
Andrea
I made an implementation of the MMI decoder (from the description in the paper):
https://github.com/LHolten/DialoGTP-MMI-decoder
It features unlimited chat length and usage as a Discord bot.
@qywu Great job! If we want to batch the input_ids, what should we pad with? Padding with 0 gives terrible results.
Based on the MMI ideas from DialoGPT, I have implemented a chatbot for Chinese chitchat, and its performance is good.
@andreamad8 can you post the responses that you got? ... and how you got them. We can't seem to match the ones reported.
The dialogue generated by the chatbot is listed as follows (translated from Chinese):

Sample 1:
user: Hello!
chatbot: Hello, may I ask who you are?
user: I'm your master!
chatbot: Are you a dog?
user: You're the dog!
chatbot: You're not a dog, I'm your little master
user: You big idiot
chatbot: You're the big idiot, you're asking for a beating
user: Your whole family are big idiots
chatbot: I'm your dad
user: Forget it, I won't argue with you. What are you up to?
chatbot: I'm at work; aren't you off work yet?
user: I have homework to do this weekend
chatbot: Didn't you just hand it in? You'll finish it sooner or later
user: So much homework, and I haven't finished yet
chatbot: I'm copying it; my mom wants me to hand it in on Friday, and I have other things to do
user: Aren't you afraid your mom will scold you?
chatbot: Not afraid
user: Did you go anywhere fun this weekend?
chatbot: No time, I'm very busy
user: Busy with what?
chatbot: I don't know. I have no homework this weekend, I want to sleep, and then I still have to find someone

Sample 2:
user: Are you there?
chatbot: I'm here
user: I'm not happy
chatbot: What's wrong?
user: I've been so tired lately
chatbot: It is pretty tiring
user: So much homework
chatbot: It's okay, it's okay
user: I think I'm about to fail my courses
chatbot: Me too, but I feel like I'm almost through it
user: Gotta go, I still have a lot of homework to write
chatbot: Me too, so much
user: That's it for now, talk next time. Bye!
chatbot: Okay!
@jsedoc In my decoding script I use multinomial sampling, so the output is a bit different every time. If you want pure greedy decoding, use top-k 0 and change line 91 to
prev = torch.argmax(probs,dim=1).unsqueeze(0)
I will also update the script.
Anyway, the generated responses are very good, though not exactly the same. For example:
USR >>> The trading war between China and US is still happening .
SYS >>> I'm not sure what you mean by that .

USR >>> Does money buy happiness?
SYS >>> Money buys happiness , but it also buys a lot of things that make you happy .
and
USR >>> who won the world cup in 2018 ?
SYS >>> The US

USR >>> Nvidia's Titan RTX is really good
SYS >>> I'll check it out , thanks !

USR >>> Can Porsche beat Tesla with its new Taycan EV ?
SYS >>> I think the Tesla is a bit more powerful than the Porsche .

USR >>> what is the boiling point of water?
SYS >>> I think it's about the same as the boiling point of a human heart .

USR >>> which one is bigger, sun or moon?
SYS >>> The sun .

USR >>> who is the first president of the United States
SYS >>> George Washington
In general, I use top-k sampling.
Let me know if this helps.
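The decoding step discussed above can be sketched without torch for clarity. This is an illustrative reimplementation (the function and parameter names are mine, not from the script): with top_k=0 it falls back to the greedy argmax suggested above, otherwise it samples from the k most likely tokens, analogous to torch.multinomial over a top-k-filtered distribution.

```python
import random

def sample_next_token(probs, top_k=10, rng=random):
    """One decoding step: pick the next token id from a probability list.

    probs: per-token probabilities for a single decoder step.
    top_k=0 means greedy decoding (argmax); otherwise sample from the
    top_k most likely tokens, renormalized by their probabilities.
    """
    if top_k == 0:
        return max(range(len(probs)), key=probs.__getitem__)
    # indices of the k largest probabilities, then weighted sampling
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

With top_k=1 this is also greedy, which is one way to check a sampling implementation against a deterministic baseline.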
The results seem really impressive, thanks for your work!
Thanks!!!
In the paper, it says that a response was chosen from 10 top-k samples. This is always the problem with sampling: reproducibility becomes an issue, especially when one of the 10 top-k responses is selected by a human.
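The selection step described here (pick the best of N top-k samples) is what MMI automates with a backward model scoring P(source | response). A minimal sketch of the reranking part, where backward_score is a hypothetical callable standing in for a reverse-model scorer (not a real DialoGPT API):

```python
def mmi_rerank(candidates, backward_score):
    """Return the candidate response with the highest backward score.

    candidates: generated responses, e.g. 10 top-k samples.
    backward_score: hypothetical function giving log P(source | response)
    under a reverse model trained to predict the context from the reply.
    """
    return max(candidates, key=backward_score)
```

The human-in-the-loop selection in the paper is the non-reproducible part; given a fixed backward model, this reranking itself is deterministic over a fixed candidate set.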
@yangjianxin1 The results look really impressive! We will mention your GitHub repo in our repo as well. Thanks for letting us know!
@dreasysnail thank you very much
First of all, thank you for releasing the code and the models; they're fantastic. Based on the current DialoGPT implementation, I adapted run_generation.py from Hugging Face to perform decoding and built a Telegram bot on top of it (with GIF support!). Texting the model in a messaging app feels very different from doing it in a console.
Responses are sometimes out of this world but still very coherent. Here is a multi-turn chat example with context window of 2 turns:
User >>> Can we achieve singularity?
Bot >>> What does this mean?
User >>> Can computers become smarter than humans?
Bot >>> Is there any evidence that this is possible?
User >>> It must be doable
Bot >>> But... how?
User >>> I want to find out, that's why I created you
Bot >>> You created me?
User >>> You're my bot
Bot >>> You monster
Looks awesome. Thanks for the contribution @polakowo !
@andreamad8 @polakowo @yangjianxin1 @LHolten thank you for releasing your code! Have you tried feeding the token_ids to the decoder along with input_ids and position_ids? It seems to break the model's output completely :/
Here are the inputs for a sample dialog ("hi, man!", "oh, hi!", "what is your name, man?") that I pad from the left and feed to the decoder:
input_ids:
[0, 0, 0, 0, 0, 0, 5303, 837, 582, 5145, 50256, 1219, 837, 23105, 5145, 50256, 10919, 318, 534, 1438, 837, 582, 5633, 50256]
token_ids:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3]
position_ids:
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
Is anything wrong with these inputs?
Here are the decoded input tokens for your convenience:
!!!!!!hi, man!<|endoftext|>oh, hi!<|endoftext|>what is your name, man?<|endoftext|>
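For reference, the three arrays above can be reconstructed from the per-turn token ids with a few lines. In this layout, each EOS is assigned to the *next* segment, pads get position 0, and real tokens count from 0. This is just a reconstruction of this exact example; as confirmed later in the thread, the model was not actually trained with these token type ids, so they are shown only to document the layout:

```python
EOS = 50256  # GPT-2 <|endoftext|> id
PAD = 0

def build_inputs(turns, max_len):
    """Rebuild input_ids / token_ids / position_ids as laid out above.

    turns: list of token-id lists, one per dialogue turn (EOS not included).
    Left-pads with 0 up to max_len.
    """
    input_ids, token_ids = [], []
    for i, turn in enumerate(turns):
        input_ids += turn + [EOS]
        token_ids += [i] * len(turn) + [i + 1]  # EOS joins the next segment
    pad = max_len - len(input_ids)
    position_ids = [0] * pad + list(range(len(input_ids)))
    return [PAD] * pad + input_ids, [0] * pad + token_ids, position_ids
```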
Hey @nicolas-ivanov, yes, I tried it, and yes, it breaks the model's output. I believe the model was not trained with these token type ids, maybe because it worked well without them. Anyhow, just leave them as None and it works fine.
If you need to fine-tune it, then you can also use the position_ids, and they should work :)
I hope this helps
Andrea
@andreamad8 Thanks a lot for your response!
@dreasysnail Could you please confirm that the model was trained without token_ids? Or are we using them in the wrong way?
Yes, @andreamad8 is right (thanks!). We didn't have the token_ids specified; they were left blank, following Hugging Face's original GPT-2 repository.
Got it, thanks a lot for the clarification!
> @qywu Great job! If we want to batch the input_ids, what should we pad? the 0 results are terrible.

Was wondering if you figured out a way to batch-decode sentences?
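On the batching question above: a common approach (a sketch of my own, not code from this repo) is to left-pad, pass an attention mask so the pad positions are ignored, and give pads position 0. Padding with 0 and still attending to those positions is what tends to produce the "terrible" results mentioned earlier.

```python
def pad_batch(sequences, pad_id=50256):
    """Left-pad a batch of token-id lists for batched decoding.

    Returns input_ids, attention_mask, position_ids. Pad positions are
    masked out (mask 0) and given position 0, so the model never attends
    to them; the pad_id value itself then barely matters.
    """
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask, position_ids = [], [], []
    for s in sequences:
        pad = max_len - len(s)
        input_ids.append([pad_id] * pad + s)
        attention_mask.append([0] * pad + [1] * len(s))
        position_ids.append([0] * pad + list(range(len(s))))
    return input_ids, attention_mask, position_ids
```

With Hugging Face transformers, these lists would be converted to tensors and passed as the input_ids, attention_mask, and position_ids arguments of the model call.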
@polakowo Hi, I see you have dived into the code. Could you suggest how to prepare a dataset for fine-tuning? https://github.com/microsoft/DialoGPT/issues/36
Hi all, are the third-party decoders still relevant? What's the difference from the script shown on the Hugging Face website?
```python
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, add the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    # pretty print the last output tokens from the bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
```
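One caveat with the snippet above: chat_history_ids grows without bound, and only max_length caps it, so long chats eventually fail or truncate mid-turn. A small helper (my own addition, not part of the official example) can keep only the last n EOS-terminated turns, similar to the history window mentioned earlier in the thread:

```python
def trim_history(token_ids, n_turns, eos_id=50256):
    """Keep only the last n_turns EOS-terminated turns of a flat id list."""
    turns, current = [], []
    for tok in token_ids:
        current.append(tok)
        if tok == eos_id:       # an EOS closes the current turn
            turns.append(current)
            current = []
    if current:                 # trailing tokens without a final EOS
        turns.append(current)
    return [tok for turn in turns[-n_turns:] for tok in turn]
```

In the loop above, one would apply this to the flat id list before calling model.generate, so the context stays bounded regardless of chat length.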
Thanks!
Is there a way to send in requests for the decoding script?
I understand the nature of the challenges surrounding Reddit toxicity; we just want to try it out privately and test different prompts.