ncoop57 / i-am-a-nerd

Apache License 2.0
32 stars 8 forks

Open-Dialog Chatbots for Learning New Languages [Part 1] | IAmANerd #4

utterances-bot opened 4 years ago

utterances-bot commented 4 years ago

Open-Dialog Chatbots for Learning New Languages [Part 1] | IAmANerd

How to fine-tune the DialoGPT model on a new dataset or language for open-dialog conversational chatbots.

https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html

virattt commented 4 years ago

Thank you! Both this notebook and the Data Preprocessing colab are incredibly helpful.

ncoop57 commented 4 years ago

@virattt Glad it was helpful!

monisha08041998 commented 4 years ago

Is there a limit on the length of the conversations?

ncoop57 commented 4 years ago

@monisha08041998 It was only trained on conversations with a max length of 9 turns, so going beyond that may lead to poor results. Also, GPT-2 considers at most 512 tokens of the conversation when predicting new tokens, so even if you go beyond that, it can only use the most recent 512 tokens as context.
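As a rough sketch of what that truncation looks like (my own illustration using the Hugging Face transformers API, not code from the post; the turns and cutoff are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

# Illustrative conversation: oldest turn first, newest last.
turns = ["hola", "¿cómo estás?", "muy bien, ¿y tú?"]

# DialoGPT separates turns with the EOS token.
ids = []
for turn in turns:
    ids.extend(tokenizer.encode(turn + tokenizer.eos_token))

# Keep only the most recent 512 tokens, since that is all
# GPT-2 can attend to when predicting the next token.
max_length = 512
ids = ids[-max_length:]
```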

bhuvan1643 commented 4 years ago

How are the pre-trained weights going to help here, since the new data is completely different?

bhuvan1643 commented 4 years ago

How did you use the pre-trained tokenizer here? The pre-trained one contains only English words, but the data here is Spanish.

ncoop57 commented 4 years ago

@bhuvan1643 DialoGPT used the original GPT-2 model, pre-trained weights, and tokenizer. Even though the vast majority of the pre-training data was English, it still contained some Spanish text, and therefore the necessary Spanish characters/words.

I am not 100% sure the pre-trained weights help with modeling the Spanish language. However, Spanish shares a lot of vocabulary and grammatical structure with English, since English borrowed heavily from French and Latin, close relatives of Spanish. That overlap may help the model transfer some of its knowledge from English to Spanish.

I'm not sure how well this would work on non-Romance languages like Chinese, Hindi, etc., since there is almost no overlap, even if you converted the words/characters to their Latin-script versions.
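As a quick illustration of the tokenizer point (my own sketch, not from the post): GPT-2's byte-level BPE can encode any Unicode text, so Spanish never produces unknown tokens; rare non-English words just get split into more, smaller pieces:

```python
from transformers import AutoTokenizer

# GPT-2's byte-level BPE never emits <unk>: every Unicode string is
# representable, though non-English text tends to need more pieces.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("hello there"))   # few, word-level pieces
print(tokenizer.tokenize("¿cómo estás?"))  # more, sub-word pieces
```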

TheHmmka commented 4 years ago

Where did you train the large model? Is there any cloud service or something like that?

ncoop57 commented 3 years ago

@TheHmmka I trained the larger model on one of my school's machines, which had four 1080 Tis. I'm sure you could train it on a cloud service relatively easily, but I've never had experience with those.
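For reference, a minimal sketch of one common way to use multiple GPUs in PyTorch for this kind of fine-tuning (an assumption on my part, not necessarily the exact setup used here): wrap the model in torch.nn.DataParallel when more than one GPU is visible.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Replicate the model across all visible GPUs; DataParallel splits each
# batch along dim 0 and gathers the outputs automatically.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
```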

etrigger commented 3 years ago

I cannot download the data (subtitles of Spanish TV shows), and the script to generate a CSV cannot be accessed either. Can you please fix them? Thanks

ncoop57 commented 3 years ago

Hey @etrigger, could you show me the error you are getting when trying to download or generate the data? I tried to reproduce this, but it was working for me.

etrigger commented 3 years ago

I can download the data now, but the script can't be opened. https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing

Here is the error: A network error occurred and the request could not be completed.

https://drive.google.com/drive/?action=locate&id=1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS&authuser=0

GapiError: A network error occurred and the request could not be completed.
    at pz.Vs [as constructor] (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:704:150)
    at new pz (external_polymer_binary_l10n__zh_cn.js:1225:318)
    at Da.program_ (external_polymer_binary_l10n__zh_cn.js:1359:470)
    at Fa (external_polymer_binary_l10n__zh_cn.js:19:336)
    at Da.throw_ (external_polymer_binary_l10n__zh_cn.js:18:402)
    at Ia.throw (external_polymer_binary_l10n__zh_cn.js:20:248)
    at g (external_polymer_binary_l10n__zh_cn.js:62:155)

ncoop57 commented 3 years ago

@etrigger what an interesting error. I did a bit of digging, and it seems to be an issue with Colab in certain situations. Here is an issue about it: https://github.com/googlecolab/colabtools/issues/1771, but it seems like it just automagically got fixed for the person who opened it. I'd recommend trying a different browser, or incognito mode in the browser you are using, to see if that fixes it. I don't think there is anything I can do from my side other than giving you access to a converted Python script so you can download it yourself. Here is a link where you can download the file and run it locally if you want (though be careful, because it takes a lot of compute, networking, and memory to generate the CSV, especially for languages that have a ton of examples): https://drive.google.com/file/d/1qvIh3zztJT7TelMYLdahOoGmypw398VD/view?usp=sharing

etrigger commented 3 years ago

@ncoop57 Thanks for the script.

etrigger commented 3 years ago

@ncoop57 A question on preparing the training data format: I have dialog data where each line has sentence A (source) followed by sentence B (target). How should I organize the data for training?

ncoop57 commented 3 years ago

@etrigger I describe the format that DialoGPT requires in the data section of my blog: https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html#The-Data!. I recommend first getting your data into the format my code expects (each column holding a different response) and then tossing it into that function to generate the necessary input data for your model.
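A rough sketch of one way to do that conversion (my own sketch, assuming a two-column CSV of source/target pairs; the file paths and column names are illustrative, so check them against the format described in the post):

```python
import pandas as pd

# Illustrative input: one source/target pair per row.
pairs = pd.read_csv("dialog_pairs.csv")  # columns: "source", "target"

n_context = 7  # number of previous turns to keep as context

rows = []
for _, row in pairs.iterrows():
    # Treat each pair as a two-turn conversation: the response plus one
    # turn of context. The remaining context columns are left empty.
    rows.append([row["target"], row["source"]] + [""] * (n_context - 1))

columns = ["response"] + [f"context/{i}" for i in range(n_context)]
df = pd.DataFrame(rows, columns=columns)
df.to_csv("train_data.csv", index=False)
```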

berkozg96 commented 3 years ago

I have a question about the defined train and evaluate functions. Both have: inputs, labels = (batch, batch) meaning that inputs and labels are exactly the same. My question is: Shouldn't the model try to learn how to respond to the given input? I feel like there is something wrong with that in this case.

Viile1 commented 1 year ago

Setting inputs and labels to the same tensor is actually standard for causal language modeling, which is what is happening here. GPT-2 is trained to predict the next token at every position, and GPT2LMHeadModel shifts the labels internally: the logits at position t are scored against the token at position t + 1. So passing labels = input_ids does not teach the model to copy its input; it teaches it to continue the sequence, token by token.

Because each training example is a whole conversation (the context turns plus the response, separated by EOS tokens), the model learns to generate the response as the continuation of the context. There is no need for separate input and label columns the way there would be in an encoder-decoder setup.
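A small sketch of what happens under the hood (my own illustration of the Hugging Face behavior, not code from the post): the loss the model returns with labels=input_ids equals next-token cross-entropy on the shifted sequence.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

batch = tokenizer("hola" + tokenizer.eos_token + "¿cómo estás?",
                  return_tensors="pt").input_ids

# Same tensor as inputs and labels: the model shifts the labels
# internally, so position t is scored against token t + 1.
out = model(batch, labels=batch)

# Equivalent loss computed by hand with an explicit shift.
logits = out.logits[:, :-1, :]
targets = batch[:, 1:]
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
assert torch.isclose(out.loss, loss)
```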