utterances-bot opened 4 years ago
Thank you! Both this notebook and the Data Preprocessing colab are incredibly helpful.
@virattt Glad it was helpful!
Is there a limit on the length of the conversations?
@monisha08041998 It was only trained on a maximum of 9 conversation turns, so going beyond that may lead to poor results. Also, the maximum length of the entire conversation that GPT-2 will consider when predicting new tokens is 512 tokens, so even if you go beyond that, it can only use that many tokens as context.
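For example, here is a minimal sketch of that truncation idea (the names are illustrative, not the notebook's exact code):

```python
# Sketch: when chatting, keep appending turns to the history, but only
# feed the model the most recent 512 tokens of it.
MAX_CONTEXT = 512

def build_model_input(history_ids, new_turn_ids):
    """history_ids and new_turn_ids are plain lists of token ids."""
    ids = history_ids + new_turn_ids
    return ids[-MAX_CONTEXT:]  # older tokens fall out of the context window
```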
How are the pre-trained weights going to help here, since the new data is completely different?
And how did you use the pre-trained tokenizer here? The pre-trained one contains only English words, but the data here is Spanish.
@bhuvan1643 DialoGPT used the original GPT2 model, pretrained weights, and tokenizer. Even though the vast majority of the data was English, it still contained some Spanish text and therefore the necessary Spanish characters/words.
I am not 100% sure how much the pre-trained weights help with modeling Spanish specifically. However, Spanish has a lot of overlap in vocabulary and grammatical structure with English, since English borrows heavily from Romance languages such as French and Latin. This overlap may help the model transfer its knowledge from English to Spanish.
I'm not sure how well this would work on unrelated languages like Chinese, Hindi, etc., since there is almost no overlap even if you converted the words/characters to their Latin versions.
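You can check the tokenizer part directly: GPT-2 uses a byte-level BPE, so it can encode any Unicode text, Spanish included; words that were rare in the English-heavy pretraining data just get split into more subword pieces. A quick sanity check (assuming the Hugging Face transformers library):

```python
from transformers import AutoTokenizer

# GPT-2's byte-level BPE encodes arbitrary Unicode text; Spanish words
# simply get broken into more subword pieces than common English words.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("¿Cómo estás? Me gusta aprender idiomas."))
```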
Where did you train the large model? Is there any cloud service or something like that?
@TheHmmka I trained the larger model on one of my school's machines, which had four 1080 Tis. I'm sure you could train it on a cloud service relatively easily, but I've never had experience with those.
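If you do have a multi-GPU machine available, one simple way to use all the cards is to wrap the model before training; this is a minimal sketch of the idea, not the exact training script from the post:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: split each batch across all visible GPUs on a single machine.
model = AutoModelForCausalLM.from_pretrained("gpt2")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model = model.to("cuda")
```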
I cannot download the data (subtitles of Spanish TV shows), and the script to generate a CSV cannot be accessed either. Can you please fix them? Thanks.
Hey @etrigger, could you show me the error you are getting when trying to download or generate the data? I tried to reproduce this, but it was working for me.
I can download the data now, but the script can't be opened. https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing
Here is the error:

https://drive.google.com/drive/?action=locate&id=1kKErlSSpewQbWexYpMx69ZS&authuser=0

```
GapiError: A network error occurred and the request could not be completed.
```
@etrigger what an interesting error. I did a bit of digging and it seems to be an issue with colab in certain situations. Here is an issue about it: https://github.com/googlecolab/colabtools/issues/1771, but it seems like it just automagically got fixed for the person who opened it. I'd recommend trying a different browser, or incognito mode in the browser you are using, to see if that fixes it. I don't think there is anything I can do from my side other than giving you access to a converted Python script so you can download it yourself. Here is a link where you can download the file and run it locally if you want (though be careful, because it takes a lot of compute, network bandwidth, and memory to generate the CSV, especially for languages that have a ton of examples): https://drive.google.com/file/d/1qvIh3zztJT7TelMYLdahOoGmypw398VD/view?usp=sharing
@ncoop57 Thanks for the script.
@ncoop57 A question on preparing the training data format: I have dialog data where each line has sentence A (source) followed by sentence B (target). How should I organize the data for training?
@etrigger I describe the format that DialoGPT requires in the data section of my blog: https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html#The-Data!. I recommend first getting your data into the format my code expects (each column holding a different response) and then tossing it into that function to generate the necessary input data for your model.
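As a starting point, here is a hypothetical sketch of that conversion, assuming a tab-separated file of source/target pairs (the file name and column names are made up for illustration):

```python
import pandas as pd
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

# Read "sentence A<TAB>sentence B" lines into context/response rows.
rows = []
with open("pairs.tsv", encoding="utf-8") as f:
    for line in f:
        source, target = line.rstrip("\n").split("\t")
        rows.append({"context": source, "response": target})

df = pd.DataFrame(rows)

# DialoGPT-style training string: turns concatenated oldest-first,
# each turn terminated by the EOS token.
df["text"] = df["context"] + tok.eos_token + df["response"] + tok.eos_token
```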
I have a question about the defined train and evaluate functions. Both have:
```python
inputs, labels = (batch, batch)
```
meaning that inputs and labels are exactly the same. My question is: Shouldn't the model try to learn how to respond to the given input? I feel like there is something wrong with that in this case.
Actually, for causal language-model fine-tuning like this, passing the same tensor as both inputs and labels is the standard setup. GPT-2 is trained to predict the next token given all previous tokens, and the Hugging Face implementation shifts the labels internally by one position: the label for position t is the token at position t+1. So even though `inputs` and `labels` are identical going in, the model is never shown the token it is asked to predict.
Because each training example packs a whole conversation into one sequence, the response turns are learned conditioned on the prompt turns that precede them, so there is no need for separate input and label columns. The model isn't just memorizing the batch; it is learning next-token prediction over it, which is exactly what lets it generate a response to a new prompt.
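A quick way to convince yourself of the internal shift (a minimal sketch using the Hugging Face API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
model.eval()

ids = tok("Hello, how are you?" + tok.eos_token, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, labels=ids)  # same tensor for inputs and labels

# Equivalent manual loss: predict token t+1 from tokens up to t.
logits = out.logits[:, :-1, :]   # no target exists for the final position
targets = ids[:, 1:]             # nothing predicts the very first token
manual = torch.nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
)
print(out.loss.item(), manual.item())  # the two values match
```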
Open-Dialog Chatbots for Learning New Languages [Part 1] | IAmANerd
How to fine-tune the DialoGPT model on a new dataset or language for open-dialog conversational chatbots.
https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html