oswaldoludwig / Seq2seq-Chatbot-for-Keras

This repository contains a new generative model of chatbot based on seq2seq modeling.
Apache License 2.0

Can't figure out the format for the input data #3

Closed jld23 closed 7 years ago

jld23 commented 7 years ago

This is a great example! I'm trying to train my own set of conversations based on your code, but I can't find the format for movie_data.txt that is referenced in split_qa.py (line 8).

Can you put a sample of that data or describe what my input data needs to look like to work with your infrastructure?

Thanks!

oswaldoludwig commented 7 years ago

Hi

Thanks! I made available the training data that I collected for this pre-trained model; see https://www.dropbox.com/sh/o0rze9dulwmon8b/AAA6g6QoKM8hBEHGst6W4JGDa?dl=0 . You can now repeat the whole process to obtain the trained bot. You will see that the bot can chat with you after only 50 training epochs (about half an hour on a GPU with the learning rate set to lr=0.0005). This is not a usual seq2seq model; this model can learn faster. I changed split_qa.py (line 8) to set the correct file name.
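The exact layout of movie_data.txt is not shown in the thread, so this is only a guess at what split_qa.py's pairing step might look like: one utterance per line, with each line paired with the next one as a (question, answer) example. The function name `pair_utterances` and the sliding-window pairing are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical sketch: pair consecutive utterances of a dialogue into
# (question, answer) training examples, as a script like split_qa.py
# might do. The pairing scheme here is an assumption.

def pair_utterances(lines):
    """Return (question, answer) pairs from consecutive non-empty lines."""
    utterances = [ln.strip() for ln in lines if ln.strip()]
    return [(utterances[i], utterances[i + 1])
            for i in range(len(utterances) - 1)]

sample = [
    "Hello, how are you?",
    "I'm fine, thanks.",
    "What are you doing today?",
]
print(pair_utterances(sample))
```

With the three sample lines above, this yields two overlapping pairs, so every utterance except the first also serves as an "answer" target.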

Good luck!

jld23 commented 7 years ago

Thank you! Starting at line 2752, the file begins to have "B:" and "A:" prefixes. Can you explain the significance of those prefixes? I don't see anything in the code.

oswaldoludwig commented 7 years ago

They denote person A and person B. These prefixes are filtered out by the algorithm, so they don't matter.
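The filtering described above could look something like the sketch below: strip a leading "A:" or "B:" speaker tag before the text is tokenized. The regex and the function name `strip_speaker` are assumptions for illustration; the repository's actual filtering code is not quoted in this thread.

```python
import re

# Hypothetical sketch: remove a leading "A:" or "B:" speaker tag
# (plus any following whitespace) from an utterance before training.
SPEAKER_TAG = re.compile(r"^[AB]:\s*")

def strip_speaker(line):
    """Return the utterance text without its speaker prefix, if any."""
    return SPEAKER_TAG.sub("", line.strip())

print(strip_speaker("A: Hi there!"))   # Hi there!
print(strip_speaker("B: Hello."))      # Hello.
print(strip_speaker("No prefix here"))  # No prefix here
```

Lines without a prefix pass through unchanged, which matches the remark that the tags only appear from a certain point in the file onward.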

iuria21 commented 5 years ago

Hi, thanks for this repo. I was looking at `dialog_simple` and I was wondering if there is some key to separate one dialogue from another, or whether the whole file is taken as a single dialogue. Thanks again!