lukalabs / cakechat

CakeChat: Emotional Generative Dialog System
Apache License 2.0
1.7k stars 935 forks

How to prepare Corpus? #18

Closed aploleb closed 6 years ago

aploleb commented 6 years ago

I have a corpus file, but I don't know an easy way to make it satisfy "Each line of the corpus file should be a JSON object containing a list of dialog messages sorted in chronological order."

Is there a tool that can take a downloaded corpus and translate it to the CakeChat format?

stickyburn commented 6 years ago

"JSON format, sorted in chronological order" means that each line of the corpus file should be a JSON object, with the messages listed in conversational order. A dialog looks something like: [{"text" : "A", "condition": "A1"}, {"text": "B", "condition" : "A2"}, {"text": "A", "condition": "A3"}], where A and B are messages from different speakers, in chronological order. As for tools, you need something that takes your raw messages and arranges them in A-B-A-B turn order, and that also extracts a common condition from each message to pass as the 'condition' field.
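To make the A-B-A-B idea concrete, here is a minimal sketch. The messages and the "neutral" condition label are made up for illustration; the only part the repo prescribes is the output shape (one JSON list of {"text", "condition"} objects per dialog).

```python
import json

# Hypothetical example: two speakers' messages, already in chronological order.
speaker_a = ["Hi there!", "Pretty good, thanks."]
speaker_b = ["Hey, how are you?"]

# Interleave the turns into A-B-A order.
dialog = []
for i in range(max(len(speaker_a), len(speaker_b))):
    if i < len(speaker_a):
        dialog.append({"text": speaker_a[i], "condition": "neutral"})
    if i < len(speaker_b):
        dialog.append({"text": speaker_b[i], "condition": "neutral"})

# One dialog per line of the corpus file, serialized as a JSON list.
line = json.dumps(dialog)
```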

aploleb commented 6 years ago

I really don't understand what kind of tool I can even use to do that.

nikitos9000 commented 6 years ago

Hi @aploleb, thanks for your interest!

You can take any dataset of conversational turns you have, and preprocess it to produce files in the same format as in data/corpora_processed/train_processed_dialogs.txt (as stated in Training your own model in our README).

You can use the json module in Python to form this dataset, but it really depends on the format of your own dataset and your favorite toolset.
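As a sketch of that preprocessing, suppose (this raw format is an assumption, not anything CakeChat defines) your source data has one "speaker\ttext" line per message and a blank line between dialogs. Converting it to the one-JSON-object-per-line format could look like:

```python
import json

# Hypothetical raw transcript: "speaker\ttext" lines, blank line between dialogs.
raw = """alice\tHi!
bob\tHello, how are you?
alice\tGreat, thanks.

bob\tAre you coming tonight?
alice\tYes, see you at eight."""

lines_out = []
for block in raw.split("\n\n"):
    dialog = []
    for line in block.splitlines():
        speaker, text = line.split("\t", 1)
        # "neutral" is a placeholder condition; substitute your own labels.
        dialog.append({"text": text, "condition": "neutral"})
    lines_out.append(json.dumps(dialog))

# One JSON object per line, same layout as train_processed_dialogs.txt.
with open("train_processed_dialogs.txt", "w") as f:
    f.write("\n".join(lines_out))
```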

josemf commented 6 years ago

I'll use this thread since it's a related question.

Currently I have a corpus in the right JSON format and I'm able to start training on it. I've basically split the corpus into the two files train_processed_dialogs.txt and val_processed_dialogs.txt (with 30% of the corpus dialogs going to validation).

I have a question regarding the files in the quality directory. Should I use sentences that already appear in the train and validation sets, or should I split my corpus so that the sets don't share any sentences?

Thanks for sharing this framework with the community!

nikitos9000 commented 6 years ago

Hi @josemf,

It's better to keep these quality datasets out of your main train set, because those samples are used for quality evaluation of the trained model. Otherwise you'll get overfitted metrics.
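A disjoint three-way split like the advice above could be sketched as follows. The 70/20/10 proportions and the placeholder dialog lines are my own assumptions, not anything CakeChat prescribes:

```python
import random

# Placeholder data: in practice, one JSON-serialized dialog line per element.
dialogs = ["dialog_%d" % i for i in range(100)]

random.seed(0)
random.shuffle(dialogs)

n = len(dialogs)
train = dialogs[: int(0.7 * n)]
val = dialogs[int(0.7 * n): int(0.9 * n)]
quality = dialogs[int(0.9 * n):]  # held out for quality evaluation only

# No overlap between train and quality -> evaluation metrics aren't inflated.
assert not set(train) & set(quality)
```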

Hope that helps!

stickyburn commented 6 years ago

Guys, I have another question related to the corpus, so I'll post it here. If my training data has really long conversations, but not many instances of them, how will this affect training? Since the conversations are long, I am putting each one in a single JSON object. But in total I only have, say, 20 conversations, which makes for only 20 JSON objects.

nikitos9000 commented 6 years ago

@yashank09 sorry for the delayed response. I think you should just try training your model this way; I hope it will converge well if you have enough data, because no matter how many distinct conversations you have, the training set itself is built simply as a list of (dialog context -> response) pairs taken from adjacent utterances within each conversation.
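The point that a few long conversations still yield many training pairs can be sketched like this. The helper name and the context window size are illustrative assumptions, not CakeChat's actual implementation:

```python
# Expand one conversation into (context -> response) pairs from adjacent
# utterances; context_size (an assumption here) caps how far back we look.
def dialog_to_pairs(utterances, context_size=3):
    pairs = []
    for i in range(1, len(utterances)):
        context = utterances[max(0, i - context_size):i]
        pairs.append((context, utterances[i]))
    return pairs

# A single 5-utterance dialog yields 4 training pairs, so even 20 long
# conversations can produce a usable number of samples.
pairs = dialog_to_pairs(["A1", "B1", "A2", "B2", "A3"])
```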

aploleb commented 6 years ago

Can you make a tool to prepare JSON objects?

khalman-m commented 6 years ago

Hi, @aploleb

We don't know what the data format of your corpus is, so it is unclear what such a tool should do. You can prepare JSON lines using https://docs.python.org/2/library/json.html