Closed — aploleb closed this issue 6 years ago
"The JSON format, sorted in chronological order" means that each line of the corpus should be a JSON object whose messages are given as conversational turns. A dialog should look something like:
[{"text": "A", "condition": "A1"}, {"text": "B", "condition": "A2"}, {"text": "A", "condition": "A3"}]
where A and B are different speakers' messages in chronological order. As for tooling, you need something that takes the message sentences and arranges them in A B A B turn order, and that also extracts a common condition from each sentence to pass as the 'condition' field.
I really don't understand what kind of tool I can even use to do that.
Hi @aploleb, thanks for your interest!
You can take any dataset of conversational turns you have, and preprocess it to produce files in the same format as in data/corpora_processed/train_processed_dialogs.txt (as stated in Training your own model in our README).
You can use the `json` module in Python to build this dataset, but it really depends on the format of your own dataset and your favorite toolset.
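To make the suggestion above concrete, here is a minimal sketch of using the `json` module to write dialogs in the one-JSON-object-per-line format described earlier. The input conversations and condition labels are invented placeholders; adapt the reading side to whatever format your own corpus is in.

```python
import json

# Hypothetical raw data: each conversation is an ordered list of
# (message, condition) tuples -- replace with your own corpus loader.
dialogs = [
    [("Hi, how are you?", "neutral"), ("Great, thanks!", "joy")],
    [("Where were you?", "anger"), ("Sorry, I got stuck in traffic.", "neutral")],
]

# Each output line is one JSON-encoded dialog: a list of
# {"text": ..., "condition": ...} objects in chronological order.
with open("train_processed_dialogs.txt", "w") as f:
    for dialog in dialogs:
        line = json.dumps([{"text": t, "condition": c} for t, c in dialog])
        f.write(line + "\n")
```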
I'll use this thread since it's a related question.
Currently I have a corpus in the right JSON format and I'm able to start training on it. I've split the corpus into train_processed_dialogs.txt and val_processed_dialogs.txt (the latter holding 30% of the corpus dialogs).
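For reference, a 70/30 split like the one described above can be sketched as follows. The input file name and the placeholder dialogs are assumptions; the important detail is that you shuffle and split whole dialogs, never individual messages, so each conversation stays intact in one set.

```python
import json
import random

# Hypothetical input: all dialogs already in the JSON-lines format,
# one JSON-encoded dialog per line.
dialogs = [
    json.dumps([{"text": "message %d" % i, "condition": "neutral"}])
    for i in range(10)
]

random.seed(0)           # reproducible shuffle
random.shuffle(dialogs)  # shuffle whole dialogs, not individual messages

split = int(len(dialogs) * 0.7)  # 70% train / 30% validation
train, val = dialogs[:split], dialogs[split:]

with open("train_processed_dialogs.txt", "w") as f:
    f.write("\n".join(train) + "\n")
with open("val_processed_dialogs.txt", "w") as f:
    f.write("\n".join(val) + "\n")
```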
I have a question regarding the files in the quality directory. Should I reuse sentences that are already in the train and validation sets, or should I split my corpus so that the sets share no sentences at all?
Thanks for sharing this framework with the community!
Hi @josemf,
It's better to keep these quality datasets out of your main train set, because those samples are used for quality evaluation of the trained model; otherwise you can get overfitted metrics.
Hope that helps!
Guys, I have another question related to the corpus, so I'll post here. If my training data has really long conversations, but not many instances of them, how will this affect my training? Since I have long conversations, I'm putting each of them in one JSON object. But in total I only have, say, 20 conversations, which makes it only 20 JSON objects.
@yashank09 sorry for a delayed response. I think you should just try to train your model this way — I hope it will converge well if you have enough data because no matter how many different conversations you have, the training set itself is built just as a list of (dialog context -> response) pairs, taken from adjacent utterances in the given conversation.
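The point above — that a few long conversations still yield many training pairs — can be illustrated with a rough sketch. The context window of 3 utterances is an illustrative assumption, not CakeChat's actual setting.

```python
# Expand one conversation into (dialog context -> response) pairs taken
# from adjacent utterances, as described above.
def dialog_to_pairs(utterances, context_size=3):
    pairs = []
    for i in range(1, len(utterances)):
        # The context is up to `context_size` utterances preceding turn i.
        context = utterances[max(0, i - context_size):i]
        pairs.append((context, utterances[i]))
    return pairs

# A single 20-turn conversation already yields 19 training pairs.
long_dialog = ["utterance %d" % i for i in range(20)]
pairs = dialog_to_pairs(long_dialog)
print(len(pairs))  # 19
```

So 20 conversations of, say, 50 turns each would produce roughly a thousand pairs, which is why a small number of long dialogs can still train reasonably well.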
Can you make a tool to prepare JSON objects?
Hi, @aploleb
We don't know what the data format of your corpus is, so it's unclear what this tool should do. You can prepare JSON lines using https://docs.python.org/2/library/json.html
I have a corpus file, but I don't know how to easily satisfy the requirement that "Each line of the corpus file should be a JSON object containing a list of dialog messages sorted in chronological order."
Is there a tool that can take a downloaded corpus and translate it to the CakeChat format?