Open hunkim opened 7 years ago
Are there guidelines for data formats?
@jybaek For DeepQA/fbdata the format is: `like_count[TAB]post[TAB]like_count_comment1[TAB]comment1[TAB]like_count_comment2[TAB]comment2[TAB]...[TAB]like_count_comment_n[TAB]comment_n[TAB]`
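A minimal sketch of parsing one line in that format (the function name is made up; this assumes a trailing TAB as shown above):

```python
def parse_fbdata_line(line):
    """Parse a TAB-separated fbdata record into (post_likes, post, comments).

    Assumed layout: like_count, post body, then alternating
    (comment like_count, comment body) pairs, with a trailing TAB.
    """
    fields = line.rstrip("\n").split("\t")
    post_likes = int(fields[0])
    post = fields[1]
    # Drop the empty field produced by the trailing TAB, then pair up the rest.
    rest = [f for f in fields[2:] if f]
    comments = [(int(rest[i]), rest[i + 1]) for i in range(0, len(rest) - 1, 2)]
    return post_likes, post, comments
```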
We need to transform this for chatbot-retrieval. For chatbot-retrieval's expected data format, please check its code; I haven't looked at it yet.
@hunkim For our dataset, `post_body` will be the context and `comment_body` will be the utterance. How should we handle posts that have many comments? Should we pick the comment with the most likes as the utterance, or duplicate the post for each comment?
@woniesong92 For `DeepQA/fbdata.py`, I used: p c1, p c2, ..., p cn (the post duplicated for each comment).
I would try several variations:
- p c (comment with the most likes)
- p c1 (first comment only)
- etc.

Good luck!
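The two pairing strategies above could be sketched like this (function names are made up; `comments` is a list of `(like_count, body)` tuples):

```python
def duplicate_post_pairs(post, comments):
    """p c1, p c2, ..., p cn: repeat the post as context for every comment."""
    return [(post, body) for _, body in comments]

def top_comment_pair(post, comments):
    """p c: keep only the comment with the most likes as the utterance."""
    _, body = max(comments, key=lambda c: c[0])
    return [(post, body)]
```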
@hunkim another PR: https://github.com/hunkimForks/chatbot-retrieval/pull/3
Do you have an idea whether I should generate a vocab to run chatbot-retrieval?
Not sure if there's a good Korean tokenizer that I can use.
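A minimal sketch of vocab generation, assuming simple whitespace tokenization as a placeholder until a Korean tokenizer is chosen (function name and `max_size` default are made up):

```python
from collections import Counter

def build_vocab(sentences, max_size=10000):
    """Count whitespace tokens and keep the most frequent ones.

    A proper Korean tokenizer should replace str.split() here.
    """
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    # Index 0 is reserved for the unknown token.
    vocab = {"<UNK>": 0}
    for token, _ in counts.most_common(max_size - 1):
        vocab[token] = len(vocab)
    return vocab
```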
Anyone interested in creating data for chatbot-retrieval?
You can use the data in https://github.com/hunkimForks/DeepQA/tree/master/data/fbdata. We need to change the data's format so it can be fed to chatbot-retrieval.