Open hunkim opened 7 years ago
Are there guidelines for data formats?
@jybaek For DeepQA/fbdata the format is: `like_count[TAB]post[TAB]like_count_comment1[TAB]comment1[TAB]like_count_comment2[TAB]comment2[TAB]...[TAB]like_count_comment_n[TAB]comment_n[TAB]`
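A minimal sketch of parsing one line in that format (the function name is made up; this assumes a trailing TAB as shown above):

```python
def parse_fbdata_line(line):
    """Parse a TAB-separated fbdata record into (post_likes, post, comments).

    Assumed layout: like_count, post body, then alternating
    (comment like_count, comment body) pairs, with a trailing TAB.
    """
    fields = line.rstrip("\n").split("\t")
    post_likes = int(fields[0])
    post = fields[1]
    # Drop the empty field produced by the trailing TAB, then pair up the rest.
    rest = [f for f in fields[2:] if f]
    comments = [(int(rest[i]), rest[i + 1]) for i in range(0, len(rest) - 1, 2)]
    return post_likes, post, comments
```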
We need to transform this for chatbot-retrieval. For chatbot-retrieval's expected data format, please check its code; I haven't looked at it yet.
@hunkim For our dataset, `post_body` will be the context and `comment_body` will be the utterance. How should we handle posts that have many comments? Should we pick the comment with the most likes as the utterance, or duplicate the post for each comment?
@woniesong92 For `DeepQA/fbdata.py`, I used: p c1, p c2, ..., p cn (the post duplicated for each comment).
I would try several variations:
- p c (comment with the most likes)
- p c1 (first comment only)
- etc.

Good luck!
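The two pairing strategies above could be sketched like this (function names are made up; `comments` is a list of `(like_count, body)` tuples):

```python
def duplicate_post_pairs(post, comments):
    """p c1, p c2, ..., p cn: repeat the post as context for every comment."""
    return [(post, body) for _, body in comments]

def top_comment_pair(post, comments):
    """p c: keep only the comment with the most likes as the utterance."""
    _, body = max(comments, key=lambda c: c[0])
    return [(post, body)]
```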
@hunkim another PR: https://github.com/hunkimForks/chatbot-retrieval/pull/3
Do you have an idea whether I should generate a vocab to run chatbot-retrieval?
Not sure if there's a good Korean tokenizer that I can use.
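A minimal sketch of vocab generation, assuming simple whitespace tokenization as a placeholder until a Korean tokenizer is chosen (function name and `max_size` default are made up):

```python
from collections import Counter

def build_vocab(sentences, max_size=10000):
    """Count whitespace tokens and keep the most frequent ones.

    A proper Korean tokenizer should replace str.split() here.
    """
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    # Index 0 is reserved for the unknown token.
    vocab = {"<UNK>": 0}
    for token, _ in counts.most_common(max_size - 1):
        vocab[token] = len(vocab)
    return vocab
```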
Anyone interested in creating data for chatbot-retrieval?
You can use the data in https://github.com/hunkimForks/DeepQA/tree/master/data/fbdata. We need to change the data's format so it can be fed to chatbot-retrieval.