thu-coai / KdConv

KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Apache License 2.0

Confused about the size of the dataset #1

Closed gmftbyGMFTBY closed 4 years ago

gmftbyGMFTBY commented 4 years ago

Hi, first of all, thanks for your wonderful work.

After processing the dataset, I found that its size differs from what is claimed in the paper. The paper states that each domain contains 1.5k dialogues, but I can only obtain 1.2k per domain.

Maybe I did something wrong; could you help me troubleshoot the issue?

Thank you so much.

gmftbyGMFTBY commented 4 years ago

Specifically, my processing code is shown below:

import json
import numpy as np

def read_file(path):
    """Load one domain file and return its dialogues as lists of utterances."""
    dialogs = []
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
        for entry in data:
            # Each entry holds a list of messages; keep only the utterance text.
            utterances = [item['message'] for item in entry['messages']]
            dialogs.append(utterances)
    return dialogs

if __name__ == "__main__":
    data = []
    data.extend(read_file('music.json'))
    data.extend(read_file('movie.json'))
    data.extend(read_file('travel.json'))
    print(f'utterance size: {np.sum([len(i) for i in data])}')
    print(f'dialog size: {len(data)}')

The statistics of the dataset that I obtained: [screenshot attachment]

chujiezheng commented 4 years ago

Thanks for your interest. We have reserved the development and test sets for a follow-up competition, so only the training sets are released at present.
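This explains the gap: the released files contain only the training dialogues, while the held-out dev and test dialogues make up the remaining 0.3k per domain. A minimal sanity check of that arithmetic, assuming a 1,200/150/150 train/dev/test split per domain (the exact dev/test sizes are an assumption here, not confirmed in this thread):

```python
# Sanity check: the gap between the paper's per-domain count and the
# released training files should equal the held-out dev + test dialogues.
# The 150/150 dev/test figures below are an assumption for illustration.
PAPER_TOTAL_PER_DOMAIN = 1500   # dialogues per domain claimed in the paper
RELEASED_TRAIN = 1200           # dialogues per domain observed in the release
ASSUMED_DEV = 150               # hypothetical held-out development set
ASSUMED_TEST = 150              # hypothetical held-out test set

held_out = PAPER_TOTAL_PER_DOMAIN - RELEASED_TRAIN
print(f'held out per domain: {held_out}')
print(f'consistent with assumed split: {held_out == ASSUMED_DEV + ASSUMED_TEST}')
```

So counting only the released files yields 1.2k dialogues per domain, matching the observation above, while the paper's 1.5k refers to the full dataset.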

gmftbyGMFTBY commented 4 years ago

Thank you