guxd / C-DNPG

Data and code for the paper "Continuous Decomposition of Granularity for Neural Paraphrase Generation"

Data Size and Split #2

Closed MrShininnnnn closed 1 year ago

MrShininnnnn commented 1 year ago

For Quora, there are actually 149,263 samples marked as duplicates in total that can be used for paraphrasing, rather than the data size (124K) reported in the paper (100K\4K\20K). Is there a reason not to use the full dataset? Thanks.

guxd commented 1 year ago

Thanks for pointing this out! The same numbers were reported in the baseline paper https://aclanthology.org/P19-1332.pdf. We followed that paper and assumed the total data size was 124K. We actually used the 149,263 samples and split them into 100K\4K\remaining. We will update our paper to clarify this.
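
For reference, a quick back-of-the-envelope check of that split, using only the 149,263 figure mentioned above:

    # 100K train, 4K valid, everything else held out for testing
    total, train, valid = 149_263, 100_000, 4_000
    print(total - train - valid)  # 45263 pairs would remain for the test set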

MrShininnnnn commented 1 year ago

To follow up, I found that the released test set contains 20K samples rather than the "remaining" ones. You may want to update the paper as well as the public dataset.

MrShininnnnn commented 1 year ago

The baseline seems to follow the 100K\4K\20K split, while the code and datasets of that baseline paper are not published. Could you please confirm whether you use the "remaining" samples for testing, or whether both of you actually use 20K?

guxd commented 1 year ago

Sorry for the confusion! I checked the original code for data preparation. We randomly selected 100K pairs for training and 24K (4K\20K) for validation/testing, without overlap.

import csv
import random

def get_quora_data(data_path):
    """
    https://github.com/dev-chauhan/PQG-pytorch/blob/master/prepro/quora_prepro.py
    """
    # keep only question pairs labeled as duplicates; the CSV header row is
    # skipped automatically because its is_duplicate field is not '1'
    pairs = []
    f_pairs = csv.reader(open(data_path, 'r', encoding='utf-8'))
    for row in f_pairs:
        idx, qid1, qid2, q1, q2, is_duplicate = row
        if is_duplicate == '1':
            pairs.append((q1, q2))
    return pairs

data = {}
question_pairs = get_quora_data(data_path + 'questions.csv')
random.shuffle(question_pairs)
data['train'] = question_pairs[:100000]          # first 100K pairs
data['valid'] = question_pairs[-24000:-20000]    # 4K pairs
data['test']  = question_pairs[-20000:]          # last 20K pairs

We will update our paper to include these details.
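
As a sanity check of the split above (using placeholder indices in place of the actual CSV rows), the counts come out as follows, with the leftover pairs in the middle simply unused:

    # stand-in for the 149,263 shuffled duplicate pairs discussed above
    question_pairs = list(range(149_263))
    train = question_pairs[:100000]
    valid = question_pairs[-24000:-20000]
    test = question_pairs[-20000:]
    print(len(train), len(valid), len(test))  # 100000 4000 20000
    # the 25,263 pairs between the training and validation slices fall into no split
    print(len(question_pairs) - len(train) - len(valid) - len(test))  # 25263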

MrShininnnnn commented 1 year ago

Hi guxd, when I loaded the h5 file, I found that the training size of the TwitterURL dataset is 114,025 instead of the 110K reported in the paper. Could you please share how you preprocessed and split the TwitterURL dataset as well?

The same holds for the Quora dataset: the data_len of the train h5 file is actually 125,306 instead of 100K.
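
A minimal sketch of reading those sizes with h5py (the file name is just a placeholder, and the dataset keys depend on how the files were written):

    import h5py

    # placeholder file name; list every dataset in the file together with its shape
    with h5py.File('quora_train.h5', 'r') as f:
        f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))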

Thanks for the help.

guxd commented 1 year ago

Hi, here is the script for processing TwitterURL. We obtained three text files from the email response, so we used them as the train/val/test sets respectively.

import random

from tqdm import tqdm

def get_twitterurl_data(data_path, task):
    """
    https://languagenet.github.io/
    https://github.com/lanwuwei/Twitter-URL-Corpus
    used in papers: https://arxiv.org/pdf/1909.03588.pdf,
    https://aclanthology.org/2020.acl-main.535.pdf, https://aclanthology.org/D18-1421.pdf
    """
    pairs = []
    texts = open(data_path, 'r', encoding='utf-8').readlines()
    if task == 'train':
        # the paraphrase file has two tab-separated fields: source and target
        for line in tqdm(texts):
            if len(line.split('\t')) != 2:
                print(f'error@:{ascii(line)}')
                continue
            src, tar = line.split('\t')
            pairs.append((src.strip(), tar.strip()))
    else:
        # the corpus files have four tab-separated fields: source, target, rating, url;
        # keep a pair only when the rating digit (rate[1]) is greater than 3
        for line in tqdm(texts):
            if len(line.split('\t')) != 4:
                print(f'error@:{ascii(line)}')
                continue
            src, tar, rate, url = line.split('\t')
            if int(rate[1]) > 3:
                pairs.append((src.strip(), tar.strip()))
    print(len(pairs))
    return pairs

data = {}
# the corpus's official test file serves as the validation pool and its train file as the test pool
data['train'] = get_twitterurl_data(data_path + '2016_Oct_10--2017_Jan_08_paraphrase.txt', 'train')
val_data = get_twitterurl_data(data_path + 'Twitter_URL_Corpus_test.txt', 'valid')
test_data = get_twitterurl_data(data_path + 'Twitter_URL_Corpus_train.txt', 'test')
data['valid'] = val_data[:1000]       # first 1,000 filtered pairs for validation
random.shuffle(test_data)
data['test'] = test_data[:5000]       # 5,000 randomly sampled pairs for testing

guxd commented 1 year ago

I checked the revision history of my private repository and found an earlier version of the script for Quora preprocessing:

    question_pairs = get_quora_data(data_path + 'questions.csv')
    random.shuffle(question_pairs)
    data['train'] = question_pairs[:-26000]         # everything except the last 26K pairs
    data['valid'] = question_pairs[-26000:-20000]   # 6K pairs
    data['test']  = question_pairs[-20000:]         # last 20K pairs

This could be the source of the 125,306 training samples.