nlpyang / BertSum

Code for paper Fine-tune BERT for Extractive Summarization
Apache License 2.0

Testing on unseen/new data #60

Open tanmaypandey7 opened 5 years ago

tanmaypandey7 commented 5 years ago

Hi, I want to test on unseen data, i.e. data that doesn't have a reference summary in the first place, and I would like to know the steps to do so. I was following https://github.com/nlpyang/BertSum/issues/31 but I hit a dead end after formatting my data like the sample file.

Santosh-Gupta commented 5 years ago

I am wondering if you could just leave the summaries blank, or fill them with a simple placeholder token. I believe the summary is not used at all during inference.

tanmaypandey7 commented 5 years ago

I am having trouble running the format_to_bert function. For example, my text looks like this:

text ="""Hafiz Saeed, 2008 Mumbai terror attacks mastermind and Jamaat-ud-Dawah (JuD) chief, was arrested on Wednesday by the Counter Terrorism Department (CTD) of Pakistan’s Punjab Province, officials said.

Saeed, who has several cases pending against him, was travelling to Gujranwala from Lahore to appear before an anti-terrorism court when the arrest took place, officials said, adding that he had been moved to an unknown location.

The JuD is believed to be the front organisation for the Lashkar-e-Taiba (LeT), which is responsible for the attacks that killed 166 people.

The U.S. Department of the Treasury has designated Saeed as a Specially Designated Global Terrorist, and the U.S., since 2012, has offered a $10 million reward for information that brings him to justice.

Under pressure from the international community, the Pakistani authorities have launched investigation into matters of the JuD and LeT regarding their holding and use of trusts to raise funds for terror-financing."""

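My function to convert it to the sample file format:

from nltk.tokenize import word_tokenize

def to_bertsum_format(text):
    # Split the text on every period, then tokenize each piece into words.
    tokenized_text = [word_tokenize(x) for x in text.split('.')]
    return tokenized_text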

processed text looks like: [['Hafiz', 'Saeed', ',', '2008', 'Mumbai', 'terror', 'attacks', 'mastermind', 'and', 'Jamaat-ud-Dawah', '(', 'JuD', ')', 'chief', ',', 'was', 'arrested', 'on', 'Wednesday', 'by', 'the', 'Counter', 'Terrorism', 'Department', '(', 'CTD', ')', 'of', 'Pakistan', '’', 's', 'Punjab', 'Province', ',', 'officials', 'said'], ['Saeed', ',', 'who', 'has', 'several', 'cases', 'pending', 'against', 'him', ',', 'was', 'travelling', 'to', 'Gujranwala', 'from', 'Lahore', 'to', 'appear', 'before', 'an', 'anti-terrorism', 'court', 'when', 'the', 'arrest', 'took', 'place', ',', 'officials', 'said', ',', 'adding', 'that', 'he', 'had', 'been', 'moved', 'to', 'an', 'unknown', 'location'], ['The', 'JuD', 'is', 'believed', 'to', 'be', 'the', 'front', 'organisation', 'for', 'the', 'Lashkar-e-Taiba', '(', 'LeT', ')', ',', 'which', 'is', 'responsible', 'for', 'the', 'attacks', 'that', 'killed', '166', 'people'], ['The', 'U'], ['S'], ['Department', 'of', 'the', 'Treasury', 'has', 'designated', 'Saeed', 'as', 'a', 'Specially', 'Designated', 'Global', 'Terrorist', ',', 'and', 'the', 'U'], ['S'], [',', 'since', '2012', ',', 'has', 'offered', 'a', '$', '10', 'million', 'reward', 'for', 'information', 'that', 'brings', 'him', 'to', 'justice'], ['Under', 'pressure', 'from', 'the', 'international', 'community', ',', 'the', 'Pakistani', 'authorities', 'have', 'launched', 'investigation', 'into', 'matters', 'of', 'the', 'JuD', 'and', 'LeT', 'regarding', 'their', 'holding', 'and', 'use', 'of', 'trusts', 'to', 'raise', 'funds', 'for', 'terror-financing'], []]

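Here is the format_to_bert function, which takes an args object:

import glob
from multiprocessing import Pool
from os.path import join as pjoin


def format_to_bert(args):
    # Process a single split if args.dataset is set, otherwise all three.
    if (args.dataset != ''):
        datasets = [args.dataset]
    else:
        datasets = ['train', 'valid', 'test']
    for corpus_type in datasets:
        a_lst = []
        # Pick up files named like <prefix><corpus_type>.<shard>.json under raw_path.
        for json_f in glob.glob(pjoin(args.raw_path, '*' + corpus_type + '.*.json')):
            real_name = json_f.split('/')[-1]
            # Each job: (input json, args, output path ending in .bert.pt).
            a_lst.append((json_f, args, pjoin(args.save_path, real_name.replace('json', 'bert.pt'))))
        print(a_lst)
        # _format_to_bert (defined elsewhere in the repo) converts one file per call.
        pool = Pool(args.n_cpus)
        for d in pool.imap(_format_to_bert, a_lst):
            pass

        pool.close()
        pool.join()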

I am not sure what arguments it needs or what I have to do next. Any help would be appreciated.
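One thing visible in the processed text above: splitting on every period also cuts "U.S." into ['The', 'U'] and ['S'], and leaves an empty list at the end. A minimal sketch of one way around this, assuming that blank lines separate sentences (as they do in the example text). The helper names are hypothetical, and the regex tokenizer is only a stand-in for word_tokenize so that the sketch runs without NLTK; NLTK's sent_tokenize would be another option for the sentence split.

```python
import re


def split_sentences(text):
    # Assumption: each blank-line-separated paragraph is one sentence,
    # as in the example text in this thread.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def simple_tokenize(sentence):
    # Stand-in for nltk's word_tokenize (not the same algorithm):
    # keeps dotted abbreviations such as "U.S." and hyphenated words whole,
    # and emits every other punctuation mark as its own token.
    return re.findall(r"(?:[A-Za-z]\.){2,}|\w+(?:-\w+)*|[^\w\s]", sentence)


def to_bertsum_src(text):
    # One token list per sentence; empty sentences are never produced.
    return [simple_tokenize(s) for s in split_sentences(text)]


example = "The U.S. Department acted.\n\nSaeed was arrested."
src = to_bertsum_src(example)
```

With this split, the abbreviation stays a single token and no empty sentence is appended at the end.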

Santosh-Gupta commented 5 years ago

The text you're sending to format_to_bert doesn't seem to be in the right format.

Check out the sample format here. https://github.com/nlpyang/BertSum/issues/61

I noticed your sample text didn't have an "@highlight" token. I think it's worth a try to add it in.

An issue you may run into later: preprocessing in format_to_lines mode asks for a MAP_PATH, which is the directory containing the urls files (../urls).

A sample of the urls files is here:

https://raw.githubusercontent.com/nlpyang/BertSum/master/urls/mapping_test.txt

However, I am using new data, which doesn't have a urls file.

Did anyone find a workaround?

nlpyang commented 5 years ago

Hi @Santosh-Gupta,

The format_to_lines function is designed specifically for the CNNDM dataset, which has @highlight symbols and url mappings. It makes no sense with your own data.

Again, I suggest you skip this function and run format_to_bert directly.

The sample file for format_to_bert is in the json_data directory.

nlpyang commented 5 years ago

Hi @tanmaypandey7,

At the moment, this code does not do what you want, but you can just use some fake reference summaries.

To run format_to_bert, please format your data like the sample file in the json_data directory.
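As a rough sketch of that suggestion: the snippet below writes tokenized documents with a placeholder summary to a JSON file. The 'src' and 'tgt' keys follow the file layout posted in this thread; the helper name, the placeholder token and the temporary directory are assumptions for illustration. The file is named test.0.json so that the '*' + corpus_type + '.*.json' pattern in the quoted format_to_bert would find it, and json.dump is used because printing a Python list (single quotes) does not produce valid JSON.

```python
import json
import os
import tempfile
from os.path import join as pjoin


def write_bertsum_json(documents, raw_path, corpus_type="test", shard=0):
    # documents: list of documents, each a list of token lists (sentences).
    records = []
    for sentences in documents:
        src = [s for s in sentences if s]  # drop empty sentences
        # Fake reference summary; the token itself is an arbitrary placeholder.
        records.append({"src": src, "tgt": [["placeholder"]]})
    # File name pattern: <corpus_type>.<shard>.json, e.g. test.0.json
    out_file = pjoin(raw_path, "{}.{}.json".format(corpus_type, shard))
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    return out_file


raw_path = tempfile.mkdtemp()  # stands in for your JSON_PATH directory
docs = [[["The", "JuD", "is", "believed"], ["Saeed", "was", "arrested"], []]]
out_file = write_bertsum_json(docs, raw_path)

# Reading the file back confirms it is valid JSON.
with open(out_file, encoding="utf-8") as f:
    loaded = json.load(f)
```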

tanmaypandey7 commented 5 years ago

Hi, my file is this:

[{'src': [['Hafiz', 'Saeed', ',', '2008', 'Mumbai', 'terror', 'attacks', 'mastermind', 'and', 'Jamaat-ud-Dawah', '(', 'JuD', ')', 'chief', ',', 'was', 'arrested', 'on', 'Wednesday', 'by', 'the', 'Counter', 'Terrorism', 'Department', '(', 'CTD', ')', 'of', 'Pakistan', '’', 's', 'Punjab', 'Province', ',', 'officials', 'said'], ['Saeed', ',', 'who', 'has', 'several', 'cases', 'pending', 'against', 'him', ',', 'was', 'travelling', 'to', 'Gujranwala', 'from', 'Lahore', 'to', 'appear', 'before', 'an', 'anti-terrorism', 'court', 'when', 'the', 'arrest', 'took', 'place', ',', 'officials', 'said', ',', 'adding', 'that', 'he', 'had', 'been', 'moved', 'to', 'an', 'unknown', 'location'], ['The', 'JuD', 'is', 'believed', 'to', 'be', 'the', 'front', 'organisation', 'for', 'the', 'Lashkar-e-Taiba', '(', 'LeT', ')', ',', 'which', 'is', 'responsible', 'for', 'the', 'attacks', 'that', 'killed', '166', 'people'], ['The', 'U'], ['S'], ['Department', 'of', 'the', 'Treasury', 'has', 'designated', 'Saeed', 'as', 'a', 'Specially', 'Designated', 'Global', 'Terrorist', ',', 'and', 'the', 'U'], ['S'], [',', 'since', '2012', ',', 'has', 'offered', 'a', '$', '10', 'million', 'reward', 'for', 'information', 'that', 'brings', 'him', 'to', 'justice'], ['Under', 'pressure', 'from', 'the', 'international', 'community', ',', 'the', 'Pakistani', 'authorities', 'have', 'launched', 'investigation', 'into', 'matters', 'of', 'the', 'JuD', 'and', 'LeT', 'regarding', 'their', 'holding', 'and', 'use', 'of', 'trusts', 'to', 'raise', 'funds', 'for', 'terror-financing'], []], 'tgt': [['NaN']]}]

and format_to_bert is the same function I posted above.

My Python is rusty. Could you tell me how to run the function and pass arguments to it?

nlpyang commented 5 years ago

Why can't you run step 5 in the README? This function is not designed to be used as an API.
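For readers wondering what args is: judging only from the function quoted in this thread, it is an object with the attributes dataset, raw_path, save_path and n_cpus, normally filled in from the command line. The sketch below mirrors just the file-discovery loop, so you can check which input files would be picked up and where the output would go. plan_bert_jobs is a hypothetical helper and not part of the repo. The worker _format_to_bert is not shown here and may read further attributes from args, which is a good reason to go through the README command rather than call the function by hand.

```python
import argparse
import glob
import os
import tempfile
from os.path import join as pjoin


def plan_bert_jobs(args):
    # Same selection logic as the quoted format_to_bert, without the conversion.
    datasets = [args.dataset] if args.dataset != '' else ['train', 'valid', 'test']
    jobs = []
    for corpus_type in datasets:
        pattern = pjoin(args.raw_path, '*' + corpus_type + '.*.json')
        for json_f in sorted(glob.glob(pattern)):
            real_name = os.path.basename(json_f)
            out_f = pjoin(args.save_path, real_name.replace('json', 'bert.pt'))
            jobs.append((json_f, out_f))
    return jobs


# Demo directory with one file that matches the pattern and one that does not.
raw_dir = tempfile.mkdtemp()
open(pjoin(raw_dir, 'test.0.json'), 'w').close()
open(pjoin(raw_dir, 'mydata.json'), 'w').close()

# argparse.Namespace gives attribute access, like the parsed command-line args.
args = argparse.Namespace(dataset='test', raw_path=raw_dir,
                          save_path='bert_data', n_cpus=1)
jobs = plan_bert_jobs(args)
```

Here only test.0.json is selected, and its output path ends in test.0.bert.pt; a file called mydata.json is silently ignored because its name does not contain "test." followed by a shard part.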

lee2015new commented 4 years ago

Hi, you can use the format_to_lines_2() function below to format the data into the json folder without setting MAP_PATH.

import glob
import json
from multiprocessing import Pool
from os.path import join as pjoin


def format_to_lines_2(args):
    # Collect every raw file under raw_path; no url mapping (MAP_PATH) is needed.
    train_files = glob.glob(pjoin(args.raw_path, '*.json'))
    print('train_files:', train_files)
    corpora = {'train': train_files}
    for corpus_type in ['train']:
        a_lst = [(f, args) for f in corpora[corpus_type]]
        pool = Pool(args.n_cpus)
        p_ct = 0
        # _format_to_lines is the existing per-file helper, defined elsewhere.
        for d in pool.imap_unordered(_format_to_lines, a_lst):
            # Write each processed document to its own file:
            # <save_path>/<corpus_type>.<n>.json
            pt_file = "{:s}/{:s}.{:d}.json".format(args.save_path, corpus_type, p_ct)
            with open(pt_file, 'w') as save:
                save.write(json.dumps([d]))
            print('saved:', pt_file)
            p_ct += 1
        pool.close()
        pool.join()