microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

markuplm: prepare_data.py stuck at 1% after writing university-usnews-2000.pickle #593

Open haymant opened 2 years ago

haymant commented 2 years ago

Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): markuplm/SWDE

prepare_data.py gets stuck at 1% after writing university-usnews-2000.pickle.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the example scripts in markuplm to generate the data: first pack the data, then prepare it.
  2. Run:

     python prepare_data.py \
         --input_groundtruth_path ../../../SWDE/sourceCode/groundtruth \
         --input_pickle_path ../../../SWDE/sourceCode/swde.pickle \
         --output_data_path ../../../SWDE/sourceCode/

  3. prepare_data.py gets stuck after printing:
    Loading HTML data: 100%|████████████████████████████████████| 124291/124291 [00:00<00:00, 2682664.98it/s]
    Processing usnews:  51%|███████████████████████                      | 1027/2000 [00:27<00:26, 37.40it/s]
    Vertical: university; Website: usnews; fixed_nodes: 446; variable_nodes: 84/2000 [00:27<00:25, 38.15it/s]
    Writing the processed first 2000 pages of university-usnews into ../../../SWDE/sourceCode/university-usnews-2000.pickle
    Processing swde-data:   1%|▍                                   | 1/80 [1:54:35<150:52:41, 6875.46s/it]
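
To see where the run is actually stuck, one option (not part of the repo's scripts) is Python's built-in faulthandler module, which dumps every thread's stack after a timeout. Since the pool children hang independently of the parent, it would have to be armed at the top of generate_nodes_seq_and_write_to_file as well:

    # Hypothetical diagnostic, not in prepare_data.py: if the process is still
    # running after 10 minutes, dump all thread stacks to stderr, and repeat
    # every 10 minutes thereafter. Arm it in main() and in the worker function
    # so each pool child reports its own stacks.
    import faulthandler

    faulthandler.dump_traceback_later(timeout=600, repeat=True)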

Expected behavior
prepare_data.py should process all 80 vertical/website pairs and finish writing the prepared data without hanging.

CocoZzzzz commented 2 years ago

I ran into the same problem. If you solved it, could you please share the fix? Thanks!

dcvx commented 1 year ago

I worked around it by making main() in prepare_data.py single-threaded instead of using the multiprocessing pool:

def main(_):
    if not os.path.exists(FLAGS.output_data_path):
        os.makedirs(FLAGS.output_data_path)

    # Original multiprocessing version, disabled because the pool hangs:
    #
    # args_list = []
    # vertical_to_websites_map = constants.VERTICAL_WEBSITES
    # for vertical in vertical_to_websites_map:
    #     for website in vertical_to_websites_map[vertical]:
    #         args_list.append((vertical, website))
    #
    # num_cores = int(mp.cpu_count() / 2)
    # with mp.Pool(num_cores) as pool, tqdm(total=len(args_list), desc="Processing swde-data") as t:
    #     for res in pool.imap_unordered(generate_nodes_seq_and_write_to_file, args_list):
    #         t.update()

    # Single-threaded replacement: process each vertical/website pair in turn.
    vertical_to_websites_map = constants.VERTICAL_WEBSITES
    for vertical, websites in vertical_to_websites_map.items():
        for website in websites:
            print(f"start processing generate_nodes_seq_and_write_to_file({vertical}, {website})")
            generate_nodes_seq_and_write_to_file((vertical, website))
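
If you'd rather keep the parallelism, the hang looks like it could be the usual fork-plus-threads deadlock, so switching the pool to the "spawn" start method might also work. This is an untested sketch, not from this thread, reusing the same FLAGS, constants, and worker function as prepare_data.py above:

    import multiprocessing as mp

    from tqdm import tqdm

    def main_parallel(_):
        # Build the (vertical, website) work items, as in the original main().
        args_list = [
            (vertical, website)
            for vertical, websites in constants.VERTICAL_WEBSITES.items()
            for website in websites
        ]
        # "spawn" starts fresh interpreter processes instead of forking the
        # parent, which avoids the deadlocks fork can cause when the parent
        # already holds locks (e.g. from logging or BLAS threads).
        ctx = mp.get_context("spawn")
        with ctx.Pool(max(1, mp.cpu_count() // 2)) as pool, \
                tqdm(total=len(args_list), desc="Processing swde-data") as t:
            for _ in pool.imap_unordered(generate_nodes_seq_and_write_to_file, args_list):
                t.update()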