microsoft / ANCE

A novel embedding training algorithm that leverages ANN search and achieves SOTA retrieval on TREC DL 2019 and OpenQA benchmarks
MIT License

data preprocess and inference #4

zkt12 closed this issue 4 years ago

zkt12 commented 4 years ago

Hi,

I downloaded collectionandqueries.tar.gz, extracted the files to data/msmarco/, and ran

python data/msmarco_data.py --data_dir data/msmarco/ --out_data_dir data/msmarco_preprocessed --model_type rdot_nll --model_name_or_path roberta-base --max_seq_length 512 --data_type 1

The following was returned:

...
Process Process-55:
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home3/zhangkaitao/ANCE/utils/util.py", line 339, in tokenize_to_file
    with open(in_path, 'r', encoding='utf-8') if in_path[-2:] != "gz" else gzip.open(in_path, 'rt', encoding='utf8') as in_f,\
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco/queries.train.shuf.tsv'
start merging splits
Traceback (most recent call last):
  File "data/msmarco_data.py", line 436, in <module>
    main()
  File "data/msmarco_data.py", line 432, in main
    preprocess(args)
  File "data/msmarco_data.py", line 212, in preprocess
    "train-qrel.tsv")
  File "data/msmarco_data.py", line 66, in write_query_rel
    out_query_path, 32, 8 + 4 + args.max_query_length * 4):
  File "/home3/zhangkaitao/ANCE/utils/util.py", line 246, in numbered_byte_file_generator
    with open('{}_split{}'.format(base_path, i), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco_preprocessed/train-query_split0'

Then I downloaded the passage_ance_firstP_checkpoint, changed the path in run_ann_data_gen.sh, and ran

sh run_ann_data_gen.sh

and got this:

07/10/2020 21:56:35 - WARNING - __main__ -   Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
07/10/2020 21:56:35 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True
07/10/2020 21:56:35 - INFO - __main__ -   starting output number 0
07/10/2020 21:56:35 - INFO - __main__ -   Loading query_2_pos_docid
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangkaitao/.conda/envs/new/bin/python', '-u', '../drivers/run_ann_data_gen.py', '--local_rank=3', '--training_dir', '../data/msmarco/OSPass512/', '--init_model_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--model_type', 'rdot_nll', '--output_dir', '../data/msmarco/OSPass512/ann_data/', '--cache_dir', '../data/msmarco/OSPass512/ann_data/cache/', '--data_dir', '$../data/msmarco_preprocessed/', '--max_seq_length', '512', '--per_gpu_eval_batch_size', '16', '--topk_training', '200', '--negative_sample', '20']' returned non-zero exit status 1.

Then I moved qrels.train.tsv from data/msmarco/ to the preprocessed folder and renamed it to train-qrel.tsv, but that didn't help. Finally, I tried

sh run_inference.sh

and got this:

07/10/2020 22:07:51 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True
07/10/2020 22:07:51 - INFO - __main__ -   starting output number 0
07/10/2020 22:07:51 - INFO - __main__ -   Loading query_2_pos_docid
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangkaitao/.conda/envs/new/bin/python', '-u', '../drivers/run_ann_data_gen.py', '--local_rank=3', '--training_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--init_model_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--model_type', 'rdot_nll', '--output_dir', '../data/msmarco/OSPass512/ann_data_inf/', '--cache_dir', '../data/msmarco/OSPass512/ann_data_inf/cache/', '--data_dir', '$../data/msmarco_preprocessed/', '--max_seq_length', '512', '--per_gpu_eval_batch_size', '16', '--topk_training', '200', '--negative_sample', '20', '--end_output_num', '0', '--inference']' returned non-zero exit status 1.

Could you help me with this? Thank you:)

jialliu commented 4 years ago

Could you change the file name in msmarco_data.py from queries.train.shuf.tsv to queries.train.tsv? They contain the same queries; we just shuffled them. Alternatively, you can rename the downloaded file to queries.train.shuf.tsv.

After you have generated the data, try inference again.
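
For the second option, a minimal sketch is shown below. It assumes the extracted files are in data/msmarco/ as in the command above. The helper name ensure_shuffled_queries is hypothetical and not part of the ANCE repo. It copies the file instead of renaming it, so the original download stays in place.

```python
import shutil
from pathlib import Path


def ensure_shuffled_queries(data_dir: str) -> Path:
    """Make queries.train.shuf.tsv available in data_dir.

    Hypothetical helper (not from the ANCE codebase): copies the
    downloaded queries.train.tsv to the file name that
    msmarco_data.py looks for.
    """
    src = Path(data_dir) / "queries.train.tsv"
    dst = Path(data_dir) / "queries.train.shuf.tsv"

    # Nothing to do if the expected file is already there.
    if dst.exists():
        return dst

    # Fail early with a clear message if the download is missing too.
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; extract collectionandqueries.tar.gz first")

    shutil.copyfile(src, dst)
    return dst
```

Calling ensure_shuffled_queries("data/msmarco/") once before re-running data/msmarco_data.py should be enough. The copy is not shuffled, which should be fine given that the two files contain the same queries.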

zkt12 commented 4 years ago

Thank you, it works.