google-research-datasets / natural-questions

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
Apache License 2.0

Error when loading gold file #4

Closed by valsworthen 5 years ago

valsworthen commented 5 years ago

Hello,

I am trying to generate predictions from the sample of the development set provided here. However, the script fails upon loading the gold data.

Input: python natural_questions/make_test_data.py --gold_path sample/v1.0_sample_nq-dev-sample.jsonl.gz --output_path whatever

Error:

I0201 17:10:07.177878 140447407568704 eval_utils.py:261] parsing sample/v1.0_sample_nq-dev-sample.jsonl.gz ..... 
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "XXX/natural-questions/eval_utils.py", line 264, in read_annotation_from_one_split
    for line in input_file:
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "XXX/miniconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 406, in _read_gzip_header
    magic = self._fp.read(2)
  File "XXX/miniconda3/lib/python3.7/gzip.py", line 91, in read
    self.file.read(size-self._length+read)
  File "XXX/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "natural-questions/make_test_data.py", line 120, in <module>
    app.run(main)
  File "XXX/miniconda3/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "XXX/miniconda3/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "natural-questions/make_test_data.py", line 53, in main
    n_threads=FLAGS.num_threads)
  File "XXX/natural-questions/eval_utils.py", line 303, in read_annotation
    dict_list = pool.map(read_annotation_from_one_split, input_paths)
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 290, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "XXX/miniconda3/lib/python3.7/multiprocessing/pool.py", line 683, in get
    raise self._value

Specifically, it fails at line 264 of eval_utils.py, when iterating over the gold file.

Can you help me solve this issue?

Thanks a lot!

bozheng-hit commented 5 years ago

I also have the same problem. The following operations may work:

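  1. Replace every iteritems() call with items() in nq_eval.py (iteritems() does not exist on Python 3 dicts).
  2. Change open(gzipped_input_file) to open(gzipped_input_file, "rb") in eval_utils.py, so the gzip data is read as bytes instead of being decoded as text.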
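
For context, byte 0x8b in the error message is the second byte of the gzip magic number (0x1f 0x8b). In Python 3, a file opened in text mode is decoded as UTF-8 before gzip ever sees it, which is why the read fails at position 1. The sketch below reproduces both fixes on a throwaway file. It is only an illustration, not the repository's code: the helper names (write_sample, read_jsonl_gz, text_mode_fails, group_by_id) and the sample records are made up for this example.

```python
import gzip
import json
import os
import tempfile


def write_sample(path, records):
    """Write records as gzipped JSON lines (hypothetical helper for the demo)."""
    with gzip.open(path, "wt", encoding="utf-8") as output_file:
        for record in records:
            output_file.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read a gzipped JSON-lines file, opening the underlying file in binary mode."""
    records = []
    # "rb" matters on Python 3: gzip needs raw bytes, not decoded text.
    with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw) as input_file:
        for line in input_file:
            records.append(json.loads(line.decode("utf-8")))
    return records


def text_mode_fails(path):
    """Return True if wrapping a text-mode handle in GzipFile raises UnicodeDecodeError."""
    try:
        with open(path, encoding="utf-8") as raw, gzip.GzipFile(fileobj=raw) as input_file:
            input_file.readline()
    except UnicodeDecodeError:
        return True
    return False


def group_by_id(records):
    """Index records by example_id, iterating with items() rather than iteritems()."""
    by_id = {record["example_id"]: record for record in records}
    # dict.items() works on Python 3; dict.iteritems() was removed.
    return {example_id: record["question"] for example_id, record in by_id.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample.jsonl.gz")
        write_sample(sample_path, [{"example_id": 1, "question": "who wrote hamlet"}])
        print(text_mode_fails(sample_path))
        print(group_by_id(read_jsonl_gz(sample_path)))
```

Opening the raw handle in a with statement also closes it when reading finishes, which GzipFile does not do for a handle passed through fileobj.
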

valsworthen commented 5 years ago

Indeed this fixes the problem, thanks.